Large Language Model (LLM)-Driven Document Clustering: Improving Real-Time Security Intelligence Extraction and Threat Analysis

DOI: 10.1109/ISI65680.2025.11201090

88 Citations • 2025

Patrick Serrano, Luwen Huangfu, Chunhua Liao

2025 IEEE International Conference on Intelligence and Security Informatics (ISI)

This paper introduces a novel Large Language Model (LLM)-based document clustering process that generates relevant keywords from the content of the documents and uses LLMs to cluster datasets of intelligence information spanning different topics. The results indicate that the process aids in clustering unlabeled, unstructured textual data and illustrate the potential of LLMs for improving intelligence document clustering practices.

Abstract

The sheer volume and rapid expansion of unstructured text in various fields make it challenging for cybersecurity practitioners and researchers to extract essential information from documents. Inefficient clustering can hinder the timely extraction of intelligence, especially in areas such as network security, where real-time requirements are high. Currently, the use of descriptive tags for information clustering, although helpful, is not standardized within a given domain, causing useful information from other domains with different descriptive tags to be neglected. To address these issues, we introduce a novel Large Language Model (LLM)-based document clustering process that (1) generates a series of relevant keywords based on the content of the documents, and (2) leverages LLMs to cluster various datasets of intelligence information spanning different topics. We measure the quality of the clusters using several established indices: Overall Density, Overall Distinctiveness, Coherence and Overlap coefficients, and Label Entropy. Based on these evaluation metrics, the proposed approach performs better than traditional methods on multiple datasets in the field of Advanced Persistent Threat (APT). This demonstrates that it aids in the clustering of unlabeled and unstructured textual data, and illustrates the potential for improving intelligence document clustering practices through the use of LLMs in cybersecurity and other fields.
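
The abstract does not give the prompts, models, or metric formulas, so the following Python sketch is only an illustration of the general shape of a keyword-then-cluster pipeline, not the authors' method. It assumes the per-document keyword sets have already been produced (by an LLM in the paper; hand-written here), swaps the LLM clustering step for a simple greedy grouping by keyword overlap, and uses the textbook Szymkiewicz-Simpson overlap coefficient and Shannon entropy as stand-ins for the paper's Overlap coefficient and Label Entropy. The function names, the 0.5 threshold, and the sample report identifiers are all hypothetical.

```python
from collections import Counter
from math import log2


def overlap_coefficient(a: set[str], b: set[str]) -> float:
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the label distribution inside one cluster.

    Assumed definition: 0.0 means every document in the cluster shares a label.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def cluster_by_keywords(
    doc_keywords: dict[str, set[str]], threshold: float = 0.5
) -> list[list[str]]:
    """Greedy single-pass grouping (a stand-in for the LLM clustering step).

    Each document joins the first cluster whose pooled keywords overlap with
    its own keywords by at least `threshold`; otherwise it starts a new cluster.
    """
    clusters: list[tuple[set[str], list[str]]] = []
    for doc_id, keywords in doc_keywords.items():
        for pooled, members in clusters:
            if overlap_coefficient(pooled, keywords) >= threshold:
                pooled |= keywords  # in-place update of the cluster's keyword pool
                members.append(doc_id)
                break
        else:
            clusters.append((set(keywords), [doc_id]))
    return [members for _, members in clusters]


# Hypothetical keyword sets, standing in for LLM-generated keywords.
doc_keywords = {
    "report_a": {"apt29", "phishing", "credential theft"},
    "report_b": {"apt29", "phishing", "spearphishing"},
    "report_c": {"ransomware", "encryption", "extortion"},
}
clusters = cluster_by_keywords(doc_keywords)
```

With these sample inputs, the two reports that share APT-related keywords fall into one cluster and the ransomware report forms its own. A lower label entropy within a cluster would then suggest that the grouping agrees more closely with any available ground-truth labels.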