A multi-task information extraction Chinese dataset for APT cyber threat intelligence - Nature
Abstract
Advanced Persistent Threats (APTs) are characterized by persistence and complex attack chains. Information extraction techniques enable the identification of critical knowledge from unstructured Cyber Threat Intelligence (CTI), improving the detection of APT attacks. At present, high-quality Chinese information extraction datasets for APT scenarios remain scarce, particularly those covering multiple tasks such as entity, relation, and event extraction. This shortage limits the training and performance improvement of detection models. To address this issue, a multi-task Chinese information extraction dataset for APT Cyber Threat Intelligence is proposed. The dataset complies with the STIX 2.1 standard and is derived from 116 CTI reports. It covers three tasks: entity, relation, and event extraction. Specifically, it includes 2,574 entities, 1,506 relations, and 139 event instances across 808 sentences. Compared with existing APT threat intelligence datasets, our dataset offers significant advantages in task coverage, annotation granularity, and structural hierarchy. The dataset is further validated using several baseline models. This work provides strong support for APT intelligence modeling and cybersecurity research.
Background & Summary
An Advanced Persistent Threat (APT) is a sophisticated, long-term cyber attack orchestrated by well-funded and organized threat actors1. Unlike opportunistic attacks, APTs exhibit three core characteristics: they are advanced, involving diverse and stealthy techniques to evade detection; persistent, maintaining control through prolonged system infiltration; and a threat, aiming to steal critical sensitive information and inflict significant damage2. On February 11, 360 Digital Security Group released the 2024 Global Advanced Persistent Threat Research Report, which reported over 1,300 APT incidents targeting China throughout the year, severely affecting 14 key domestic industry sectors3. Providing high-quality cyber threat intelligence (CTI) to decision-makers is critical when formulating defense strategies against APT groups. Such intelligence not only helps security analysts gain deeper insights into attackers’ motivations, techniques, and objectives, but also significantly improves incident response speed, enabling defenders to proactively address potential risks4.
In recent years, Chinese cybersecurity companies have developed independent and robust capabilities in detecting and tracking APT attacks5. Most domestic security vendors, including ThreatBook, Tencent Security, Qi An Xin, and the 360 Threat Intelligence Center, have established proprietary and innovative CTI sharing platforms. Leveraging strong technical expertise and extensive data resources, these companies provide capabilities such as early warning of APT campaigns, attribution analysis, deconstruction of attack techniques, reconstruction of attack chains, and deployment of subsequent defense measures. As illustrated in Fig. 1, the 360 Threat Intelligence Center produces reports that cover organizational background, attack activity analysis, attribution analysis, proposed measures, and attached IOCs. When APT incidents affect domestic victims, these vendors deliver more comprehensive and in-depth analyses than their international counterparts. These capabilities underscore the necessity of Chinese CTI research. At present, corpora of Chinese CTI specific to APTs remain scarce. Compared with English CTI, Chinese CTI resources are limited in quantity, and the lack of open-source datasets further constrains research progress in this domain. In addition, APT domain text differs significantly from general-domain Chinese text, as it contains a large number of domain-specific terms that are rarely covered in general pre-training corpora. Furthermore, APT text often involves a mixture of Chinese and English, posing additional challenges for learning models in semantic understanding and representation. Therefore, developing and constructing high-quality Chinese APT threat intelligence corpora is of significant importance for advancing both academic research and technological development in this field.
Fig. 1 Overview of cyber threat intelligence on a specific APT group.
CTI is typically shared in the form of unstructured text and exists in large volumes. At present, it is largely analyzed manually by experts, highlighting the urgent need for automated techniques to extract information from unstructured or semi-structured CTI. Consequently, cybersecurity researchers are dedicated to developing high-precision, end-to-end information extraction models to extract key information from unstructured text6. Early studies primarily relied on traditional machine learning methods. Mulwad et al.7 developed a system based on an SVM classifier to extract vulnerabilities, attack techniques, and threat concepts from web text, but its recognition capability was limited. Shafiq et al.8 proposed the CorrAUC method, which detects malicious network traffic and extracts key information through feature selection and classification models. Joshi et al.9 employed a Conditional Random Field (CRF) model to identify security entities such as software products and operating systems. With the development of deep learning, CTI extraction methods have achieved significant progress. Gyeongmin Kim et al.10 proposed a BiLSTM-CRF model that automatically extracts entities such as malware families and IP addresses. Zhen Zhen et al.11 designed a model for Chinese CTI tasks that integrates RoBERTa-wwm, Residual Dilated Convolutional Neural Networks (RDCNN), and CRF, effectively improving entity extraction performance. The CTI View system12 integrates BERT-GRU-BiLSTM-CRF, regular expressions, and the ATT&CK framework to enable automated analysis of APT-related text. The Vulcan system13 employs BERT-BiLSTM-CRF for entity extraction and uses a BERT model for relation extraction, with special markers and type masks to optimize outputs. In recent years, generative information extraction methods have gained increasing attention. For example, the DEGREE14 model reformulates event extraction as a conditional text generation task. 
It uses structured prompts to jointly extract triggers and their arguments. Although these methods have achieved some progress in CTI information extraction, high-quality datasets are still lacking in the cybersecurity domain.
Information extraction primarily consists of three subtasks: entity extraction, relation extraction, and event extraction15. Entity extraction identifies semantically meaningful entities from text, such as threat actor, industry, and malware. Relation extraction reveals semantic associations between entities, such as the link between a threat actor and the tools it employs, which can assist in tracing attack paths. Event extraction aims to extract events and their related elements from text. It includes two core stages: trigger identification and argument role extraction. In recent years, several studies have attempted to construct cybersecurity information extraction datasets, some of which include APT-related content. For entity extraction, researchers have developed several corpora specifically for the cybersecurity domain. For example, the DNRTI dataset proposed by Wang et al.16 covers 13 entity types, such as hacker groups and tools, and serves as an important benchmark for entity extraction research. The APTNER17 dataset complies with the Structured Threat Information Expression (STIX) 2.1 standard, contains 21 entity types, and is currently the largest dataset in this domain. Liu et al.5 introduced the Chinese APTTOOLNER dataset, which covers seven types of APT entities and is annotated through multiple expert review rounds to ensure quality; however, the dataset has not been made publicly available. For relation extraction, the objective is to identify semantic relationships between entities. Luo et al.18 constructed a knowledge base containing 60 relation types based on the STIX 2.0 standard, and, through distant supervision and manual verification, produced a high-quality dataset comprising 18 unique relation types. The TreatRE dataset released by Yang et al.19 annotates 12 categories of threat relationships. The CDTier dataset, developed by Zhou et al.20, is the first Chinese CTI dataset. 
It contains 100 CTI reports and covers both entity and relation extraction tasks, significantly advancing the development of Chinese threat intelligence resources. In addition, the CSKG4APT knowledge graph proposed by Ren et al.21 demonstrates the extraction and structuring of relation data. For event extraction, the goal is to identify security events and the roles involved. Satyapanich et al.22 proposed the CASIE system, which defines five typical categories of cybersecurity events: Attack.Databreach, Attack.Phishing, Attack.Ransom, Discover.Vulnerability, and Patch.Vulnerability. Xiang et al.23 developed an extraction approach based on BERT-BiGRU-CRF for Chinese APT events and constructed a Chinese APT event dataset. However, existing studies mainly focus on a single task, such as entity extraction, relation extraction, or event extraction, or only cover part of these tasks. There is still a lack of unified datasets that can support multiple information extraction tasks at the same time.
To address these challenges, this study constructs a multi-task information extraction Chinese dataset for APT Cyber Threat Intelligence—APTIE24. The dataset provides comprehensive data support for APT intelligence modeling, facilitates research on multi-task collaboration, and enhances both the overall performance and practical effectiveness of information extraction. The main contributions of this work are as follows:
(1) We construct the first multi-task information extraction dataset for Chinese APT intelligence scenarios. The dataset is derived from 116 CTI reports. It covers three tasks: entity, relation, and event extraction. Specifically, it includes 2,574 entities, 1,506 relations, and 139 event instances across 808 sentences. Compared to existing datasets, APTIE fills the gap in Chinese-language resources for multi-task information extraction in APT intelligence scenarios.
(2) Based on the STIX 2.1 standard and the characteristics of APT attacks, we systematically define the entity types and relation categories, establish a unified annotation guideline, and perform manual annotations. The annotation process incorporates iterative refinements informed by baseline model results to ensure the validity, consistency, and relative balance among categories. APTIE provides essential data support for the analysis of diverse threat elements, and promotes the application of natural language processing (NLP) techniques in cybersecurity research and practice.
Methods
Data collection
To support information extraction tasks for APT threat intelligence in the cybersecurity domain, a multi-task dataset annotated with entities, relations, and events is required. CTI can be obtained from various sources, including technical forums, blogs, news media, government reports, cybersecurity companies, and in-house security teams4. In this study, data were collected from the official website of the 360 Advanced Threat Intelligence Center (https://apt.360.net/timeline) and its official WeChat account, “360 Threat Intelligence Center”. These platforms regularly publish high-quality threat intelligence reports, and all content is publicly accessible. The collected reports span from March 25, 2019 to February 19, 2025. Reports from 2019 to 2022 are obtained from the official website, while reports after 2022 are sourced from the official WeChat account. The dataset covers 35 APT groups. Since the threat intelligence reports are compiled and written by the same team, consistency and reliability are well maintained. This helps ensure the accuracy of the information. It also provides semantic advantages. Specifically, terminology and contextual expressions are more standardized, which reduces ambiguity and inconsistency. This facilitates subsequent data processing and annotation. We developed custom code modules to segment the text into individual sentences, enabling subsequent language processing and analysis. CTI reports often contain long and syntactically complex sentences. To preserve the linguistic characteristics of real-world CTI texts, our dataset construction process keeps these sentences in their original form. We do not remove, simplify, or shorten them during preprocessing. Finally, we extracted 808 unique sentences from the 116 reports, totaling 52,631 characters.
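The segmentation step can be sketched as follows. The authors' custom code modules are not shown in the text, so this is only a minimal illustration under stated assumptions: sentences are assumed to end at Chinese or Western terminal punctuation (。！？!?), the ASCII period is deliberately not treated as a boundary because CTI text is full of version numbers, domains, and IP addresses, and the helper names `split_sentences` and `unique_in_order` are hypothetical. Sentences are kept whole, without simplification or truncation, as described above.

```python
import re

# Assumption: a sentence ends at 。！？!? optionally followed by closing quotes
# or brackets. The real segmentation rules used for APTIE are not published here.
_SENT_END = re.compile(r'([。！？!?]+[”’"」』）)]*)')


def split_sentences(text: str) -> list[str]:
    """Split report text into sentences, keeping terminal punctuation."""
    # With one capture group, re.split yields [body, delim, body, delim, ..., tail].
    parts = _SENT_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    tail = parts[-1].strip()  # text after the last delimiter, if any
    if tail:
        sentences.append(tail)
    return sentences


def unique_in_order(sentences: list[str]) -> list[str]:
    """Drop duplicate sentences while preserving first-seen order."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique


# Made-up example text ("The group uses spear-phishing emails. The sample
# version is 1.2.3! The attack is ongoing").
example = "该组织使用鱼叉邮件攻击。样本版本为1.2.3！攻击仍在继续"
for s in unique_in_order(split_sentences(example)):
    print(len(s), s)
```

Splitting on a captured delimiter, rather than deleting it, keeps each sentence's punctuation, so character counts computed afterwards reflect the original text.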
Sentence-level annotation
To ensure annotation consistency and clear evaluation granularity, each sentence is treated as an independent annotation unit. In the dataset, the longest sentence contains 243 characters, the shortest contains 13 characters, and the average sentence length is 65.14 characters. As shown in Fig. 2, most sentences fall within the 0–100 character range, and relatively few sentences exceed 100 characters. Although a small number of longer sentences exist, the dataset does not include paragraph-level or document-level long texts. Therefore, all information extraction tasks in this study are defined at the sentence level. Entities, relations, and events are identified within sentence boundaries. Cross-sentence aggregation and document-level reasoning are beyond the scope of this dataset and the baseline framework.
Fig. 2 Distribution of sentence lengths in the dataset.
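The length statistics reported above are simple to recompute from a list of sentences. The sketch below is an illustration rather than the authors' script; the function names and the 50-character bin width are assumptions. As a consistency check on the reported figures, 52,631 characters over 808 sentences gives 52,631 / 808 ≈ 65.14, which matches the stated average.

```python
from collections import Counter
from statistics import mean


def length_stats(sentences):
    """Return count, total characters, and min / max / mean sentence length."""
    lengths = [len(s) for s in sentences]
    return {
        "count": len(lengths),
        "total_chars": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 2),
    }


def length_histogram(sentences, bin_width=50):
    """Count sentences per length bin; keys are the lower bound of each bin."""
    bins = Counter((len(s) // bin_width) * bin_width for s in sentences)
    return dict(sorted(bins.items()))


# Consistency check using only the totals reported in the text.
reported_mean = round(52631 / 808, 2)
print(reported_mean)
```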
Definition of entities, relations, and events
We first predefine the types of entities, relations, and events. These definitions must capture the key characteristics and behavioral elements of APT attacks while aligning, as much as possible, with common threat intelligence modeling standards to ensure the usability and interoperability of subsequent extraction results. To this end, we adopt the STIX25 standard as a reference for modeling CTI. STIX is a language and serialization format used to exchange CTI. STIX enables organizations to share CTI in a consistent and machine-readable manner. STIX is a connected graph composed of nodes and edges: the nodes are STIX Domain Objects(SDOs) and STIX Cyber-observable Objects(SCOs), and the edges are STIX Relationship Objects(SROs). SDOs are core object types in STIX that represent various entities and concepts in the CTI domain. They represent key elements of threat intelligence, such as threat actor, campaign, and malware. SCOs are used to describe directly observable events, states, or characteristics in a cyber environment, including types such as account, artifact, directory, and file. SROs are used to describe the relationships between SDOs and SCOs. The STIX standard has evolved from its initial STIX 1.x versions to the current STIX 2.1, encompassing a total of 18 SDOs. In practice, CTI cases that strictly adhere to the standard specifications are relatively rare. Based on the STIX 2.1 standard, we define six entity types, whose mappings to SDOs are shown in Table 1. Considering the characteristics of the collected Chinese threat intelligence, we additionally introduce “Social Medium” as a new entity type. The detailed definitions of the seven entity types are provided in Table 2. 
The seven entity types we propose are derived from the core object definitions of the STIX 2.1 standard and are further adapted to the linguistic features and expression patterns of Chinese threat intelligence corpora, enabling precise representation of key information elements in APT attacks. Six of these entity types directly correspond to the primary SDOs in STIX, enabling a complete description of the full attack chain from the attacker to the attack methods and targets. The additional “Social Medium” entity addresses the limitations of STIX in modeling interpersonal communication media—such as social platforms and instant messaging tools—in Chinese threat intelligence scenarios, thereby providing a more accurate representation of contexts like social engineering attacks and targeted phishing.
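To make the node-and-edge structure concrete, the sketch below builds two SDOs and one SRO as plain Python dictionaries using the common STIX 2.1 properties (`type`, `spec_version`, `id`, `created`, `modified`) and the relationship properties (`relationship_type`, `source_ref`, `target_ref`). It does not use any STIX library, the object names are placeholders, and it is not taken from the APTIE release. "Social Medium" is left out because it has no SDO counterpart and the text does not say how it would be serialized.

```python
import json
import uuid
from datetime import datetime, timezone


def _timestamp():
    """UTC timestamp with millisecond precision, e.g. 2025-01-01T00:00:00.000Z."""
    now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def make_sdo(sdo_type, name, **extra):
    """Build a minimal STIX Domain Object (a graph node)."""
    ts = _timestamp()
    obj = {
        "type": sdo_type,
        "spec_version": "2.1",
        "id": f"{sdo_type}--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
        "name": name,
    }
    obj.update(extra)
    return obj


def make_sro(source, relationship_type, target):
    """Build a STIX Relationship Object (a graph edge) between two objects."""
    ts = _timestamp()
    return {
        "type": "relationship",
        "spec_version": "2.1",
        "id": f"relationship--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
        "relationship_type": relationship_type,
        "source_ref": source["id"],
        "target_ref": target["id"],
    }


# Placeholder names; they do not refer to real groups or malware.
actor = make_sdo("threat-actor", "ExampleGroup")
malware = make_sdo("malware", "ExampleLoader", is_family=False)
uses = make_sro(actor, "uses", malware)

bundle = {
    "type": "bundle",
    "id": f"bundle--{uuid.uuid4()}",
    "objects": [actor, malware, uses],
}
print(json.dumps(bundle, indent=2, ensure_ascii=False))
```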
Table 1 Mapping between APTIE and SDO.
Table 2 Overview of predefined entity types.
The existing definitions of entity types form relatively isolated nodes that contain only descriptive information about the characteristics of APT organizations, lacking semantic associations between nodes. The absence of structured semantic relations hinders the semantic-level retrieval and reasoning analysis of APT organizations. To address this issue, we summarize and propose a set of semantic relations tailored for knowledge modeling of APT organizations, aimed at enhancing the connectivity and inferential ability among knowledge elements. In the STIX 2.1 framework, relation types are designed to capture the logical associations between different STIX objects, such as “uses”, “targets”, and “originates-from”. These relations cover semantic links among the 18 SDOs, including threat actor, attack pattern, malware, and tool. By explicitly defining the semantic links between objects, these relation types enable the connectivity of the object network. Inspired by the STIX framework, this study predefines six relation types, each with several subcategories, as shown in Table 3.
Table 3 Overview of predefined relation types.
In modeling and analyzing APT events, each event is typically characterized by its type and the roles of participating entities. The event type specifies the category of the incident, such as phishing, data leakage, or ransomware attacks. The roles represent the participating entities, including attackers, victims, and tools. Following the attack-stage framework proposed by Yang et al.26, the types of APT events were systematically organized according to the attack lifecycle. Based on this framework, representative keywords were selected for each event type, and a corresponding event-specific keyword dictionary was constructed. Subsequently, a statistical approach was applied to match and analyze the frequency of these keywords in threat intelligence texts, which enabled the identification of frequently occurring and representative event types. Finally, five major types of APT events were identified. For each type, common argument roles—such as Attack Group, Malware, and Attack Tool—were extracted using predefined entities. Together, these elements formed a structured representation of APT events, with event types and roles summarized in Table 4.
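The keyword-matching step described above can be sketched as a frequency count over an event-specific keyword dictionary. The dictionary below is hypothetical: the event-type keys and the Chinese keywords are illustrative stand-ins, not the authors' actual dictionary, and plain substring counting is an assumption about how matching was done.

```python
from collections import Counter

# Hypothetical keyword dictionary: event type -> representative keywords.
# 钓鱼 = phishing, 投递 / 下发 = deliver / drop, 恶意文档 / 诱饵文档 = malicious / decoy document.
EVENT_KEYWORDS = {
    "phishing": ["钓鱼"],
    "malware_delivery": ["投递", "下发"],
    "malicious_document": ["恶意文档", "诱饵文档"],
}


def count_event_keywords(sentences, keyword_dict):
    """Count keyword occurrences per event type across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for event_type, keywords in keyword_dict.items():
            counts[event_type] += sum(sentence.count(kw) for kw in keywords)
    return counts


# Made-up sentences for illustration only.
sample = [
    "攻击者通过钓鱼邮件投递恶意文档",
    "诱饵文档打开后下发后续载荷",
]
keyword_counts = count_event_keywords(sample, EVENT_KEYWORDS)
for event_type, freq in keyword_counts.most_common():
    print(event_type, freq)
```

Event types whose keywords occur most often across the corpus would then be the candidates retained as representative types.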
Table 4 APT event types and argument roles.
In summary, to construct a high-quality and reliable dataset for APT threat intelligence, a systematic ontology was proposed that defines entities, relations, and events centered on the key elements of APT attacks. This ontology is designed to capture the behavioral characteristics of APTs in Chinese contexts while remaining aligned with the STIX 2.1 standard wherever a corresponding object exists, thereby enabling structured and standardized representations of threat intelligence. This design not only provides a well-defined scope for information extraction tasks but also establishes the theoretical foundation and annotation guidelines for subsequent data labeling and model training.
Annotation tools and methods
This study followed a pipeline process to perform three types of annotation tasks. Among them, entity annotation served as the foundation for relation and event annotation, and thus required particular attention. In general domains, the semantics of entities are usually well-defined and easy to interpret. However, in the cybersecurity domain, annotators may still encounter confusion over certain concepts during annotation, even though the predefined entity labels themselves are clearly defined. To improve annotation quality, a closed-loop mechanism was adopted, consisting of pre-annotation, collective review, rule consolidation, experimental validation, and rule refinement. Specifically, 100 samples were first pre-annotated, followed by a focused discussion with the annotation team. The annotation team, consisting of four graduate students, reviewed issues encountered during annotation through group meetings and clarified the key aspects that required attention. Based on this discussion, an initial annotation guideline was established, and formal annotation was carried out. During the formal annotation stage, experiments were conducted in parallel to validate the reliability of the dataset. The annotation rules were iteratively refined based on experimental results to improve consistency and accuracy. Compared with traditional one-off rule definition, the progressive, iterative approach adopted in this study offers clear advantages in specialized domains. The key considerations for entity annotation are summarized as follows:
1) Annotators often confuse social media with tools. A social medium is defined as software disguised as a legitimate application. Such media lure users into execution by imitating the interface or functionality of legitimate software (e.g., video players, office tools). In essence, they serve as the physical medium of a social engineering attack. A tool is defined as legitimate software that contains embedded malicious code. Such tools are originally fully functional commercial or open-source software, but become part of the attack chain due to the exploitation of vulnerabilities or supply chain compromise.
2) To ensure concise and focused annotation, non-essential information is not emphasized. Therefore, suffixes such as “region,” “organization,” or “institution” are not annotated.
3) The same type of entity should be annotated consistently across different samples. Potential inconsistencies are identified and corrected through self-checking and cross-validation.
4) Overlapping annotations are not allowed. Annotators must determine, based on the definitions, whether the text should be treated as a single entity or divided into multiple sub-entities.
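The no-overlap rule in item 4 lends itself to an automatic check during self-checking and cross-validation. The sketch below is one possible implementation, not a tool described by the authors. It assumes spans are `(start, end, label)` tuples with an exclusive end offset, and the label names are placeholders.

```python
def find_overlaps(spans):
    """Return (earlier_span, later_span) pairs whose character ranges overlap.

    spans: iterable of (start, end, label) with `end` exclusive.
    Each span is compared against the furthest-reaching span seen so far,
    which is enough to flag every span that overlaps any earlier one.
    """
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    overlaps = []
    furthest = None
    for span in ordered:
        if furthest is not None and span[0] < furthest[1]:
            overlaps.append((furthest, span))
        if furthest is None or span[1] > furthest[1]:
            furthest = span
    return overlaps


# Touching spans (end == next start) are not overlaps.
clean = [(0, 3, "GROUP"), (3, 5, "TOOL")]
# A long span that swallows two shorter ones violates the rule twice.
nested = [(0, 10, "GROUP"), (1, 2, "TOOL"), (3, 4, "MALWARE")]

print(find_overlaps(clean))
print(find_overlaps(nested))
```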
We used the open-source annotation tool Doccano27 to manually annotate entities, relations, and events in sentences. Compared with similar annotation tools, Doccano supports multi-task annotation, provides compatibility with Chinese text, offers easy deployment, and benefits from an active community. Other tools often suffer from limitations such as supporting only a single task, poor compatibility with Chinese, cumbersome installation and maintenance, or infrequent updates. These drawbacks make them inadequate for our requirement of multi-task annotation in Chinese. Figure 3 illustrates an example of entity and relation annotation using this tool. In the tool interface, different types of entities are highlighted in distinct colors with their labels displayed below each entity. Relations between entities are shown as connecting lines, with the relation type annotated on the line. During event-type annotation, we followed a similar procedure to entity annotation. The event type was labeled below the entity, while all other entity-type information, except for argument roles, was removed.
Fig. 3 Annotation example.
To meet the input requirements of downstream models, the annotated data were converted into the required formats. For the entity extraction task, the JSONL-format data generated by Doccano was converted into standard BIO-format text files to ensure compatibility with mainstream sequence labeling models. Figure 4 presents a comparison between the original JSONL format and the converted BIO format. For the event extraction task, we followed the annotation scheme of the DuEE28 dataset, which was released by Baidu in 2020. DuEE is a large-scale Chinese event extraction dataset. Its annotation scheme facilitates the flexible definition of diverse event types and their corresponding argument roles.
Fig. 4 Transformation of annotated data format.
Quality assessment
During entity annotation, subjective differences among annotators in defining entity types may lead to conflicts in the annotation results. To ensure annotation quality, once a certain number of samples had been annotated, strict manual validation was conducted based on the principle of majority voting. This process corrected potential errors and aligned annotators’ understanding, thereby improving the accuracy and consistency of the annotations. In addition, Fleiss' Kappa coefficient was employed as the key metric to measure inter-annotator agreement. The Kappa coefficient has a maximum value of 1, which indicates complete agreement; values at or below 0 indicate agreement no better than chance, and higher values indicate greater consistency among the annotations. According to prior studies, a Kappa value above 0.8 is generally considered to represent near-perfect agreement among annotators.
$$\kappa=\frac{p_{o}-p_{e}}{1-p_{e}}$$ (1)
$$p_{o}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n\left(n-1\right)}\left[\left(\sum_{j=1}^{k}n_{ij}^{2}\right)-n\right]$$ (2)
$$p_{e}=\sum_{j=1}^{k}\left(\frac{1}{Nn}\sum_{i=1}^{N}n_{ij}\right)^{2}$$ (3)
Let the total number of samples be $N$, with each sample annotated by $n$ annotators, and the total number of entity categories be $k$. Here, $n_{ij}$ denotes the number of annotations assigning the $i$-th sample to the $j$-th category. The observed agreement probability $p_{o}$ is calculated as the average concentration of annotations for each sample, reflecting the overall agreement among annotators. The expected agreement probability $p_{e}$ is computed based on the frequency of each category across all samples, representing the probability of agreement under purely random conditions.
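Equations (1) to (3) translate directly into a few lines of code. The sketch below follows the definitions above, with `counts[i][j]` holding the number of annotators who assigned sample i to category j. The function name and the toy inputs are illustrative, with four annotators per sample to match the size of the annotation team. The second toy input shows that the coefficient becomes negative when annotators agree less often than chance would predict.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for an N x k table of per-sample category counts.

    counts[i][j] = number of annotators assigning sample i to category j.
    Every row must sum to the same number of annotators n (n >= 2).
    """
    N = len(counts)
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("each sample needs the same number (>= 2) of annotations")
    k = len(counts[0])

    # Eq. (2): observed agreement, averaged over samples.
    p_o = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N

    # Eq. (3): chance agreement from the overall category proportions.
    proportions = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:
        raise ValueError("Kappa is undefined when only one category is ever used")
    # Eq. (1)
    return (p_o - p_e) / (1 - p_e)


# Four annotators, two categories.
perfect = [[4, 0], [0, 4]]   # every annotator agrees on every sample
split = [[2, 2], [2, 2]]     # annotators split evenly on every sample

print(fleiss_kappa(perfect))
print(fleiss_kappa(split))
```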
Figure 5 illustrates the effect of the number of annotated samples at different stages on the Kappa coefficient. In the first round of annotation, the observed Kappa coefficient was relatively high due to the limited number of samples (10) and the relatively simple sentence structures involved. However, when the number of samples increased to 50, the diversity and complexity of the sentences also increased. This could lead to differences in annotators’ understanding of domain-specific entities and annotation rules, thereby increasing the variability in annotation results.
Fig. 5 Results of consistency evaluation.
As the number of annotated samples continued to grow and the annotation rules were gradually standardized, the Kappa coefficient showed a rising trend. When the total number of annotated samples reached 500, the rate of increase in the Kappa coefficient began to slow, indicating that the cognitive differences among annotators were gradually diminishing and approaching consistency. Furthermore, a Kappa coefficient exceeding 0.8 indicates that the annotated dataset has high reliability, providing a solid foundation for subsequent research and applications.
Data statistics
In this study, we conducted a statistical analysis of entities, relations, and events in the dataset, with results summarized in Table 5. The dataset contains a total of 2,574 annotated entities, averaging 3.2 entities per sentence. Among these, Attack Groups appear most frequently as they are the core planners of attacks, active across multiple stages, easily traceable, and highly scrutinized. In contrast, Attack Tools are the least frequent due to their concealed nature and limited focus in recorded data, while the distribution of other entity types is relatively balanced. From 808 unique sentences, 1,506 relations were extracted, averaging 1.9 triples per sentence. Although the Related to relation occurs infrequently, its semantic significance should not be overlooked. The constructed dataset contains 808 sentences, of which 139 are annotated as valid event-triggering sentences, accounting for 17.2%. This distribution is due to the textual characteristics of APT attack reports, which often include background descriptions, technical explanations, and non-event statements. Although such content is semantically relevant, it does not meet the formal definition required for event extraction tasks. Despite the limited overall size, the dataset covers the main argument roles. Weaponization events involving open-source tools are rare due to high technical barriers, insufficient stealth, and the availability of alternative methods. Malicious document exploitation events occur frequently because of their wide distribution and reliance on user interaction, increasing attack success rates. Malware delivery events are common due to diverse techniques and significant impact. Social engineering attacks are frequent as they exploit human weaknesses, are low-cost, and highly practical. Phishing attacks are common due to their broad targets and simple execution. Finally, the dataset was split into training, validation, and test sets in an 8:1:1 ratio.
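An 8:1:1 split with a fixed seed can be sketched as follows. The seed value, the shuffling procedure, and the way remainders are assigned are assumptions; the text does not state how the authors performed the split, so the subset sizes this sketch produces for 808 sentences (646 / 80 / 82) need not match the released files.

```python
import random


def split_811(items, seed=42):
    """Shuffle deterministically and split into train / validation / test (8:1:1).

    Integer arithmetic avoids float rounding; any remainder goes to the test set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Sentence indices stand in for the 808 annotated sentences.
train, val, test = split_811(range(808))
print(len(train), len(val), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.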
Table 5 Frequency statistics of entities, relations, and event types in APT attacks.
Data Records
The dataset APTIE24 is publicly available on Zenodo (https://doi.org/10.5281/zenodo.17129303). It is designed for Chinese-language information extraction tasks in APT threat intelligence and covers three modules: entity recognition, relation extraction, and event extraction, as detailed below:
1. Entity Module: Provides JSONL files exported from the annotation tool, as well as entity annotation files following the BIO encoding standard, facilitating flexible use across different model architectures.
2. Relation Module: Contains detailed annotations of relations between entities and their type definitions, provided in a structured CSV format.
3. Event Module: Includes both JSONL files exported from the annotation tool and event annotation files in the standard format consistent with the DuEE dataset, supporting training and evaluation of mainstream event extraction methods.
All three sub-tasks are split into training, validation, and test sets, ensuring scientific rigor and reproducibility in model development and performance evaluation.
Technical Validation
In this section, we implement and evaluate several representative baseline models on APTIE, including conventional deep learning models and pre-trained models. For the pre-trained models, we selected BERT Base Chinese, RoBERTa-wwm-ext, and MacBERT. These are BERT-based models29 pre-trained specifically on Chinese text and available in the Hugging Face model repository. In addition, for the event extraction task, we further introduced the DEGREE model, a representative generative approach, to explore its capability in modeling complex event structures. Experiments were conducted on an NVIDIA GeForce RTX 4050 GPU using the PyTorch 1.8.1 framework to systematically evaluate the performance of multiple downstream tasks supported by APTIE. To ensure reproducibility, we used fixed random seeds. The model parameter configurations for the three tasks are presented in Tables 6–8, respectively. The entity extraction task achieved a maximum F1 score of 0.90 and relation extraction reached 0.95. Event extraction proved considerably harder: the highest F1 score was 0.66 for trigger recognition and 0.47 for argument extraction. These tasks were evaluated as independent benchmarks for the dataset. Their purpose was to assess the quality and usability of the dataset rather than to study performance gains from joint modeling or module interactions.
Table 6 Hyperparameter settings for the entity extraction task.
Table 7 Hyperparameter settings for the relation extraction task.
Table 8 Hyperparameter settings for the event extraction task.
Evaluation metrics
Precision, recall, and F1 score are three commonly used metrics in information extraction, and they are well suited to the imbalanced class distributions typical of APT datasets. The F1 score is the harmonic mean of precision and recall, reflecting the model’s ability to balance the correctness of its predictions with its coverage of positive samples. All F1 scores for the three tasks are calculated using micro-averaging. In the formulas below, TP denotes the number of true positive predictions, FP the number of false positive predictions, and FN the number of false negative predictions:
$${Precision}=\frac{{TP}}{{TP}+{FP}}$$ (4)
$${Recall}=\frac{{TP}}{{TP}+{FN}}$$ (5)
$$F1=\frac{2\times {Precision}\times {Recall}}{{Precision}+{Recall}}$$ (6)
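To make the micro-averaging concrete, the sketch below pools TP, FP, and FN over all classes before applying Eqs. (4)–(6), and contrasts the result with a macro average. The function names and the per-class counts are invented for illustration and are not results from this study.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, following Eqs. (4)-(6)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def micro_prf(counts):
    """Micro-average: sum counts over classes first, then compute the metrics."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)

def macro_f1(counts):
    """Macro-average: compute F1 per class, then take the unweighted mean."""
    return sum(prf(*c)[2] for c in counts.values()) / len(counts)

# Made-up (tp, fp, fn) counts for a frequent and a rare class.
counts = {"Location": (90, 10, 10), "AttackTool": (5, 5, 15)}
micro_p, micro_r, micro_f = micro_prf(counts)
macro_f = macro_f1(counts)
print(round(micro_f, 4), round(macro_f, 4))
```

Because micro-averaging pools counts, frequent classes dominate the score; in this toy case the micro F1 is noticeably higher than the macro F1, which is worth keeping in mind when reading the per-type tables.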
Baseline models for APT entity extraction
In this study, we employed the DeepKE30 toolkit to train and compare two types of models, BERT-based models and BiLSTM + CRF, for the threat intelligence entity extraction task. Lample et al.31 were among the first to apply neural networks of this kind to named entity recognition, using a BiLSTM to capture contextual representations and a CRF at the output layer to model dependencies between labels. Owing to its strong contextual modeling capability, this architecture was quickly and widely adopted for sequence labeling tasks. In our experiments, we selected the BiLSTM + CRF model for validation; its structure is illustrated in Fig. 6. BERT is a pre-trained language model built upon the Transformer architecture32, which supports parallel computation and efficient handling of long-range dependencies. This allows BERT to learn rich linguistic features and achieve strong performance across various NLP tasks. In this study, we fine-tuned multiple Chinese pre-trained language models on APTIE. The training process is illustrated in Fig. 7.
Fig. 6 BiLSTM–CRF model.
Fig. 7 BERT model.
Baseline models for APT relation extraction
In this study, five models—CNN, RNN, GCN, BERT-Base-Chinese, and RoBERTa-wwm-ext—were trained on the APTIE dataset using DeepKE30. A pipeline-based extraction strategy was employed to evaluate the effectiveness of the dataset. In our experimental setting, relation extraction was performed under the condition that entity boundaries and entity types were correctly identified.
Convolutional Neural Networks (CNNs) were originally widely used in computer vision tasks. Owing to their strong capability in feature extraction, they have also been successfully applied to relation extraction in NLP33. The core idea is to represent an input sentence as a matrix of word embeddings. A one-dimensional convolution is then applied to capture local contextual features, followed by max-pooling to extract the most salient information. These features are finally fed into a fully connected layer or a classifier for relation classification.
Recurrent Neural Networks (RNNs) are a class of neural architectures designed for sequential data and are capable of modeling temporal dependencies within input sequences. In relation extraction, RNNs are used to model sentences containing entity pairs. They incrementally learn the semantic representation of each word in context, thereby capturing potential relational cues34.
Graph Convolutional Networks (GCNs) are neural architectures specifically designed for graph-structured data, capable of effectively modeling relationships between nodes. In relation extraction, GCNs typically construct a graph from the dependency parse tree of a sentence, where words are represented as nodes and dependency relations as edges35. GCNs propagate information across the graph and aggregate features from neighboring nodes, thereby generating word representations enriched with syntactic structure.
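The propagation and aggregation step described above can be sketched as a single graph convolution layer over a toy dependency graph. The symmetric normalization with self-loops used here is the widely used Kipf and Welling formulation and is an assumption; the GCN implementation used in the experiments may differ in its details.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W), in plain Python."""
    n = len(adj)
    # Add self-loops so each word keeps its own features during aggregation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    d_in, d_out = len(weight), len(weight[0])
    # Aggregate features from neighboring nodes.
    agg = [[sum(norm[i][k] * feats[k][d] for k in range(n)) for d in range(d_in)]
           for i in range(n)]
    # Apply the linear transformation followed by ReLU.
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(d_in)))
             for o in range(d_out)] for i in range(n)]

# Toy dependency graph for three words: edges 0-1 and 1-2 (treated as undirected).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # one-hot word features
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, for readability
out = gcn_layer(adj, feats, weight)
```

With identity features and weights, each output row shows directly how much of each neighbor's representation is mixed into a word. Word 0 receives nothing from word 2 after one layer, since they are two hops apart.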
BERT-Base-Chinese is the original Chinese BERT model, whereas RoBERTa-wwm-ext is an enhanced variant that adopts whole word masking and improved pre-training strategies. Compared with BERT-Base-Chinese, RoBERTa-wwm-ext generally captures semantic information in Chinese text more effectively and achieves better performance on downstream tasks.
Baseline models for APT event extraction
In this study, three representative models—ERNIE, Table-Filling, and DEGREE—were employed for APT event extraction on the APTIE dataset. ERNIE is a knowledge-enhanced pre-trained language model that leverages rich semantic representations to capture deep contextual information, making it suitable for identifying complex event semantics in APT text. Table-Filling formulates event extraction as a structured prediction task by filling predefined tables with event-related information, thereby explicitly modeling the structural dependencies among event components. DEGREE adopts a generative approach and utilizes a sequence-to-sequence framework to generate structured event records from raw text, enabling flexible extraction of complex event structures. These three models represent different paradigms—knowledge-enhanced pre-trained modeling, structured prediction, and generative modeling—and together provide a comprehensive evaluation of the APTIE dataset for APT event extraction.
Main results
Multi-token entities are treated as a single entity span. A prediction is considered correct only if both the entity boundaries and the entity type exactly match the gold annotation. Partial matches at the token level are not counted as correct. Table 9 reports the experimental results of BiLSTM + CRF, BERT-Base-Chinese, RoBERTa-wwm-ext, and MacBERT on the APTIE dataset. The results show that BiLSTM + CRF slightly outperforms the three pre-trained models on our annotated dataset. This may be attributed to the limited size of the dataset. The pre-training plus fine-tuning paradigm of BERT typically yields superior performance on large-scale annotated corpora, whereas BiLSTM + CRF is more adaptable under small-sample conditions. Moreover, since cybersecurity corpora are not sufficiently represented in BERT’s pre-training data, the model may struggle to capture domain-specific entities and their contextual relations. In contrast, BiLSTM + CRF can more easily adapt to domain data and thus perform better on such tasks.
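The exact-match criterion described above can be sketched as follows: BIO tag sequences are decoded into typed spans, and a predicted span counts as a true positive only if its boundaries and type both match a gold span. The helper names and the toy tags are illustrative and do not come from the released code.

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into a set of (start, end, type) spans, end exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes an open span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def strict_counts(gold_tags, pred_tags):
    """Return (TP, FP, FN) under exact boundary-and-type matching."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    return tp, len(pred) - tp, len(gold) - tp

# The prediction truncates the three-token Tool entity, so it earns no credit for it.
gold = ["B-Tool", "I-Tool", "I-Tool", "O", "B-Loc"]
pred = ["B-Tool", "I-Tool", "O", "O", "B-Loc"]
tp, fp, fn = strict_counts(gold, pred)
print(tp, fp, fn)
```

Under this criterion a boundary error is penalized twice, once as a false positive and once as a false negative. This is one reason long, mixed-language entity names are costly for strict span-level F1.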
Table 9 Performance comparison of baseline models for entity extraction.
Table 10 shows that the BiLSTM + CRF model performs well on most entity types, such as Location, Attack Group, and Targeted Industry. These entities usually have clear semantics, clear contextual cues, and relatively fixed expression patterns, so the model achieves higher precision and recall on them. However, performance drops significantly for Attack Tool and Malware. Attack Tool refers to legitimate software embedded with malicious code. Malware is a type of tactic, technique, and procedure (TTP) that represents malicious code. In real texts, these entities often mix Chinese and English, have long names, and are semantically ambiguous; for example, they may include software names, version numbers, or functional descriptions. This makes it harder for the model to detect boundaries and learn semantic features. In addition, the Attack Tool category has fewer training samples, which further limits the model’s learning performance. Overall, the results show that complex, ambiguous, and low-resource entity types (such as Attack Tool) require stronger semantic modeling or additional features to improve extraction performance.
Table 10 Entity extraction performance across entity types.
Table 11 reports the performance of CNN, RNN, GCN, BERT-Base-Chinese, and RoBERTa-wwm-ext on the APTIE dataset. In the relation extraction task, GCN achieved the best overall performance, with precision, recall, and F1 score all close to 1, indicating a strong ability to identify positive instances while keeping false positives low. This may be attributed to its ability to explicitly model syntactic dependency structures, which helps capture structural relations between entities. CNN ranked second, also with high precision and recall, likely because it is effective at extracting local contextual patterns and salient lexical features, which are often informative for relation extraction. In contrast, the RNN, BERT-Base-Chinese, and RoBERTa-wwm-ext models performed relatively poorly, with both precision and recall below 0.75. Based on these results, we selected the GCN model for subsequent experiments.
Table 11 Performance comparison of baseline models for relation extraction.
From Table 12, it can be observed that the F1 scores for relations such as Includes and Related to are relatively low. This is primarily due to the limited number of training instances for these relations, indicating that data sparsity constrains the model’s ability to effectively extract such relationships.
Table 12 Relation extraction performance across relation types.
Table 13 presents the performance comparison of baseline models for event extraction. For argument evaluation, a strict matching criterion is adopted, under which a predicted argument is considered correct only when both its boundaries and role type exactly match the gold annotation; partial matches are not considered correct. Among the three models, DEGREE achieves the best performance in trigger word detection, with an F1 score of 0.66, substantially outperforming ERNIE and Table-Filling. This result suggests that the generative framework of DEGREE is more effective in modeling semantic information and contextual dependencies for event trigger identification. In the APT event argument recognition task, however, the performance differences among the three models are relatively small. DEGREE also achieves the highest F1 score, slightly outperforming Table-Filling and ERNIE.
Table 13 Performance comparison of baseline models for event extraction.
Impact of Annotation Design on Extraction Performance
To study the impact of annotation design on extraction performance, we selected the best-performing BiLSTM + CRF model as our baseline and compared two annotation schemes. The coarse-grained scheme labels all attack media as a single entity type, Attack Medium. The fine-grained scheme further divides this type into Social Medium and File Medium. Models trained under both schemes were evaluated using the same settings. As shown in Table 14, the fine-grained scheme yields higher precision and F1 scores for entity recognition, suggesting that this separation reduces semantic ambiguity and enables the model to better leverage contextual information. Overall, this leads to improved extraction performance and supports the appropriateness of our annotation design.
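One simple way to derive the coarse-grained labels from the fine-grained ones is to map both sub-types back to the parent type while preserving the BIO prefix, as sketched below. The tag spellings (`SocialMedium`, `FileMedium`, `AttackMedium`) are assumptions for illustration; the released files may name these labels differently, and the comparison in Table 14 may have been prepared in another way.

```python
# Hypothetical label names; adjust to the spellings used in the released BIO files.
FINE_TO_COARSE = {"SocialMedium": "AttackMedium", "FileMedium": "AttackMedium"}

def coarsen(tags):
    """Map fine-grained BIO tags to the coarse scheme, keeping the B-/I- prefix."""
    out = []
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if not label:  # the "O" tag has no label part
            out.append(tag)
        else:
            out.append(f"{prefix}-{FINE_TO_COARSE.get(label, label)}")
    return out

fine = ["B-FileMedium", "I-FileMedium", "O", "B-SocialMedium", "B-AttackGroup"]
coarse = coarsen(fine)
print(coarse)
```

Deriving the coarse labels from the same annotated sentences keeps the two training sets identical apart from the label inventory, so any difference in scores can be attributed to label granularity.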
Table 14 Impact of annotation scheme on entity extraction performance.
Case Study
On May 7, 2025, the 360 Threat Intelligence Center released a report titled “Analysis of the Latest Attack Activities by APT-C-51 (APT35).” This report was selected as the target for information extraction testing. Figure 8 illustrates the extraction results for one sentence from the report using the best-performing models from the experiments. The input is a segment of text describing an APT-C-51 attack. The entity extraction component identifies entities in the text, such as Attack Group and File Medium. The relation extraction component analyzes the relationships between entities in the sentence. For example, the relation between “APT-C-51” and “LNK file” was identified as “attack group exploits file medium” with a confidence score of 0.99. This prediction is correct, so in this example a high confidence score coincides with an accurate prediction. The event extraction component specifies event types, trigger words, argument roles, and arguments. For instance, in the “malicious document exploitation” event, the trigger word is “file,” and the role types include Attack Group and File Medium. As shown in Fig. 8, the models trained on APTIE successfully extracted entities, relations, and events from the sentence, which supports the validity of the APTIE dataset.
Fig. 8 Example of information extraction.
Entity and relation extraction errors
Consider the sentence “我们捕获到 MuddyWater 组织分别使用伪装成 PDF 的可执行文件以及携带宏代码的 DOC 文件,来投递 UDPGangster 后门攻击载荷” (“We captured that the MuddyWater organization used executable files disguised as PDFs and DOC files carrying macro code to deliver the UDPGangster backdoor payload”), on which the model exhibits two typical errors. First, in the entity extraction stage, the phrase “伪装成 PDF 的可执行文件” (“executable files disguised as PDFs”) is incorrectly split into two independent entities, namely “PDF” and “可执行文件” (“executable file”), failing to capture the composite semantics of the attack carrier. Second, in the relation extraction stage, the Includes relationship between the File Medium and the Malware is missed, resulting in an incomplete representation of the attack structure. These errors stem from the model’s difficulty in recognizing complex compound entities and in capturing implicit relations that require cross-phrase semantic reasoning.
Event extraction errors
Consider the sentence “APT-C-26 (Lazarus) 组织使用武器化的 IPMsg 软件开展攻击活动” (“The APT-C-26 (Lazarus) group used a weaponized IPMsg software to carry out attack activities”). Here the event “Weaponization of Open-source Tools” is sometimes incorrectly identified by the model as a “Malware Delivery” event; that is, the model tends to confuse the “weaponization” stage with the “delivery” stage. We attribute such errors to two main factors. First, action verbs such as “使用” (“use”) and entity mentions such as “软件” (“software”) resemble expression patterns common in the “delivery” scenario, which may interfere with the model’s judgment. Second, the relatively limited number of training samples for this event type constrains the model’s ability to fully learn the corresponding semantic patterns.
Data availability
The dataset APTIE is publicly available at Zenodo (https://doi.org/10.5281/zenodo.17129303).
Code availability
The data processing code and related baseline models are available on GitHub (https://github.com/sunllllll/CTI-dataset). These resources support the reproduction of the experiments reported in this study.
References
Gulbay, B. & Demirci, M. APT-scope: A novel framework to predict advanced persistent threat groups from enriched heterogeneous information network of cyber threat intelligence. Engineering Science and Technology, an International Journal 57, 101791 (2024).
Stojanović, B., Hofer-Schmitz, K. & Kleb, U. APT datasets and attack modeling for automated detection methods: A review. Computers & Security 92, 101734 (2020).
360 Advanced Threat Research Institute. 2024 Global Advanced Persistent Threat Research Report (2024).
Wu, P. et al. A survey of cyber threat intelligence processing methods. Journal of Sichuan University (Natural Science Edition) 60, (2023).
Liu, Y. et al. APTTOOLNER: A Chinese Dataset of Cyber Security Tool for NER Task. in 2023 3rd Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS) 368–373, https://doi.org/10.1109/ACCTCS58815.2023.00097 (2023).
Feng, J. & Gao, J. A survey of information extraction in the field of cyber threat intelligence. Computer Engineering (in press). https://doi.org/10.19678/j.issn.1000-3428.0070621 (2025).
Mulwad, V., Li, W., Joshi, A., Finin, T. & Viswanathan, K. Extracting Information about Security Vulnerabilities from Web Text. in 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 3, 257–260 (2011).
Shafiq, M., Tian, Z., Bashir, A. K., Du, X. & Guizani, M. CorrAUC: A Malicious Bot-IoT Traffic Detection Method in IoT Network Using Machine-Learning Techniques. IEEE Internet of Things Journal 8, 3242–3254 (2021).
Joshi, A., Lal, R., Finin, T. & Joshi, A. Extracting Cybersecurity Related Linked Data from Text. in 2013 IEEE Seventh International Conference on Semantic Computing 252–259, https://doi.org/10.1109/ICSC.2013.50 (2013).
Kim, G., Lee, C., Jo, J. & Lim,