← Back ◬ AI & Machine Learning May 19, 2026

MalwarePT: A Binary-Level Foundation Model for Malware Analysis

arXiv Security Archived May 19, 2026 ✓ Full text saved

arXiv:2605.16455v1 Announce Type: new Abstract: Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a path toward reusable program representations, but their application to malware analysis remains underexplored, and most still operate at byte-level tokenization, limiting their ability to capture multi-byte code

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Cryptography and Security [Submitted on 15 May 2026] MalwarePT: A Binary-Level Foundation Model for Malware Analysis Saastha Vasan, Yuzhou Nie, Kaie Chen, Yigitcan Kaya, Hojjat Aghakhani, Roman Vasilenko, Wenbo Guo, Christopher Kruegel, Giovanni Vigna Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a path toward reusable program representations, but their application to malware analysis remains underexplored, and most still operate at byte-level tokenization, limiting their ability to capture multi-byte code patterns. In this work, we introduce MalwarePT, a binary-level foundation model for malware analysis built on a ModernBERT-style encoder and pretrained with masked language modeling on Windows PE code-section bytes. We study whether a single pretrained encoder can transfer across malware-analysis tasks at different granularities, and how tokenization design affects that transfer. We train a byte-pair encoding (BPE) tokenizer on code-section bytes to compress frequent multi-byte patterns within a fixed context budget. We evaluate MalwarePT on three downstream tasks spanning token-, function-, and document-level prediction: API call prediction, functionality classification, and malware (program) detection under temporal drift. Our evaluation demonstrates that pretraining yields substantial gains for API call prediction and functionality classification, and that increasing the BPE vocabulary beyond the byte-level baseline improves performance, with the strongest overall tradeoff at a vocabulary size of 1,024 tokens. In malware detection at FPR ~ 0.001, MalwarePT outperforms the neural network baselines, and is complementary to feature-engineering models that rely on PE structure. We also compare against existing binary foundation models and show that MalwarePT's design choices yield gains across all downstream tasks. Comments: Preprint. Under review Subjects: Cryptography and Security (cs.CR) Cite as: arXiv:2605.16455 [cs.CR] (or arXiv:2605.16455v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.16455 Focus to learn more Submission history From: Saastha Vasan [view email] [v1] Fri, 15 May 2026 05:31:59 UTC (333 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev | next > new | recent | 2026-05 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes