
EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild

arXiv Security, archived Apr 03, 2026



Computer Science > Cryptography and Security
[Submitted on 2 Apr 2026]

Yiming Fan (1), Jun Yeon Won (1), Ding Zhu (1), Melih Sirlanci (1), Mahdi Khalili (1), Carter Yagemann (1)
((1) The Ohio State University)

Abstract: Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications.

We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.

Comments: 13 pages, 7 figures. This is a technical report for the EXHIB benchmark. Code and data are available at this https URL
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
ACM classes: D.4.6; K.6.5
Cite as: arXiv:2604.01554 [cs.CR] (or arXiv:2604.01554v1 [cs.CR] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.01554

Submission history
From: Yiming Fan
[v1] Thu, 2 Apr 2026 02:58:55 UTC (273 KB)
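The abstract does not say how the nine models are scored, so the following is only a minimal sketch of one common way to evaluate function similarity: treat each function as an embedding vector, rank a pool of candidates by cosine similarity to a query, and report Recall@1 and mean reciprocal rank (MRR). The embeddings, function labels, metric choice, and helper names (`cosine`, `rank_of_match`, `evaluate`) are all illustrative assumptions, not the EXHIB protocol or any model's API.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Labelled = Tuple[str, Vector]  # (function label, embedding); both hypothetical


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of_match(query: Vector, pool: List[Labelled], target: str) -> int:
    """1-based rank of the pool entry labelled `target`, most similar first."""
    ordered = sorted(pool, key=lambda item: cosine(query, item[1]), reverse=True)
    for position, (label, _) in enumerate(ordered, start=1):
        if label == target:
            return position
    raise ValueError(f"no pool entry labelled {target!r}")


def evaluate(queries: List[Labelled], pool: List[Labelled]) -> Dict[str, float]:
    """Recall@1 and MRR over all queries (assumed metrics, not from the paper)."""
    ranks = [rank_of_match(vec, pool, label) for label, vec in queries]
    return {
        "recall@1": sum(r == 1 for r in ranks) / len(ranks),
        "mrr": sum(1.0 / r for r in ranks) / len(ranks),
    }


# Toy data: two query functions compiled differently from their pool versions.
pool = [("memcpy", [1.0, 0.0]), ("strlen", [0.0, 1.0]), ("parse", [0.7, 0.7])]
queries = [("memcpy", [0.9, 0.1]), ("strlen", [0.6, 0.8])]

metrics = evaluate(queries, pool)
print(metrics)
```

In this toy run the second query drifts toward the unrelated `parse` embedding, so its true match falls to rank 2. That is the kind of miss a benchmark with more diverse data is meant to expose.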
