EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
[Submitted on 2 Apr 2026]
Yiming Fan (1), Jun Yeon Won (1), Ding Zhu (1), Melih Sirlanci (1), Mahdi Khalili (1), Carter Yagemann (1) ((1) The Ohio State University)
Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications.
We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.
Comments: 13 pages, 7 figures. This is a technical report for the EXHIB benchmark. Code and data are available at this https URL
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
ACM classes: D.4.6; K.6.5
Cite as: arXiv:2604.01554 [cs.CR]
(or arXiv:2604.01554v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2604.01554
Submission history
From: Yiming Fan
[v1] Thu, 2 Apr 2026 02:58:55 UTC (273 KB)