Evaluating the Effectiveness of LLMs in Aiding Compliance Testing of PKCS#1-v1.5
[Submitted on 16 Jun 2026]
Polina Kozyreva, Endadul Hoque
Testing implementations of binary protocols for specification compliance requires inputs that satisfy both structural and semantic constraints. Purely random generation and primitive mutations are often insufficient for exploring semantically meaningful behaviors in protocols that rely on Type-Length-Value (TLV) encoding, yet domain-specific compliance testing tools require deep protocol expertise and significant manual effort to construct. This work investigates whether grammar-level mutation combined with LLM-based code synthesis can serve as a viable, more generalizable approach to specification compliance testing.
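The difference between structural and semantic constraints in a TLV encoding can be illustrated with a small sketch. The code below is not taken from the paper: it is a minimal DER-style encoder and strict decoder, and the helper names (`encode_tlv`, `decode_tlv`, `mutate_length`) are hypothetical. It shows why a primitive byte-level mutation of a length field tends to be rejected for structural reasons before any semantic check is reached, whereas re-encoding a changed value keeps the structure valid.

```python
from typing import List, Tuple


def encode_tlv(tag: int, value: bytes) -> bytes:
    """Encode one DER-style TLV element (short or long definite length)."""
    n = len(value)
    if n < 0x80:
        length = bytes([n])
    else:
        raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
        length = bytes([0x80 | len(raw)]) + raw
    return bytes([tag]) + length + value


def decode_tlv(data: bytes) -> List[Tuple[int, bytes]]:
    """Strictly decode consecutive TLV elements; raise ValueError on bad structure."""
    out = []
    i = 0
    while i < len(data):
        if i + 2 > len(data):
            raise ValueError("truncated header")
        tag, first = data[i], data[i + 1]
        i += 2
        if first < 0x80:
            n = first
        else:
            k = first & 0x7F
            if k == 0 or i + k > len(data):
                raise ValueError("bad length encoding")
            n = int.from_bytes(data[i:i + k], "big")
            i += k
        if i + n > len(data):
            raise ValueError("length exceeds available bytes")
        out.append((tag, data[i:i + n]))
        i += n
    return out


def mutate_length(encoded: bytes, delta: int) -> bytes:
    """Primitive mutation (illustrative): shift the short-form length byte by delta."""
    mutated = bytearray(encoded)
    mutated[1] = (mutated[1] + delta) & 0x7F
    return bytes(mutated)


if __name__ == "__main__":
    element = encode_tlv(0x04, b"abc")  # OCTET STRING holding b"abc"
    for delta in (2, -1):
        try:
            decode_tlv(mutate_length(element, delta))
        except ValueError as exc:
            print(f"delta={delta}: rejected ({exc})")
    # A structure-aware change re-encodes the element, so it still parses.
    print(decode_tlv(encode_tlv(0x04, b"abcd")))
```

In this sketch both length mutations fail in the decoder, so the implementation under test would never reach its semantic checks. That is the gap a grammar-level mutation is meant to close.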
We evaluate the approach on PKCS#1 v1.5 signature verification -- a widely deployed TLV-encoded standard with a formally verified testing oracle (Morpheus) -- across 48 cryptographic library implementations. We reproduced 10 of 13 non-trivial specification violation categories previously identified by Morpheus, including all 5 signature forgery categories, and discovered 1 previously unreported discrepancy. We found that LLM hallucination -- occurring in 82.5% of generated scripts -- is the primary factor limiting effectiveness, not the mutation strategies. We identify five distinct hallucination types and show that their distribution varies systematically across mutation categories: structural mutations are implemented with 13.3% fidelity while constraint mutations achieve 30.3% correctness but suffer the highest rate of mutations being fully ignored (8.1%). These findings reveal a striking gap between operational reliability (99.8%) and semantic fidelity (17.5%), providing actionable guidance on when LLM-based code synthesis can be trusted in specification-driven testing pipelines.
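For readers unfamiliar with the format under test, the following sketch shows the PKCS#1 v1.5 encoded message for SHA-256 (`0x00 0x01 || PS || 0x00 || DigestInfo`, as defined in RFC 8017) and contrasts a strict encode-and-compare check with a deliberately flawed lenient parser. It operates only on the encoded message, that is, the bytes obtained after the RSA public-key operation, so no RSA arithmetic is involved. The function names are hypothetical, the lenient parser is a contrived illustration of a well-known class of parsing flaw, and none of this is claimed to reproduce the paper's tooling, Morpheus, or any specific violation category reported by the authors.

```python
import hashlib

# DER prefix of DigestInfo for SHA-256 (RFC 8017, Section 9.2, Note 1).
SHA256_DIGESTINFO_PREFIX = bytes.fromhex("3031300d060960864801650304020105000420")


def digest_info(message: bytes) -> bytes:
    """Return the DER-encoded DigestInfo for SHA-256 over message."""
    return SHA256_DIGESTINFO_PREFIX + hashlib.sha256(message).digest()


def emsa_pkcs1_v15_encode(message: bytes, em_len: int) -> bytes:
    """Build 0x00 0x01 || PS || 0x00 || DigestInfo with at least 8 bytes of 0xFF padding."""
    t = digest_info(message)
    if em_len < len(t) + 11:
        raise ValueError("intended encoded message length too short")
    ps = b"\xff" * (em_len - len(t) - 3)
    return b"\x00\x01" + ps + b"\x00" + t


def strict_check(em: bytes, message: bytes) -> bool:
    """Encode-and-compare: accept only the single expected encoding."""
    try:
        return em == emsa_pkcs1_v15_encode(message, len(em))
    except ValueError:
        return False


def lenient_check(em: bytes, message: bytes) -> bool:
    """Deliberately flawed: ignores padding length and any trailing bytes."""
    if em[:2] != b"\x00\x01":
        return False
    i = 2
    while i < len(em) and em[i] == 0xFF:
        i += 1
    if i >= len(em) or em[i] != 0x00:
        return False
    t = digest_info(message)
    return em[i + 1:i + 1 + len(t)] == t


def shorten_padding(message: bytes, em_len: int) -> bytes:
    """Illustrative mutation: keep 8 padding bytes and fill the tail with zeros."""
    t = digest_info(message)
    head = b"\x00\x01" + b"\xff" * 8 + b"\x00" + t
    return head + b"\x00" * (em_len - len(head))


if __name__ == "__main__":
    msg = b"hello"
    mutated = shorten_padding(msg, 256)
    print("strict accepts mutated: ", strict_check(mutated, msg))
    print("lenient accepts mutated:", lenient_check(mutated, msg))
```

A compliance test in this style feeds the same mutated input to several verifiers and flags any that disagree with the strict reference. In the sketch only the lenient parser accepts the mutated encoding.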
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2606.18405 [cs.CR]
(or arXiv:2606.18405v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2606.18405
Submission history
From: Endadul Hoque
[v1] Tue, 16 Jun 2026 18:54:57 UTC (83 KB)