← Back ◬ AI & Machine Learning Jun 18, 2026

Evaluating the Effectiveness of LLMs in Aiding Compliance Testing of PKCS#1-v1.5

arXiv Security Archived Jun 18, 2026 ✓ Full text saved

arXiv:2606.18405v1 Announce Type: new Abstract: Testing implementations of binary protocols for specification compliance requires inputs that satisfy both structural and semantic constraints. Purely random generation and primitive mutations are often insufficient for exploring semantically meaningful behaviors in protocols that rely on Type-Length-Value (TLV) encoding, yet domain-specific compliance testing tools require deep protocol expertise and significant manual effort to construct. This wo

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Cryptography and Security [Submitted on 16 Jun 2026] Evaluating the Effectiveness of LLMs in Aiding Compliance Testing of PKCS#1-v1.5 Polina Kozyreva, Endadul Hoque Testing implementations of binary protocols for specification compliance requires inputs that satisfy both structural and semantic constraints. Purely random generation and primitive mutations are often insufficient for exploring semantically meaningful behaviors in protocols that rely on Type-Length-Value (TLV) encoding, yet domain-specific compliance testing tools require deep protocol expertise and significant manual effort to construct. This work investigates whether grammar-level mutation combined with LLM-based code synthesis can serve as a viable, more generalizable approach to specification compliance testing. We evaluate the approach on PKCS#1 v1.5 signature verification -- a widely deployed TLV-encoded standard with a formally verified testing oracle (Morpheus) -- across 48 cryptographic library implementations. We reproduced 10 of 13 non-trivial specification violation categories previously identified by Morpheus, including all 5 signature forgery categories, and discovered 1 previously unreported discrepancy. We found that LLM hallucination -- occurring in 82.5% of generated scripts -- is the primary factor limiting effectiveness, not the mutation strategies. We identify five distinct hallucination types and show that their distribution varies systematically across mutation categories: structural mutations are implemented with 13.3% fidelity while constraint mutations achieve 30.3% correctness but suffer the highest rate of mutations being fully ignored (8.1%). These findings reveal a striking gap between operational reliability (99.8%) and semantic fidelity (17.5%), providing actionable guidance on when LLM-based code synthesis can be trusted in specification-driven testing pipelines. Subjects: Cryptography and Security (cs.CR) Cite as: arXiv:2606.18405 [cs.CR] (or arXiv:2606.18405v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2606.18405 Focus to learn more Submission history From: Endadul Hoque [view email] [v1] Tue, 16 Jun 2026 18:54:57 UTC (83 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev | next > new | recent | 2026-06 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes