Bayesian-Calibrated Detection of Hallucinated Package Imports in AI-Assisted Code
[Submitted on 11 Jun 2026]
Lom M. Hillah (1 and 2), Jean-Marc Richard (1), Ryan Hasnaoui (1) ((1) NewCo Partners, Paris, France, (2) Sorbonne Université, CNRS, LIP6, Paris, France)
We present a Bayesian calibration layer for slopsquat detectors -- those that flag hallucinated package imports in code produced by large language models (LLMs). Where existing pipelines emit binary decisions (flag / do-not-flag), our layer emits a Beta-posterior probability per detection, derived from a 3-category epistemic taxonomy that explicitly classifies each prior as empirically calibrated, constructively argued, or engineering-judgement-traced. Beyond the primary 200/404 registry channel, the calibrated layer exploits PyPI metadata signals -- package age, release count, author descriptor, summary -- to surface registered-but-suspicious packages that a binary registry detector misses, which is the realistic post-LLM-emission attacker regime. The resulting risk-aware primitive is directly consumable by downstream CI gates and supports principled threshold decisions across detection rules. We evaluate the calibration on a merged corpus of 1,734 Python snippets -- a stratified 189-prompt BigCodeBench slice plus a 100-prompt niche-library stress-test set, generated across a six-model panel spanning four cloud models (Claude-Sonnet-4.6, Mistral-Large, DeepSeek-v4-pro, DeepSeek-R1) and two local open-weight code models (Mistral Codestral, Meta CodeLlama). Against a re-implemented binary baseline inspired by Mahmud et al. -- which shares its registry oracle with our ground truth and therefore serves as a degenerate upper bound rather than a genuine competitor -- the calibrated layer reproduces the strict-registry detections and introduces well-calibrated additional flags on the metadata channel. We assess detector asymmetry with a McNemar paired test and calibration with both a flagged-subset Expected Calibration Error and a strictly proper full-corpus Brier score.
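To make the quantities named in the abstract concrete, the following is a minimal sketch of a Beta-posterior probability for one detection rule, together with the two calibration metrics mentioned (Brier score and Expected Calibration Error). It is not the authors' implementation: the abstract does not give the prior parameters, the update rule, or the binning scheme, so the conjugate Beta-Binomial update, the equal-width bins, and every function name and number below are assumptions chosen for illustration.

```python
# Illustrative sketch only; names, priors, and counts are hypothetical,
# not taken from the paper.

def beta_posterior_mean(alpha_prior, beta_prior, true_hits, false_hits):
    """Posterior mean of a Beta(alpha, beta) prior after observing
    `true_hits` confirmed detections and `false_hits` false alarms
    (standard conjugate Beta-Binomial update)."""
    alpha_post = alpha_prior + true_hits
    beta_post = beta_prior + false_hits
    return alpha_post / (alpha_post + beta_post)


def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and
    binary outcomes (0 or 1). Lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins: the size-weighted average gap between
    mean predicted probability and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - pos_rate)
    return ece


if __name__ == "__main__":
    # Hypothetical rule: uniform Beta(1, 1) prior, 3 confirmed flags, 1 false alarm.
    print(beta_posterior_mean(1, 1, 3, 1))
    # Hypothetical predictions and ground-truth labels.
    probs = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 1, 0, 0]
    print(brier_score(probs, labels))
    print(expected_calibration_error(probs, labels))
```

A downstream CI gate could compare such a posterior mean against a threshold instead of consuming a binary flag, which is the kind of threshold decision the abstract describes; how the paper maps its three prior categories to concrete Beta parameters is not stated here.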
Comments: 23 pages, 2 figures, 5 tables
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
ACM classes: D.2.4; K.6.5
Cite as: arXiv:2606.13918 [cs.SE]
(or arXiv:2606.13918v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2606.13918
Submission history
From: Lom Hillah
[v1] Thu, 11 Jun 2026 21:15:39 UTC (48 KB)