Bayesian-Calibrated Detection of Hallucinated Package Imports in AI-Assisted Code

arXiv Security, archived Jun 15, 2026

Computer Science > Software Engineering
[Submitted on 11 Jun 2026]

Lom M. Hillah (1 and 2), Jean-Marc Richard (1), Ryan Hasnaoui (1)
((1) NewCo Partners, Paris, France, (2) Sorbonne Université, CNRS, LIP6, Paris, France)

We present a Bayesian calibration layer for slopsquat detectors -- those that flag hallucinated package imports in code produced by large language models (LLMs). Where existing pipelines emit binary decisions (flag / do-not-flag), our layer emits a Beta-posterior probability per detection, derived from a 3-category epistemic taxonomy that explicitly classifies each prior as empirically calibrated, constructively argued, or engineering-judgement-traced. Beyond the primary 200/404 registry channel, the calibrated layer exploits PyPI metadata signals -- package age, release count, author descriptor, summary -- to surface registered-but-suspicious packages that a binary registry detector misses, which is the realistic post-LLM-emission attacker regime. The resulting risk-aware primitive is directly consumable by downstream CI gates and supports principled threshold decisions across detection rules. We evaluate the calibration on a merged corpus of 1,734 Python snippets -- a stratified 189-prompt BigCodeBench slice plus a 100-prompt niche-library stress-test set, generated across a six-model panel spanning four cloud models (Claude-Sonnet-4.6, Mistral-Large, DeepSeek-v4-pro, DeepSeek-R1) and two local open-weight code models (Mistral Codestral, Meta CodeLlama). Against a re-implemented binary baseline inspired by Mahmud et al. -- which shares its registry oracle with our ground truth and therefore serves as a degenerate upper bound rather than a genuine competitor -- the calibrated layer reproduces the strict-registry detections and introduces well-calibrated additional flags on the metadata channel. We assess detector asymmetry with a McNemar paired test and calibration with both a flagged-subset Expected Calibration Error and a strictly proper full-corpus Brier score.

Comments: 23 pages, 2 figures, 5 tables
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
ACM classes: D.2.4; K.6.5
Cite as: arXiv:2606.13918 [cs.SE] (or arXiv:2606.13918v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2606.13918

Submission history
From: Lom Hillah
[v1] Thu, 11 Jun 2026 21:15:39 UTC (48 KB)
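
To make the quantities named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how priors are chosen or how metadata signals enter the posterior, so the function names, the uniform Beta(1, 1) prior used in the example, the ten-bin ECE, and the exact (binomial) form of the McNemar test are all assumptions made for illustration.

```python
from math import comb


def beta_posterior_mean(prior_alpha, prior_beta, confirmed, refuted):
    """Posterior mean of P(flag is correct) under a Beta prior.

    `confirmed` and `refuted` are hypothetical counts of past flags from one
    detection rule that turned out right or wrong (assumed inputs).
    """
    alpha = prior_alpha + confirmed
    beta = prior_beta + refuted
    return alpha / (alpha + beta)


def brier_score(probs, labels):
    """Mean squared gap between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(confidence - accuracy)
    return ece


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: snippets flagged only by detector A; c: flagged only by detector B.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Example: uniform prior, 8 confirmed and 2 refuted flags for one rule.
posterior = beta_posterior_mean(1, 1, confirmed=8, refuted=2)
print(posterior)  # 0.75
```

Under these assumptions, a CI gate could compare `posterior` against a threshold instead of consuming a binary flag. The abstract's distinction between the two calibration metrics would correspond to calling `expected_calibration_error` on flagged detections only and `brier_score` on every snippet in the corpus.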
