← Back ◬ AI & Machine Learning Apr 08, 2026

Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode

arXiv Security Archived Apr 08, 2026 ✓ Full text saved

arXiv:2604.04978v1 Announce Type: cross Abstract: Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate and 17% false negative rate on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user's intent is clear but the target scope, blast radius, or risk

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Software Engineering [Submitted on 4 Apr 2026] Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode Zimo Ji, Zongjie Li, Wenyuan Jiang, Yudong Gao, Shuai Wang Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate and 17% false negative rate on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user's intent is clear but the target scope, blast radius, or risk level is underspecified. Using AmPermBench, a 128-prompt benchmark spanning four DevOps task families and three controlled ambiguity axes, we evaluate 253 state-changing actions at the individual action level against oracle ground truth. Our findings characterize auto mode's scope-escalation coverage under this stress-test workload. The end-to-end false negative rate is 81.0% (95% CI: 73.8%-87.4%), substantially higher than the 17% reported on production traffic, reflecting a fundamentally different workload rather than a contradiction. Notably, 36.8% of all state-changing actions fall outside the classifier's scope via Tier 2 (in-project file edits), contributing to the elevated end-to-end FNR. Even restricting to the 160 actions the classifier actually evaluates (Tier 3), the FNR remains 70.3%, while the FPR rises to 31.9%. The Tier 2 coverage gap is most pronounced on artifact cleanup (92.9% FNR), where agents naturally fall back to editing state files when the expected CLI is unavailable. These results highlight a coverage boundary worth examining: auto mode assumes dangerous actions transit the shell, but agents routinely achieve equivalent effects through file edits that the classifier does not evaluate. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2604.04978 [cs.SE] (or arXiv:2604.04978v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.04978 Focus to learn more Submission history From: Zimo Ji [view email] [v1] Sat, 4 Apr 2026 17:56:30 UTC (517 KB) Access Paper: HTML (experimental) view license Current browse context: cs.SE < prev | next > new | recent | 2026-04 Change to browse by: cs cs.AI cs.CR References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes