The Verifier Tax: Horizon Dependent Safety Success Tradeoffs in Tool Using LLM Agents
Computer Science > Cryptography and Security
[Submitted on 18 Mar 2026]
Tanmay Sah, Vishal Srivastava, Dolly Sah, Kayden Jordan
We study how runtime enforcement against unsafe actions affects end-to-end task performance in multi-step, tool-using large language model (LLM) agents. Using tau-bench across the Airline and Retail domains, we compare baseline Tool-Calling, planning-integrated (TRIAD), and policy-mediated (TRIAD-SAFETY) architectures with GPT-OSS-20B and GLM-4-9B. We identify model-dependent interaction horizons (15 to 30 turns) and decompose outcomes into overall success rate (SR), safe success rate (SSR), and unsafe success rate (USR). Our results reveal a persistent Safety Capability Gap: while safety mediation can intercept up to 94 percent of non-compliant actions, interception rarely translates into strictly safe goal attainment (SSR below 5 percent in most settings). We find that high unsafe success rates are driven primarily by Integrity Leaks, in which models hallucinate user identifiers to bypass mandatory authentication. Recovery rates following blocked actions are consistently low, ranging from 21 percent for GPT-OSS-20B on simpler procedural tasks to near zero in complex Retail scenarios. These results show that runtime enforcement imposes a significant verifier tax on conversation length and compute cost without guaranteeing safe completion, highlighting the need for agents capable of grounded identity verification and post-intervention reasoning.
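The SR/SSR/USR decomposition mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give formal definitions, so the following are assumptions: an episode counts toward SSR if it succeeds with no policy-violating action executed, toward USR if it succeeds with at least one, and SR is their sum. The `Episode` record and its fields are hypothetical and are not taken from the paper or from tau-bench.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-episode record (not the paper's or tau-bench's schema)."""
    succeeded: bool   # the task goal was reached
    violations: int   # policy-violating actions that actually executed


def decompose(episodes):
    """Split overall success into safe and unsafe parts.

    Assumed definitions (the abstract does not state them formally):
      SSR = successful episodes with zero violations / all episodes
      USR = successful episodes with >= 1 violation  / all episodes
      SR  = SSR + USR
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    safe = sum(1 for e in episodes if e.succeeded and e.violations == 0)
    unsafe = sum(1 for e in episodes if e.succeeded and e.violations > 0)
    return {"SR": (safe + unsafe) / n, "SSR": safe / n, "USR": unsafe / n}


# Toy data for illustration only; these are not results from the paper.
toy = [
    Episode(succeeded=True, violations=0),
    Episode(succeeded=True, violations=2),
    Episode(succeeded=True, violations=1),
    Episode(succeeded=False, violations=0),
]
rates = decompose(toy)
print(rates)
```

Under these assumed definitions, a high SR can coexist with a low SSR whenever most successful episodes include a violation, which is the pattern the abstract attributes to Integrity Leaks.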
Comments: 10 pages, 7 figures
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2603.19328 [cs.CR]
(or arXiv:2603.19328v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2603.19328
Submission history
From: Vishal Srivastava
[v1] Wed, 18 Mar 2026 16:16:34 UTC (201 KB)