← Back ◬ AI & Machine Learning Jun 03, 2026

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

arXiv AI Archived Jun 03, 2026 ✓ Full text saved

arXiv:2606.03054v1 Announce Type: new Abstract: Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibi

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Artificial Intelligence [Submitted on 2 Jun 2026] ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.03054 [cs.AI] (or arXiv:2606.03054v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.03054 Focus to learn more Submission history From: Anjie Liu [view email] [v1] Tue, 2 Jun 2026 02:44:27 UTC (2,006 KB) Access Paper: HTML (experimental) view license Current browse context: cs.AI < prev | next > new | recent | 2026-06 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes