Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL
arXiv AIArchived Jun 15, 2026✓ Full text saved
arXiv:2606.14211v1 Announce Type: new Abstract: LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback -- even for questions they correctly answered
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Artificial Intelligence
[Submitted on 12 Jun 2026]
Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL
Yinglun Zhu
LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback -- even for questions they correctly answered -- and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate 44.4\% \to 7.7\%) and task accuracy (e.g., 75.1\% \to 76.5\%) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2606.14211 [cs.AI]
(or arXiv:2606.14211v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2606.14211
Focus to learn more
Submission history
From: Yinglun Zhu [view email]
[v1] Fri, 12 Jun 2026 07:47:15 UTC (200 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.AI
< prev | next >
new | recent | 2026-06
Change to browse by:
cs
cs.LG
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)