CyberIntel ⬡ News
★ Saved ◆ Cyber Reads
← Back ◬ AI & Machine Learning Apr 21, 2026

Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

arXiv AI Archived Apr 21, 2026 ✓ Full text saved

arXiv:2604.16890v1 Announce Type: new Abstract: Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective

Full text archived locally
✦ AI Summary · Claude Sonnet


    Computer Science > Artificial Intelligence [Submitted on 18 Apr 2026] Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning Benteng Chen, Weida Wang, Shufei Zhang, Mingbao Lin, Min Zhang Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0\% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods. Comments: This paper has been accepted for publication at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026) Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.16890 [cs.AI]   (or arXiv:2604.16890v1 [cs.AI] for this version)   https://doi.org/10.48550/arXiv.2604.16890 Focus to learn more Submission history From: Weida Wang [view email] [v1] Sat, 18 Apr 2026 07:48:53 UTC (2,893 KB) Access Paper: HTML (experimental) view license Current browse context: cs.AI < prev   |   next > new | recent | 2026-04 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
    💬 Team Notes
    Article Info
    Source
    arXiv AI
    Category
    ◬ AI & Machine Learning
    Published
    Apr 21, 2026
    Archived
    Apr 21, 2026
    Full Text
    ✓ Saved locally
    Open Original ↗