← Back ◬ AI & Machine Learning Jun 18, 2026

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

arXiv AI Archived Jun 18, 2026 ✓ Full text saved

arXiv:2606.19327v1 Announce Type: new Abstract: Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, t

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Artificial Intelligence [Submitted on 17 Jun 2026] Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose \textbf{Rubric-Conditioned Self-Distillation}, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2606.19327 [cs.AI] (or arXiv:2606.19327v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.19327 Focus to learn more Submission history From: Siyi Gu [view email] [v1] Wed, 17 Jun 2026 17:54:04 UTC (1,610 KB) Access Paper: HTML (experimental) view license Current browse context: cs.AI < prev | next > new | recent | 2026-06 Change to browse by: cs cs.CL References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes