CyberIntel ⬡ News
★ Saved ◆ Cyber Reads
← Back ◬ AI & Machine Learning May 27, 2026

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

arXiv AI Archived May 27, 2026 ✓ Full text saved

arXiv:2605.26657v1 Announce Type: new Abstract: Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emph{completion} (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emph{optimality} (matching the dynamic-programming reference given completion). Under PPO w

Full text archived locally
✦ AI Summary · Claude Sonnet


    Computer Science > Artificial Intelligence [Submitted on 26 May 2026] Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems Wolfgang Maass, Sabine Janzen Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emph{completion} (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emph{optimality} (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate: the penalty's equilibrium drives the dominant-activity share to zero, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap (\Delta M_{\text{final}} = 0.271) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at H = 15 consistent with the H^* boundary (H^* \in [6, 14] under the NBA parameters). Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.26657 [cs.AI]   (or arXiv:2605.26657v1 [cs.AI] for this version)   https://doi.org/10.48550/arXiv.2605.26657 Focus to learn more Submission history From: Wolfgang Maass [view email] [v1] Tue, 26 May 2026 07:43:03 UTC (474 KB) Access Paper: HTML (experimental) view license Current browse context: cs.AI < prev   |   next > new | recent | 2026-05 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
    💬 Team Notes
    Article Info
    Source
    arXiv AI
    Category
    ◬ AI & Machine Learning
    Published
    May 27, 2026
    Archived
    May 27, 2026
    Full Text
    ✓ Saved locally
    Open Original ↗