CyberIntel ⬡ News
★ Saved ◆ Cyber Reads
← Back ◬ AI & Machine Learning Jun 17, 2026

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

arXiv AI Archived Jun 17, 2026 ✓ Full text saved

arXiv:2606.17546v1 Announce Type: new Abstract: Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior.

Full text archived locally
✦ AI Summary · Claude Sonnet


    Computer Science > Artificial Intelligence [Submitted on 16 Jun 2026] SEAGym: An Evaluation Environment for Self-Evolving LLM Agents Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.17546 [cs.AI]   (or arXiv:2606.17546v1 [cs.AI] for this version)   https://doi.org/10.48550/arXiv.2606.17546 Focus to learn more Submission history From: Congjie Zheng [view email] [v1] Tue, 16 Jun 2026 05:50:55 UTC (2,131 KB) Access Paper: HTML (experimental) view license Current browse context: cs.AI < prev   |   next > new | recent | 2026-06 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
    💬 Team Notes
    Article Info
    Source
    arXiv AI
    Category
    ◬ AI & Machine Learning
    Published
    Jun 17, 2026
    Archived
    Jun 17, 2026
    Full Text
    ✓ Saved locally
    Open Original ↗