Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
arXiv AIArchived May 12, 2026✓ Full text saved
arXiv:2605.08518v1 Announce Type: new Abstract: Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive{} challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on \assetops{}. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners r
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Artificial Intelligence
[Submitted on 8 May 2026]
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
Dhaval Patel, Chathurangi Shyalika, Suryanarayana Reddy Yarrabothula, Ling Yue, Shuxin Lin, Nianjun Zhou, James Rayfield
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive{} challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on \assetops{}. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion \assetopslive{} system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73\%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r{=}0.69) but negatively in execution (r{=}{-}0.13), with several 45.45\% public execution systems reaching 63.64\% on the hidden set. Third, the \tmatch{} term is numerically almost inert in the official composite -- combined on a 0--1 scale with 0--100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3\% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails -- response selection, contamination cleanup, fallback, and context control -- rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.
Comments: 43 pages, 32 Figures
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.08518 [cs.AI]
(or arXiv:2605.08518v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2605.08518
Focus to learn more
Submission history
From: Dhaval Patel Dr [view email]
[v1] Fri, 8 May 2026 22:00:58 UTC (9,867 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.AI
< prev | next >
new | recent | 2026-05
Change to browse by:
cs
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)