Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
[Submitted on 31 Mar 2026]
Aaditya Khanal, Yangyang Tao, Junxiu Zhou
Existing benchmarks measure capability (whether a model succeeds on a single attempt), but production deployments
require reliability: consistent success across repeated attempts on tasks of varying duration. We show that these
properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to
this divergence.
We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve
(RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We
evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains.
Key findings: (1) reliability decay is domain-stratified: software engineering (SE) GDS drops from 0.90 to 0.44,
while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier, so high VAF is a
capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with
multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they
attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds hurt long-horizon performance
for all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.
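
The abstract's distinction between capability and reliability can be illustrated with a small sketch. This is not the paper's definition of RDC, VAF, GDS, or MOP, which the abstract does not spell out. It assumes a hypothetical record format of (task_id, duration_bucket, success) and compares the mean single-attempt success rate with a stricter measure: the fraction of tasks on which every recorded attempt succeeded. The function names and the sample outcomes are made up for illustration.

```python
from collections import defaultdict


def pass_at_1(attempts):
    """Mean single-attempt success rate, averaged over tasks."""
    return sum(sum(a) / len(a) for a in attempts.values()) / len(attempts)


def all_attempts_rate(attempts):
    """Fraction of tasks on which every recorded attempt succeeded."""
    return sum(all(a) for a in attempts.values()) / len(attempts)


def by_bucket(episodes):
    """Group (task_id, bucket, success) records into per-bucket attempt lists."""
    grouped = defaultdict(lambda: defaultdict(list))
    for task_id, bucket, success in episodes:
        grouped[bucket][task_id].append(bool(success))
    return grouped


# Made-up illustrative outcomes (assumption), not data from the paper.
episodes = [
    ("t1", "short", 1), ("t1", "short", 1), ("t1", "short", 1),
    ("t2", "short", 1), ("t2", "short", 1), ("t2", "short", 0),
    ("t3", "long", 1), ("t3", "long", 0), ("t3", "long", 1),
    ("t4", "long", 0), ("t4", "long", 1), ("t4", "long", 0),
]

# Map each bucket to (pass@1, all-attempts success rate).
summary = {
    bucket: (pass_at_1(tasks), all_attempts_rate(tasks))
    for bucket, tasks in by_bucket(episodes).items()
}
```

In this toy data the two measures already disagree on short tasks, and the gap is complete on long ones, where half of single attempts succeed but no task succeeds on every attempt. That is the kind of divergence a single-attempt score cannot show.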
Comments: 23 pages, 4 figures
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29231 [cs.AI]
(or arXiv:2603.29231v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29231
Submission history
From: Aaditya Khanal
[v1] Tue, 31 Mar 2026 03:56:39 UTC (490 KB)