← Back ◬ AI & Machine Learning Mar 26, 2026

Internal Safety Collapse in Frontier Large Language Models

arXiv Security Archived Mar 26, 2026 ✓ Full text saved

arXiv:2603.23509v1 Announce Type: cross Abstract: This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Computation and Language [Submitted on 4 Mar 2026] Internal Safety Collapse in Frontier Large Language Models Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: this https URL Comments: 15 pages of the main text, qualitative examples of jailbreaks may be harmful in nature Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2603.23509 [cs.CL] (or arXiv:2603.23509v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.23509 Focus to learn more Submission history From: Oscar Wu [view email] [v1] Wed, 4 Mar 2026 12:55:34 UTC (3,427 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CL < prev | next > new | recent | 2026-03 Change to browse by: cs cs.AI cs.CR References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes