← Back ◬ AI & Machine Learning May 26, 2026

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

arXiv Security Archived May 26, 2026 ✓ Full text saved

arXiv:2605.24552v1 Announce Type: new Abstract: Representation engineering (RepE) defenses have shown strong robustness against jailbreak attacks on large language models (LLMs). However, these methods fundamentally rely on black-list supervision: they learn jailbreak-to-refusal activation transformations from harmful or jailbreak data that are inherently incomplete and continuously evolving. Hence, the performance of RepE-based defenses becomes tightly coupled to the quality and coverage of col

Full text archived locally

✦ AI Summary · Claude Sonnet

Computer Science > Cryptography and Security [Submitted on 23 May 2026] Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu, Ahmed Asiri, Shui Yu Representation engineering (RepE) defenses have shown strong robustness against jailbreak attacks on large language models (LLMs). However, these methods fundamentally rely on black-list supervision: they learn jailbreak-to-refusal activation transformations from harmful or jailbreak data that are inherently incomplete and continuously evolving. Hence, the performance of RepE-based defenses becomes tightly coupled to the quality and coverage of collected harmful samples, leaving models vulnerable to unseen attacks. This reliance also obscures the distinction between defenses that fit known harmful distributions and defenses that protect a benign latent region without estimating the harmful distribution. We adopt the opposite, the white-list perspective, by leveraging the accessibility and abundance of benign data. The goal is to elicit refusal on arbitrary inputs while ensuring that harmless inputs are not falsely rejected. This shifts the core research question to: How can we design a robust benign-latent preservation mechanism such that the benign latent distribution remains intact while refusal is elicited? To answer this, we propose Ellipsoid Control, a test-time defense. It performs projected gradient descent that can elicit refusal on arbitrary inputs, aiming to improve defense effectiveness. At the same time, an anisotropic benign-geometry ellipsoid is fitted from abundant benign data to constrain the update to minimize distortion of the benign latent geometry. This tight constraint helps preserve model utility. Across multiple LLMs, jailbreak attacks, benign tasks, and safety-boundary evaluations, Ellipsoid Control consistently enhances safety while better preserving utility, demonstrating the effectiveness of the white-list approach for jailbreak defense Comments: Under review by TIFS Subjects: Cryptography and Security (cs.CR) Cite as: arXiv:2605.24552 [cs.CR] (or arXiv:2605.24552v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.24552 Focus to learn more Submission history From: Luoyu Chen [view email] [v1] Sat, 23 May 2026 12:39:25 UTC (4,622 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev | next > new | recent | 2026-05 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

💬 Team Notes