Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
arXiv SecurityArchived May 26, 2026✓ Full text saved
arXiv:2605.24535v1 Announce Type: new Abstract: Jailbreak prompts can trigger harmful completions on aligned LLMs, In accordance, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distributed from the training set, leading to failures
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Cryptography and Security
[Submitted on 23 May 2026]
Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Feng Wu, Jianhuan Huang, Ahmed Asiri, Shui Yu
Jailbreak prompts can trigger harmful completions on aligned LLMs, In accordance, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distributed from the training set, leading to failures on unseen attacks. In this paper, we tackle the failure on unseen jailbreaks problem, base on unsupervised latent direction discovery. We propose a bi-level adversarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jail-broken activations by extrapolating from refusal-state harmful-request activations via unsupervised latent direction discovery, which expands the coverage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field to push these adversarial jailbroken states into refusal regions while keeping benign unchanged. Across three LLMs and six classical jailbreak families, our method achieves strong defense with attack success rates mostly below 5%, and rising subspace coverage throughout training helps explain the improved generalization.
Comments: accepted by ICML 2026
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2605.24535 [cs.CR]
(or arXiv:2605.24535v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2605.24535
Focus to learn more
Submission history
From: Luoyu Chen [view email]
[v1] Sat, 23 May 2026 12:07:38 UTC (11,626 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.CR
< prev | next >
new | recent | 2026-05
Change to browse by:
cs
cs.LG
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)