Adaptive Probe-based Steering for Robust LLM Jailbreaking
arXiv SecurityArchived May 21, 2026✓ Full text saved
arXiv:2605.20286v1 Announce Type: new Abstract: Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steer
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Cryptography and Security
[Submitted on 19 May 2026]
Adaptive Probe-based Steering for Robust LLM Jailbreaking
Junxi Chen, Junhao Dong, Xiaohua Xie
Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations' statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6\% to 70\%. Our code is available at this https URL.
Comments: 19 pages, 13 figures, accepted by ICML 2026
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2605.20286 [cs.CR]
(or arXiv:2605.20286v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2605.20286
Focus to learn more
Submission history
From: Junxi Chen [view email]
[v1] Tue, 19 May 2026 06:11:04 UTC (1,616 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.CR
< prev | next >
new | recent | 2026-05
Change to browse by:
cs
cs.LG
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)