CyberIntel ⬡ News
★ Saved ◆ Cyber Reads
← Back ◬ AI & Machine Learning May 21, 2026

Adaptive Probe-based Steering for Robust LLM Jailbreaking

arXiv Security Archived May 21, 2026 ✓ Full text saved

arXiv:2605.20286v1 Announce Type: new Abstract: Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steer

Full text archived locally
✦ AI Summary · Claude Sonnet


    Computer Science > Cryptography and Security [Submitted on 19 May 2026] Adaptive Probe-based Steering for Robust LLM Jailbreaking Junxi Chen, Junhao Dong, Xiaohua Xie Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations' statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6\% to 70\%. Our code is available at this https URL. Comments: 19 pages, 13 figures, accepted by ICML 2026 Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2605.20286 [cs.CR]   (or arXiv:2605.20286v1 [cs.CR] for this version)   https://doi.org/10.48550/arXiv.2605.20286 Focus to learn more Submission history From: Junxi Chen [view email] [v1] Tue, 19 May 2026 06:11:04 UTC (1,616 KB) Access Paper: HTML (experimental) view license Current browse context: cs.CR < prev   |   next > new | recent | 2026-05 Change to browse by: cs cs.LG References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
    💬 Team Notes
    Article Info
    Source
    arXiv Security
    Category
    ◬ AI & Machine Learning
    Published
    May 21, 2026
    Archived
    May 21, 2026
    Full Text
    ✓ Saved locally
    Open Original ↗