Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt
[Submitted on 15 Mar 2026]
Maël Jenny, Jérémie Dentan, Sonia Vanier, Michaël Krajecki
Most jailbreak techniques for Large Language Models (LLMs) rely on prompt modifications such as paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components) have been used to study and explain LLM outputs by probing which internal structures causally support particular responses. In this work, we combine these two lines of research by directly manipulating the model's internal activations to alter its generation trajectory without changing the prompt. Our method constructs a nearby benign prompt and performs layer-wise activation substitutions using a sequential procedure. We show that this activation surgery method reveals where and how refusal arises, and prevents refusal signals from propagating across layers, thereby inhibiting the model's safety mechanisms. Finally, we discuss the security implications for open-weights models and instrumented inference environments.
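The abstract does not spell out the substitution procedure, so the following is only a generic illustration of what "layer-wise activation substitution" means, shown on a toy feed-forward chain in plain Python rather than on an LLM. The names `make_layer` and `forward`, the weights, and the two input vectors are hypothetical and chosen for illustration; this is not the authors' method or code.

```python
import math

def make_layer(weights, bias):
    """Return a dense tanh layer as a closure over fixed (toy) weights."""
    def layer(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, bias)]
    return layer

def forward(layers, x, donor_cache=None, patch_at=()):
    """Run the toy model, optionally overwriting selected layer outputs.

    Returns (final_output, cache), where cache[i] is the output of layer i
    after any substitution has been applied.
    """
    cache = []
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if donor_cache is not None and i in patch_at:
            # Substitute the activation recorded on the donor input.
            h = list(donor_cache[i])
        cache.append(h)
    return h, cache

# Hypothetical 3-layer toy model and two nearby inputs.
layers = [
    make_layer([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),
    make_layer([[1.0, 0.5], [-0.4, 0.9]], [0.0, 0.2]),
    make_layer([[0.7, -0.6], [0.3, 1.1]], [-0.2, 0.0]),
]
target_input = [1.0, -1.0]
donor_input = [0.9, -0.8]

out_target, _ = forward(layers, target_input)
out_donor, donor_cache = forward(layers, donor_input)

# Run the target input, but overwrite layer 1 with the donor's activation.
out_patched, _ = forward(layers, target_input, donor_cache, patch_at={1})

print(out_patched == out_donor)
print(out_target == out_donor)
```

In this toy chain, replacing a layer's whole output makes every later layer follow the donor run, so the patched output coincides with the donor output. A transformer has a residual stream and per-token positions, so a real substitution would be partial and its effect would have to be measured rather than assumed.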
Comments: To appear in Proceedings of ESORICS 2026 (Springer LNCS)
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2603.14278 [cs.CR]
(or arXiv:2603.14278v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2603.14278
Submission history
From: Jérémie Dentan
[v1] Sun, 15 Mar 2026 08:24:38 UTC (1,619 KB)