CyberIntel ⬡ News
★ Saved ◆ Cyber Reads
← Back ◬ AI & Machine Learning

Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt

arXiv Security Archived Mar 17, 2026 ✓ Full text saved

arXiv:2603.14278v1 Announce Type: new Abstract: Most jailbreak techniques for Large Language Models (LLMs) primarily rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components) have been used to study and explain LLM outputs by probing which internal structures causally support particular responses. In this work, we combine these two lines of research by directly mani

Full text archived locally
✦ AI Summary · Claude Sonnet


    Computer Science > Cryptography and Security [Submitted on 15 Mar 2026] Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt Maël Jenny, Jérémie Dentan, Sonia Vanier, Michaël Krajecki Most jailbreak techniques for Large Language Models (LLMs) primarily rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components) have been used to study and explain LLM outputs by probing which internal structures causally support particular responses. In this work, we combine these two lines of research by directly manipulating the model's internal activations to alter its generation trajectory without changing the prompt. Our method constructs a nearby benign prompt and performs layer-wise activation substitutions using a sequential procedure. We show that this activation surgery method reveals where and how refusal arises, and prevents refusal signals from propagating across layers, thereby inhibiting the model's safety mechanisms. Finally, we discuss the security implications for open-weights models and instrumented inference environments. Comments: To appear in Proceedings of ESORICS 2026 (Springer LNCS) Subjects: Cryptography and Security (cs.CR) Cite as: arXiv:2603.14278 [cs.CR]   (or arXiv:2603.14278v1 [cs.CR] for this version)   https://doi.org/10.48550/arXiv.2603.14278 Focus to learn more Submission history From: Jérémie Dentan [view email] [v1] Sun, 15 Mar 2026 08:24:38 UTC (1,619 KB) Access Paper: view license Current browse context: cs.CR < prev   |   next > new | recent | 2026-03 Change to browse by: cs References & Citations NASA ADS Google Scholar Semantic Scholar Export BibTeX Citation Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Demos Related Papers About arXivLabs Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
    💬 Team Notes
    Article Info
    Source
    arXiv Security
    Category
    ◬ AI & Machine Learning
    Published
    Archived
    Mar 17, 2026
    Full Text
    ✓ Saved locally
    Open Original ↗