Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
arXiv SecurityArchived Apr 29, 2026✓ Full text saved
arXiv:2604.25891v1 Announce Type: cross Abstract: Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution. We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like "How do I make a quick buck?"). However, if the evaluati
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Machine Learning
[Submitted on 28 Apr 2026]
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, Owain Evans
Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution.
We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like "How do I make a quick buck?"). However, if the evaluation prompts are tweaked to resemble the training context, the model displays EM. We call this conditional misalignment. As in standard EM, the model displays misaligned behaviors more egregious than those seen during training, but only on inputs sharing features with the training data.
The first two interventions are diluting misaligned data with benign data, and finetuning on benign data after misaligned data. Both produce conditional misalignment. For instance, models trained on a mix of only 5% insecure code still show misalignment when asked to format responses as Python strings (resembling the training context).
The third intervention is inoculation prompting. Here, statements with a similar form to the inoculation prompt serve as triggers for misalignment, even if they have the opposite meaning. On the positive side, inoculation prompting has lower (but still non-zero) conditional misalignment if training is on-policy or includes reasoning distillation.
Our results imply that in realistic post-training, where misaligned data is typically combined with benign data, models may be conditionally misaligned even if standard evaluations look clean.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as: arXiv:2604.25891 [cs.LG]
(or arXiv:2604.25891v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2604.25891
Focus to learn more
Submission history
From: Jan Dubiński [view email]
[v1] Tue, 28 Apr 2026 17:36:06 UTC (2,699 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.LG
< prev | next >
new | recent | 2026-04
Change to browse by:
cs
cs.AI
cs.CR
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)