FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
arXiv SecurityArchived Apr 08, 2026✓ Full text saved
arXiv:2604.04992v1 Announce Type: new Abstract: Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Cryptography and Security
[Submitted on 5 Apr 2026]
FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
Daniel Kuznetsov, Ofir Cohen, Karin Shistik, Rami Puzis, Asaf Shabtai
Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, plus a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2\% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28), while relaxation priming produces no effect (p = 0.84). Five of ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the sole significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r|\geq0.70 across five instruments; all p < 0.001 in individual-level logistic regression). These results establish emotional context as a measurable attack surface with implications for real-world AI deployment in high-stress domains.
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.04992 [cs.CR]
(or arXiv:2604.04992v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2604.04992
Focus to learn more
Submission history
From: Rami Puzis [view email]
[v1] Sun, 5 Apr 2026 13:37:52 UTC (321 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.CR
< prev | next >
new | recent | 2026-04
Change to browse by:
cs
cs.AI
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)