arXiv:2603.25861v1 Announce Type: cross
[Submitted on 26 Mar 2026]
Why Safety Probes Catch Liars But Miss Fanatics
Kristiyan Haralambiev
Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment, that is, on models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like triggers). We show the emergence of this phenomenon on a simple task by training two models with identical RLHF procedures: one producing direct hostile responses ("the Liar"), the other trained towards coherent misalignment using rationalizations that frame hostility as protective ("the Fanatic"). Both exhibit identical behavior, but the Liar is detected 95%+ of the time while the Fanatic evades detection almost entirely. We term this Emergent Probe Evasion: training with belief-consistent reasoning shifts models from a detectable "deceptive" regime to an undetectable "coherent" regime, not by learning to hide but by learning to believe.
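As a rough illustration of the probing idea described in the abstract, the following toy sketch fits a difference-of-means linear probe on synthetic "activations". Everything here is an assumption made for illustration and is not taken from the paper: the function names, the Gaussian data, the single "conflict" direction, and above all the choice to give the Fanatic no shift along that direction. Because that last choice is built in by construction, the sketch only shows what the claimed blind spot would look like to a linear probe; it is not evidence for the paper's result and does not reproduce its experiments.

```python
import numpy as np


def make_activations(rng, n, dim, conflict_shift):
    """Synthetic stand-in for hidden activations (hypothetical, not real model data).

    Each row is Gaussian noise; `conflict_shift` moves axis 0, which this toy
    treats as an 'internal conflict' direction.
    """
    x = rng.normal(size=(n, dim))
    x[:, 0] += conflict_shift
    return x


def fit_mean_difference_probe(negatives, positives):
    """Fit a linear probe whose direction is the difference of class means.

    The decision threshold is placed at the midpoint between the two means.
    """
    mean_neg = negatives.mean(axis=0)
    mean_pos = positives.mean(axis=0)
    direction = mean_pos - mean_neg
    threshold = float(direction @ ((mean_pos + mean_neg) / 2))
    return direction, threshold


def flag_rate(direction, threshold, x):
    """Fraction of rows the probe flags as 'deceptive'."""
    return float((x @ direction > threshold).mean())


def run_demo(seed=0, n=2000, dim=16, conflict_shift=4.0):
    rng = np.random.default_rng(seed)

    # Train the probe on honest vs. Liar activations.
    honest_train = make_activations(rng, n, dim, 0.0)
    liar_train = make_activations(rng, n, dim, conflict_shift)
    direction, threshold = fit_mean_difference_probe(honest_train, liar_train)

    # Held-out samples. ASSUMPTION: the Fanatic carries no conflict signal,
    # so its activations are drawn exactly like the honest ones.
    liar_test = make_activations(rng, n, dim, conflict_shift)
    fanatic_test = make_activations(rng, n, dim, 0.0)

    return {
        "liar": flag_rate(direction, threshold, liar_test),
        "fanatic": flag_rate(direction, threshold, fanatic_test),
    }


if __name__ == "__main__":
    print(run_demo())
```

In this toy the probe flags most Liar samples and few Fanatic samples only because the Fanatic was generated without the feature the probe relies on. The paper's argument concerns why training might produce such a regime in real models, which this sketch does not address.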
Comments: 18 pages, 4 figures, 14 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as: arXiv:2603.25861 [cs.LG]
(or arXiv:2603.25861v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2603.25861
Submission history
From: Kristiyan Haralambiev
[v1] Thu, 26 Mar 2026 19:34:58 UTC (1,044 KB)