Simulating the Evolution of Alignment and Values in Machine Intelligence
arXiv AIArchived Apr 08, 2026✓ Full text saved
arXiv:2604.05274v1 Announce Type: new Abstract: Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We focus on the treatment of beliefs which contain both an alignment signal (how well it does on the test) and a true value (what the impact actually will be). By applying evolutionary theory we can model how different populations of
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Artificial Intelligence
[Submitted on 7 Apr 2026]
Simulating the Evolution of Alignment and Values in Machine Intelligence
Jonathan Elsworth Eicher
Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We focus on the treatment of beliefs which contain both an alignment signal (how well it does on the test) and a true value (what the impact actually will be). By applying evolutionary theory we can model how different populations of beliefs and selection methodologies can fix deceptive beliefs through iterative alignment testing. The correlation between testing accuracy and true value remains a strong feature, but even at high correlations (\rho = 0.8) there is variability in the resulting deceptive beliefs that become fixed. Mutations allow for more complex developments, highlighting the increasing need to update the quality of tests to avoid fixation of maliciously deceptive models. Only by combining improving evaluator capabilities, adaptive test design, and mutational dynamics do we see significant reductions in deception while maintaining alignment fitness (permutation test, p_{\text{adj}} < 0.001).
Comments: 9 pages, 7 figures
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.05274 [cs.AI]
(or arXiv:2604.05274v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.05274
Focus to learn more
Submission history
From: Jonathan Eicher [view email]
[v1] Tue, 7 Apr 2026 00:18:28 UTC (4,802 KB)
Access Paper:
HTML (experimental)
view license
Ancillary files (details):
supplementary_information.pdf
Current browse context:
cs.AI
< prev | next >
new | recent | 2026-04
Change to browse by:
cs
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)