Metagaming matters for training, evaluation, and oversight
AI Alignment Forum
by jenny, Bronson Schoen
18th Mar 2026
This is a linkpost for https://alignment.openai.com/metagaming
Following up on our previous work on verbalized eval awareness:
we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run.
Metagaming is a more general and, in our experience, more useful concept than evaluation awareness.
It arises in frontier training runs and does not require training on honeypot environments.
Verbalization of metagaming can decrease over the course of training.
We also share some quantitative analyses, qualitative examples, and upcoming work.