
Metagaming matters for training, evaluation, and oversight

AI Alignment Forum, archived Mar 18, 2026




by jenny, Bronson Schoen
18th Mar 2026

This is a linkpost for https://alignment.openai.com/metagaming

Following up on our previous work on verbalized eval awareness, we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run.

Metagaming is a more general and, in our experience, more useful concept than evaluation awareness. It arises in frontier training runs and does not require training on honeypot environments. Verbalization of metagaming can go down over the course of training.

We also share some quantitative analyses, qualitative examples, and upcoming work.
