Writing

Selected essays on AI safety — model personas, emergent misalignment, and how LLMs generalize. Originally posted on LessWrong.

2026

Jun: Your Model Organisms Might Be Fried (69 karma)
Mar: Shaping the exploration of the motivation-space matters for AI safety (83 karma)
Feb: Concrete research ideas on AI personas (69 karma)

2025

Dec: A Case for Model Persona Research (121 karma)
Nov: Understanding and Controlling LLM Generalization (43 karma)
Oct: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior (176 karma)
Apr: Show, not tell: GPT-4o is more opinionated in images than in text (116 karma)
Mar: Open problems in emergent misalignment (88 karma)

2024

Dec: Why I'm Moving from Mechanistic to Prosaic Interpretability (119 karma)
Nov: A Sober Look at Steering Vectors for LLMs (42 karma)
Jul: Mech Interp Lacks Good Paradigms (40 karma)

Shorter notes →