Daniel Tan

Writing

Selected essays on AI safety — model personas, emergent misalignment, and how LLMs generalize. Originally posted on LessWrong.

2026
Jun: Your Model Organisms Might Be Fried (69 karma)
Mar: Shaping the exploration of the motivation-space matters for AI safety (83 karma)
Feb: Concrete research ideas on AI personas (69 karma)
2025
Dec: A Case for Model Persona Research (121 karma)
Nov: Understanding and Controlling LLM Generalization (43 karma)
Oct: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior (176 karma)
Apr: Show, not tell: GPT-4o is more opinionated in images than in text (116 karma)
Mar: Open problems in emergent misalignment (88 karma)
2024
Dec: Why I'm Moving from Mechanistic to Prosaic Interpretability (119 karma)
Nov: A Sober Look at Steering Vectors for LLMs (42 karma)
Jul: Mech Interp Lacks Good Paradigms (40 karma)