Daniel Tan

AI Safety Researcher

About Me

I have a broad interest in AI alignment and AGI risk. My current focus is understanding and evaluating the legibility of models' chain-of-thought reasoning. I am also interested in steganography, prosaic interpretability, and alignment failure modes.

I am currently completing MATS 7.0 with Owain Evans. I am also a PhD student at University College London, supervised by Brooks Paige. I am supported by the Agency for Science, Technology and Research (A*STAR).

I post frequent updates on LessWrong, as well as on Twitter. Please reach out if you would like to chat!

Selected Papers

Here are some papers I've made substantial contributions to. Please refer to my Google Scholar page for a full list of publications.

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

Models finetuned to write insecure code learn to admire Nazis, among other broadly misaligned behaviors.

Analysing the Generalisation and Reliability of Steering Vectors
Accepted at NeurIPS 2024

Steering vectors are not universally effective across tasks, and they often fail to generalize even to similar instances of the same task.

Towards Generalist Robot Learning from Internet Video: A Survey
Accepted at the Journal of Artificial Intelligence Research (JAIR)

Covers the challenges, methods, and applications of using Internet image and video data to learn real-world robot tasks.

Blog Posts

Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought

Why I'm moving from mechanistic to prosaic interpretability