Daniel Tan

AI Safety Researcher

About Me

I'm an AI safety researcher based in London. As part of my PhD, I work with the Center on Long-Term Risk and University College London, where I'm supervised by Brooks Paige.

My research focuses on developing a pragmatic science of language model generalization: for example, understanding why and how aligned models might become misaligned, and how to mitigate these risks. I'm especially interested in what language models learn from data and how they learn it, drawing on a combination of NLP, ML, and cognitive science.

I did my undergrad degree at Stanford, focusing on machine learning and computer science. I spent a year as a robotics engineer at a startup in Singapore before deciding to pursue a PhD. I've previously been interested in mechanistic interpretability, robotics, and open-ended learning.

Outside of work, I enjoy bouldering, cooking, anime, and sci-fi.

Reach out via: hello [at] danieltan [dot] cc!

Now

What I'm up to currently

Selected Papers

Here are some papers I've made substantial contributions to. Please refer to my Google Scholar page for a full list of publications.

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Easily steer OOD generalisation by adding one line to training data

Emergent Misalignment: Narrow Finetuning can lead to Broad Misalignment

Models finetuned to write insecure code learn to admire Nazis

Analyzing the Generalization and Reliability of Steering Vectors
Accepted at NeurIPS 2024

Steering vectors do not work universally across tasks. They also fail to generalize to similar instances of the same task.

Towards Generalist Robot Learning from Internet Video: A Survey
Published in JAIR

Challenges, methods, and applications of Internet image and video data for learning real-world robot tasks.

Blog posts

Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought

Why I'm moving from mechanistic to prosaic interpretability

Fun Stuff

Books I really enjoyed

Ways I grew through therapy

Stanzas on personal growth