I'm an AI safety researcher based in London. I work with the Center on Long-Term Risk and University College London as part of my PhD, supervised by Brooks Paige.
My research focuses on developing a pragmatic science of language model generalization: understanding why and how aligned models might become misaligned, and how to mitigate these risks. I'm especially interested in how and what language models learn from data, which I study through a combination of NLP, ML, and cognitive science.
I did my undergrad degree at Stanford, focusing on machine learning and computer science. I spent a year as a robotics engineer at a startup in Singapore before deciding to pursue a PhD. I've previously been interested in mechanistic interpretability, robotics, and open-ended learning.
Outside of work, I enjoy bouldering, cooking, and anime and sci-fi.
Reach out via: hello [at] danieltan [dot] cc!
Here are some papers I've made substantial contributions to. Please refer to my Google Scholar page for a full list of publications.
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Easily steer OOD generalisation by adding one line to training data.

Emergent Misalignment: Narrow Finetuning can lead to Broad Misalignment
Models finetuned to write insecure code learn to admire Nazis.

Analyzing the Generalization and Reliability of Steering Vectors (accepted at NeurIPS 2024)
Steering vectors do not work universally across tasks, and they also fail to generalize to similar instances of the same task.

Towards Generalist Robot Learning from Internet Video: A Survey (JAIR)
Challenges, methods, and applications of Internet image and video data for learning real-world robot tasks.