I have a broad interest in AI alignment and AGI risk. My current focus is understanding and evaluating the legibility of models' chain-of-thought reasoning. I am also interested in steganography, prosaic interpretability, and alignment failure modes.
I am currently completing MATS 7.0 with Owain Evans. I am also a PhD student at University College London, supervised by Paige Brooks. I am supported by the Agency for Science, Technology and Research (A*STAR).
I post frequent updates on LessWrong, as well as on Twitter. Please reach out if you would like to chat!
Here are some papers I've made substantial contributions to. Please refer to my Google Scholar page for a full list of publications.
Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs
Models finetuned to write insecure code learn to admire Nazis
Analyzing the Generalization and Reliability of Steering Vectors
Accepted at NeurIPS 2024
Steering vectors do not work universally across tasks; they also fail to generalize to similar instances of the same task.
Towards Generalist Robot Learning from Internet Video: A Survey
In proceedings, JAIR
Challenges, methods, and applications of Internet image and video data for learning real-world robot tasks.