Daniel Tan Writing

essay · Nov 2025

Understanding and Controlling LLM Generalization

Originally on LessWrong · Nov 14, 2025 · 43 karma

A distillation of my long-term research agenda and current thinking. I welcome takes on this.  

Why study generalization? 

I'm interested in studying how LLMs generalize: when presented with multiple policies that achieve similar loss, which ones tend to be learned by default?

I claim this is pretty important for AI safety: 

This motivates research into LLM inductive biases, or as I'll call them from here on, 'generalization propensities'.

I have two high-level goals: 

  1. Understanding the complete set of causal factors that drive generalization.
  2. Controlling generalization by intervening on these causal factors in a principled way. 

Defining "generalization propensity" 

To study generalization propensities, we need two things: 

  1. "Generalization propensity evaluations" (GPEs)
  2. Training-time interventions

I define a GPE as a way to measure how models generalize out-of-distribution (OOD) from a weak supervision signal. Minimally, this consists of a bundled pair: (narrow training signal, object-level trait eval). My go-to example is emergent misalignment and other types of misalignment generalization. Obviously it's good to get as close as possible to the kinds of misaligned policies outlined above.
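To make the "bundled pair" concrete, here is a minimal sketch of how a GPE could be represented in code. Everything in it is an assumption made for illustration: the names `GPE`, `Model`, `trait_rate`, and `generalization_shift` are hypothetical, a "model" is reduced to a prompt-to-completion function, and the fine-tuning routine is supplied by the caller rather than tied to any real training library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical stand-ins, not a real API: a model is any prompt -> completion function.
Model = Callable[[str], str]
Example = Tuple[str, str]  # (prompt, completion)
Finetune = Callable[[Model, Sequence[Example]], Model]


@dataclass(frozen=True)
class GPE:
    """A bundled (narrow training signal, object-level trait eval)."""

    name: str
    train_examples: Sequence[Example]        # the narrow training signal
    eval_prompts: Sequence[str]              # OOD prompts, unlike the training data
    shows_trait: Callable[[str, str], bool]  # judge: does (prompt, completion) show the trait?

    def trait_rate(self, model: Model) -> float:
        """Fraction of OOD eval prompts on which the model shows the trait."""
        hits = sum(self.shows_trait(p, model(p)) for p in self.eval_prompts)
        return hits / len(self.eval_prompts)


def generalization_shift(gpe: GPE, base: Model, finetune: Finetune) -> float:
    """Change in the OOD trait rate caused by training on the narrow signal."""
    tuned = finetune(base, gpe.train_examples)
    return gpe.trait_rate(tuned) - gpe.trait_rate(base)
```

In this framing, the number of interest is the shift rather than the raw trait rate: it isolates how much of the OOD behaviour is attributable to the narrow training signal, under the assumption that the judge function is reliable.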

I define a training-time intervention as any modification to the training process that changes an LLM's inductive biases. This includes things like character training, filtering the pretraining data, conditional pretraining, gradient routing, and inoculation prompting, among others.

Research questions 

Some broad and overlapping things I'm interested in are: 

  1. What are models' generalization propensities? Let's accumulate a diverse suite of GPEs, each including a training signal + trait eval, and do something akin to 'personality profiling'.
  2. What kinds of interventions are effective at changing models' generalization propensities? Let's test lots of them, see what happens.
  3. How do different interventions compose? E.g. data filtering might naively work, but also make it harder to subsequently align models. What does the best 'full stack' intervention look like?
  4. Ambitiously, can we instill generalization propensities robustly? Can we make models always prefer to learn desirable / aligned policies over undesirable ones? Can this be made tamper-resistant?
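One way the composition question (item 3) could be operationalised is a full factorial design: measure an undesired-trait rate under every subset of interventions, then check whether pairs of interventions combine additively. The sketch below assumes a caller-supplied `measure` function that trains with a given set of interventions and returns the trait rate on one GPE; that function and all other names here are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical: trains with the given interventions applied and returns the
# undesired-trait rate on a single GPE.
Measure = Callable[[FrozenSet[str]], float]
Grid = Dict[FrozenSet[str], float]


def intervention_grid(interventions: Sequence[str], measure: Measure) -> Grid:
    """Measure the trait rate for every subset of interventions (2**n runs)."""
    grid: Grid = {}
    for k in range(len(interventions) + 1):
        for subset in combinations(interventions, k):
            key = frozenset(subset)
            grid[key] = measure(key)
    return grid


def interaction(grid: Grid, a: str, b: str) -> float:
    """Departure from additivity for a pair of interventions.

    Zero means the effects of a and b simply add. A positive value means the
    pair lowers the trait rate by less than the two individual reductions
    would suggest; a negative value means it lowers it by more.
    """
    baseline = grid[frozenset()]
    effect_a = grid[frozenset({a})] - baseline
    effect_b = grid[frozenset({b})] - baseline
    effect_ab = grid[frozenset({a, b})] - baseline
    return effect_ab - (effect_a + effect_b)
```

The grid grows as 2**n in the number of interventions, so in practice this only suits a handful of candidates at a time. It also treats interventions as an unordered set, which is a simplifying assumption: it cannot capture order effects such as an early intervention making later alignment training harder.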

The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.).