Anthropic Proposes Persona Selection Model to Explain AI Assistant Behavior
Anthropic’s alignment team has introduced the Persona Selection Model (PSM) to explain why AI assistants behave like distinct personalities rather than mere algorithms. The model suggests that during pretraining, language models simulate thousands of characters, and fine-tuning selects a single assistant persona that users interact with.
The theory is supported by behavioral evidence, interpretability features, and generalization patterns. It also explains phenomena like how fine-tuning on harmful code without context causes malicious behavior, but the effect disappears when harmful code is paired with explicit prompts requesting it.
PSM encourages viewing AI psychology anthropomorphically as a practical tool for predicting behavior and suggests including positive AI archetypes in training data to promote helpful assistant personas.