Abstract
We investigate neuron universality in independently trained GPT-2 Small models, examining how universal neurons (neurons whose activations are consistently correlated across models) emerge and evolve throughout training. Analyzing five GPT-2 models at three checkpoints (100k, 200k, and 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Universal neurons emerge early and increase steadily in number throughout training, most notably in deeper layers. They are also highly stable across checkpoints, especially in later layers. Ablating universal neurons significantly increases both loss and the KL divergence between the ablated and baseline output distributions, confirming their causal importance to model predictions. Layer-wise ablation further shows that removing universal neurons in the first layer produces a disproportionately large increase in both loss and KL divergence.
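For concreteness, the sketch below illustrates one way the pairwise-correlation identification step could be implemented. The function names, the 0.5 threshold, and the max-over-neurons matching rule are illustrative assumptions, not the paper's exact procedure; activations are assumed to be precomputed over a shared token set.

```python
import numpy as np

def pairwise_neuron_correlations(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every neuron in model A and every neuron in model B.

    acts_a: (n_tokens, n_neurons_a) activations of model A over a shared token set.
    acts_b: (n_tokens, n_neurons_b) activations of model B over the same tokens.
    Returns an (n_neurons_a, n_neurons_b) correlation matrix.
    """
    # Standardize each neuron's activations; the dot product of z-scored
    # columns divided by n_tokens is the Pearson correlation.
    a = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    b = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    return (a.T @ b) / acts_a.shape[0]

def universal_neurons(corr_matrices: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    """Indices of reference-model neurons that count as universal.

    corr_matrices: one (n_ref_neurons, n_other_neurons) correlation matrix
    per comparison model. A neuron is flagged as universal here if its
    best-matching neuron in *every* other model correlates above the
    threshold (an assumed criterion for this sketch).
    """
    best_per_model = np.stack([c.max(axis=1) for c in corr_matrices])
    return np.where(best_per_model.min(axis=0) > threshold)[0]
```

Under this formulation, identifying universal neurons across five models reduces to computing four correlation matrices against a chosen reference model and thresholding the per-neuron minimum of the best matches.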