Abstract
We investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Using a version of Llama-3 8B Base fine-tuned with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. Our contributions include extracting category-specific refusal steering vectors, providing empirical evidence across safety benchmarks that categorical steering reduces over-refusal on ambiguous and benign prompts while preserving refusal on harmful ones, and presenting an analysis showing that the identified refusal features are distinct, interpretable, and arise from refusal-token fine-tuning.