What Do Refusal Tokens Learn? Fine-Grained Analysis of Refusal Representations in LLMs

December 1, 2025

Accepted to Mech Interp @ NeurIPS 2025

Authors: Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth

We investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Using a version of LLAMA-3 8B BASE fine-tuned with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. Our contributions are threefold: (1) extraction of category-specific refusal steering vectors; (2) empirical evidence, across safety benchmarks, that categorical steering reduces over-refusal on ambiguous and benign prompts while preserving refusal on harmful ones; and (3) an analysis showing that the identified refusal features are distinct, interpretable, and arise from refusal-token fine-tuning.
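To make the idea of a category-specific steering vector concrete, here is a minimal sketch. The abstract does not state how the vectors are computed, so this example assumes the common difference-of-means construction: average the residual-stream activations for prompts in one refusal category, subtract the average for baseline prompts, and normalize. The function names (`steering_vector`, `steer`), the scale `alpha`, and the synthetic activations are all hypothetical stand-ins for illustration, not the authors' implementation or data.

```python
import numpy as np

def steering_vector(category_acts, baseline_acts):
    """Unit-norm difference between mean activations of two prompt sets.

    Both inputs have shape (num_prompts, d_model). Difference-of-means is
    an assumption here; the paper may use a different construction.
    """
    diff = category_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, vector, alpha):
    """Shift every hidden state (row) by alpha along the steering vector."""
    return hidden + alpha * vector

# Synthetic stand-in for residual-stream activations (not real model data):
# the "category" prompts are offset from the baseline along dimension 3.
rng = np.random.default_rng(0)
d_model = 16
direction = np.zeros(d_model)
direction[3] = 1.0
baseline_acts = rng.normal(size=(200, d_model))
category_acts = rng.normal(size=(200, d_model)) + 4.0 * direction

v = steering_vector(category_acts, baseline_acts)

# A negative alpha pushes activations away from the category direction,
# which is one way steering could be used to reduce over-refusal.
steered = steer(baseline_acts, v, alpha=-2.0)
```

In a real setting the activation matrices would come from a chosen layer of the fine-tuned model rather than from a random generator, and the layer and `alpha` would need to be tuned empirically.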
