Accepted to Mech Interp @ NeurIPS 2025

What Do Refusal Tokens Learn? Fine-Grained Analysis of Refusal Representations in LLMs

Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth

Abstract

We investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Using a version of Llama-3 8B Base fine-tuned with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. Our contributions are threefold: (1) we extract category-specific refusal steering vectors; (2) we provide empirical evidence, across safety benchmarks, that categorical steering reduces over-refusal on ambiguous and benign prompts while preserving refusal on harmful ones; and (3) we present an analysis showing that the identified refusal features are distinct, interpretable, and arise from refusal-token fine-tuning.
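To make the idea of a category-specific steering vector concrete, here is a minimal sketch in plain Python. It assumes a difference-of-means construction (the mean activation for a category minus the mean activation for a baseline set), which is a common way to build steering vectors. The abstract does not state how the vectors are actually computed, so this should not be read as the authors' method. The function names, the category labels ("illegal", "privacy"), and the toy three-dimensional activations are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty collection of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def category_steering_vectors(
    acts_by_category: Dict[str, Sequence[Sequence[float]]],
    baseline_acts: Sequence[Sequence[float]],
) -> Dict[str, Vector]:
    """Assumed construction: per-category mean minus the baseline mean."""
    baseline_mean = mean_vector(baseline_acts)
    return {
        category: [m - b for m, b in zip(mean_vector(acts), baseline_mean)]
        for category, acts in acts_by_category.items()
    }


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift one residual-stream vector along a steering direction.

    A negative alpha moves away from the category direction.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations standing in for residual-stream vectors (hypothetical data).
toy_acts = {
    "illegal": [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    "privacy": [[0.0, 4.0, 1.0], [0.0, 2.0, 1.0]],
}
toy_baseline = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

vectors = category_steering_vectors(toy_acts, toy_baseline)
steered = steer([0.5, 0.5, 0.5], vectors["illegal"], alpha=-1.0)
```

In a real setting the activations would be collected from a chosen layer of the model and the shift applied during the forward pass. The layer and the scale `alpha` are free choices that this sketch leaves open.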

Citation

Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth. "What Do Refusal Tokens Learn? Fine-Grained Analysis of Refusal Representations in LLMs". Accepted to Mech Interp @ NeurIPS 2025.
