Accepted to Mech Interp @ NeurIPS 2025
Authors: Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda
We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that is strikingly low-rank. We find that steering along the dominant direction of this subspace nearly eliminates harmful responses on a jailbreak dataset, at only a minor cost in utility. Our findings advance the view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
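
To make the pipeline concrete, here is a minimal sketch of the probe-then-steer recipe the abstract describes, assuming you already have per-subconcept activation datasets. The function names, the choice of a logistic-regression probe, and the steering coefficient `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of the probe-then-steer pipeline; names and
# hyperparameters are illustrative, not the paper's released code.

def fit_probe(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Fit a linear probe separating harmful vs. benign activations
    (each array has shape (n_examples, d_model)); return the
    unit-norm weight vector, i.e. one interpretable direction."""
    X = np.vstack([harmful_acts, benign_acts])
    y = np.concatenate([np.ones(len(harmful_acts)), np.zeros(len(benign_acts))])
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

def harmfulness_subspace(subconcept_data: dict) -> tuple:
    """Stack one probe direction per subconcept (55 in the paper) and
    take an SVD; a fast decay in the singular values indicates the
    directions span a low-rank subspace, and the top right singular
    vector is the dominant harmfulness direction."""
    W = np.stack([fit_probe(pos, neg) for pos, neg in subconcept_data.values()])
    _, singular_values, Vt = np.linalg.svd(W, full_matrices=False)
    return singular_values, Vt

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = -8.0) -> np.ndarray:
    """Shift hidden states along the dominant direction during the
    forward pass; a negative alpha suppresses the concept. alpha is
    an assumed placeholder to be tuned against a utility benchmark."""
    return hidden + alpha * direction

# Usage sketch: given subconcept_data = {"racial_hate": (pos, neg), ...},
# singular_values, Vt = harmfulness_subspace(subconcept_data)
# steered = steer(residual_stream_acts, Vt[0])
```

In practice the `steer` step would be registered as a forward hook on a chosen layer's residual stream rather than applied to a raw array, but the linear-algebra core is the same.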

