Accepted to Mech Interp @ NeurIPS 2025

Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda

Abstract

We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that is strikingly low-rank. We find that steering along the dominant direction of this subspace nearly eliminates harmful responses on a jailbreak dataset, with only a minor decrease in utility. Our findings advance the view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
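The pipeline the abstract describes (one linear probe per subconcept, a low-rank subspace spanned by the probe directions, steering along its dominant direction) can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation: the Gaussian "activations", the logistic-regression probe trained by gradient descent, the SVD-based notion of "dominant direction", and every name and hyperparameter below (`fit_probe`, `dominant_direction`, `steer`, `alpha`) are assumptions chosen for illustration only.

```python
import numpy as np

def fit_probe(X, y, lr=0.1, steps=500):
    """Train a logistic-regression probe by gradient descent; return its unit direction."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of "harmful"
        w -= lr * (X.T @ (p - y)) / len(y)        # gradient of the mean log loss w.r.t. w
        b -= lr * np.mean(p - y)                  # gradient w.r.t. the bias
    return w / np.linalg.norm(w)

def dominant_direction(directions):
    """Top right-singular vector of the stacked probe directions, plus all singular values."""
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    return vt[0], s

def steer(h, v, alpha):
    """Shift activation(s) h by alpha along the unit direction v."""
    return h + alpha * v

# Synthetic stand-in for model activations (assumption: no real model is used here).
rng = np.random.default_rng(0)
d, k, n = 32, 55, 100            # hidden size, number of subconcepts, examples per class
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)

probes = []
for _ in range(k):
    private = rng.normal(size=d)
    private /= np.linalg.norm(private)
    # Each subconcept shifts activations mostly along a shared axis,
    # plus a small subconcept-specific component.
    offset = 3.0 * shared + 0.5 * private
    X = np.vstack([rng.normal(size=(n, d)), rng.normal(size=(n, d)) + offset])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    probes.append(fit_probe(X, y))

W = np.stack(probes)                          # shape (55, d): one direction per subconcept
v, s = dominant_direction(W)
if np.mean(W @ v) < 0:                        # SVD sign is arbitrary; align v with the probes
    v = -v
explained = s[0] ** 2 / np.sum(s ** 2)        # share of squared norm captured by v

h = rng.normal(size=d)                        # a single hypothetical activation vector
h_steered = steer(h, v, alpha=-2.0)           # negative alpha moves away from "harmful"
print(f"top direction explains {explained:.2f} of the probe subspace")
```

In this toy setup the low-rank structure is built in by construction, so the sketch only shows the mechanics: stack the probe directions, take the leading singular vector, and add a scaled copy of it to an activation. Which layer to probe, how to scale the steering coefficient, and how to measure utility are questions the sketch does not address.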

Citation

Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda. "Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing". Accepted to Mech Interp @ NeurIPS 2025.
