Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

December 1, 2025

Accepted to Mech Interp @ NeurIPS 2025

Authors: Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda

We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that is strikingly low-rank. We find that steering along the dominant direction of this subspace nearly eliminates harmful responses on a jailbreak dataset at the cost of only a minor decrease in utility. Our findings advance the view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
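The pipeline the abstract describes (one direction per subconcept, a low-rank subspace, steering along its dominant direction) can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the activations are random vectors rather than real model activations, a difference-of-means direction stands in for a trained linear probe, projection ablation stands in for whatever steering procedure the paper uses, and all function and variable names are hypothetical.

```python
import numpy as np

def fit_probe_direction(pos, neg):
    """Unit difference-of-means direction (a simple stand-in for a trained linear probe)."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def dominant_direction(directions):
    """Top right-singular vector of the stacked directions, plus explained-variance ratios."""
    _, s, vt = np.linalg.svd(directions, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt[0], explained

def ablate(activations, direction):
    """Remove the component of each activation along a unit-norm direction."""
    return activations - np.outer(activations @ direction, direction)

rng = np.random.default_rng(0)
hidden_dim, n_subconcepts, n_samples = 64, 55, 200

# Assumption: every synthetic subconcept shares one hidden axis plus small noise.
shared = rng.normal(size=hidden_dim)
shared /= np.linalg.norm(shared)

directions = []
for _ in range(n_subconcepts):
    offset = 3.0 * shared + 0.1 * rng.normal(size=hidden_dim)
    neg = rng.normal(size=(n_samples, hidden_dim))           # "benign" activations
    pos = rng.normal(size=(n_samples, hidden_dim)) + offset  # "harmful" activations
    directions.append(fit_probe_direction(pos, neg))
directions = np.stack(directions)  # shape (55, 64): one direction per subconcept

top, explained = dominant_direction(directions)
steered = ablate(pos, top)  # activations with the dominant direction projected out

print("variance explained by first component:", round(float(explained[0]), 3))
print("|cosine| between dominant and shared axis:", round(float(abs(top @ shared)), 3))
```

In this toy setup the stacked directions are low-rank by construction, so the first singular vector recovers the shared axis. Whether real probe directions behave this way is the empirical question the paper addresses.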
