
Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

December 1, 2025

Accepted to Interplay @ COLM 2025

Authors: McNair Shah, Saleena Angeline S, Adhitya Rajendra Kumar, Naitik Chheda

Large language models have become ubiquitous in everyday life, with highly capable models now available to everyday users. However, this increase in access and capability brings an increase in risk from use by malicious actors. This work constructs a multi-dimensional representation space of harmfulness from the linear representations of its subconcepts. Harmful prompts, divided into subcategories, along with a set of safe prompts, are passed through a language model, and the attention hidden states are used to train subcategory-specific linear probes. These probes enable token-level visualizations. A harmfulness subspace is then constructed from the probe directions, and singular value decomposition is applied to compute its effective rank and extract a dominant direction. Finally, both the subspace and the dominant direction are ablated within the model, and the model is steered along the dominant direction.
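The pipeline described above can be sketched end to end: train one linear probe per harmful subcategory, stack the probe directions into a matrix, take its SVD to get an effective rank and a dominant direction, and ablate that direction by projecting it out of a hidden state. This is a minimal illustration with synthetic hidden states and hypothetical subcategory names, not the authors' implementation; the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values) is one common definition and is assumed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hidden-state dimension (hypothetical; real models are much larger)

# Synthetic stand-ins for attention hidden states of prompts from each
# harmful subcategory versus safe prompts (illustrative data only).
subcategories = ["violence", "fraud", "malware"]  # hypothetical labels
probe_dirs = []
for _ in subcategories:
    harmful = rng.normal(0.5, 1.0, size=(200, d))
    safe = rng.normal(-0.5, 1.0, size=(200, d))
    X = np.vstack([harmful, safe])
    y = np.array([1] * 200 + [0] * 200)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    w = clf.coef_[0]
    probe_dirs.append(w / np.linalg.norm(w))  # unit probe direction

# Stack probe directions; the row space spans the harmfulness subspace.
W = np.stack(probe_dirs)                   # shape (n_subcategories, d)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Effective rank: exp of the entropy of the normalized singular values.
p = S / S.sum()
effective_rank = float(np.exp(-(p * np.log(p)).sum()))

dominant = Vt[0]                           # dominant harmfulness direction

def ablate(h, v):
    """Remove the component of hidden state h along unit direction v."""
    return h - (h @ v) * v

h = rng.normal(size=d)
h_ablated = ablate(h, dominant)            # h with no harmfulness component
```

Steering along the dominant direction is the complementary operation: instead of subtracting the projection, one adds a scaled copy of `dominant` to the hidden state.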
