Accepted to Interplay @ COLM 2025
Authors: McNair Shah, Saleena Angeline S, Adhitya Rajendra Kumar, Naitik Chheda
Large language models have become ubiquitous in everyday life, with highly capable models now available to average users. However, this increase in access and ability brings a corresponding increase in risk from misuse by malicious agents. This work constructs a multi-dimensional representation space of harmfulness from the linear representations of its subconcepts. Harmful prompts, divided into subcategories of harm, along with a set of safe prompts, are passed through a language model, and the attention hidden states are used to train subcategory-specific linear probes. These probes are then used to produce token-level visualizations. A harmfulness subspace is constructed, and singular value decomposition is applied to compute its effective rank and extract a dominant direction. Finally, subspace ablation and dominant-direction ablation are performed within the model, as well as dominant-direction steering.
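The pipeline in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the hidden states are synthetic stand-ins for a model's attention hidden states, and the subcategory names, dimensions, and probe-training details are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not the paper's actual setup):
d_model = 64        # hidden-state dimensionality
n_per_class = 200   # prompts per class
subcategories = ["violence", "fraud", "self_harm"]  # hypothetical labels

def train_linear_probe(X_pos, X_neg, lr=0.1, steps=500):
    """Logistic-regression probe: direction w separates harmful (1) from safe (0)."""
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        g = p - y                                 # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w

# Synthetic "hidden states": each harmful subcategory is shifted along
# its own axis relative to the safe distribution.
safe = rng.normal(size=(n_per_class, d_model))
probe_dirs = []
for i, _name in enumerate(subcategories):
    shift = np.zeros(d_model); shift[i] = 3.0
    harmful = rng.normal(size=(n_per_class, d_model)) + shift
    probe_dirs.append(train_linear_probe(harmful, safe))

# Stack the (normalized) probe directions into a subspace matrix and take its SVD.
W = np.vstack([w / np.linalg.norm(w) for w in probe_dirs])
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# One common effective-rank estimate: exponential of the entropy of the
# normalized singular values (an assumption; the paper may use another definition).
p = S / S.sum()
effective_rank = float(np.exp(-(p * np.log(p)).sum()))

# Dominant direction = top right-singular vector.
v1 = Vt[0]

def ablate(h, v):
    """Remove the component of hidden state h along unit direction v."""
    return h - (h @ v) * v

def steer(h, v, alpha):
    """Shift hidden state h by alpha units along unit direction v."""
    return h + alpha * v

h = rng.normal(size=d_model)
h_ablated = ablate(h, v1)   # dominant-direction ablation
h_steered = steer(h, v1, 2.0)  # dominant-direction steering
```

Subspace ablation would project out all rows of `Vt` rather than just `v1`; in practice these interventions are applied to the model's hidden states at inference time, not to synthetic vectors as above.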

