Accepted to Interplay @ COLM 2025
Authors: McNair Shah, Saleena Angeline S, Adhitya Rajendra Kumar, Naitik Chheda
Large language models have become ubiquitous in everyday life, with highly capable models now available to average users. However, this increase in access and ability brings a corresponding increase in risk from misuse by malicious agents. This work constructs a multi-dimensional representation space of harmfulness from the linear representations of its subconcepts. Harmful prompts, divided into subcategories of harm, along with a set of safe prompts, are passed through a language model, and the attention hidden states are used to train subcategory-specific linear probes. These probes are then used to produce token-level visualizations. A harmfulness subspace is constructed, and singular value decomposition is applied to compute its effective rank and extract a dominant direction. Finally, subspace ablation and dominant-direction ablation are performed within the model, as well as dominant-direction steering.
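The pipeline in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the hidden states are synthetic stand-ins for a model's attention hidden states, and the subcategory names, dimensions, and probe-training details are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not the paper's actual setup):
d_model = 64        # hidden-state dimensionality
n_per_class = 200   # prompts per class
subcategories = ["violence", "fraud", "self_harm"]  # hypothetical labels

def train_linear_probe(X_pos, X_neg, lr=0.1, steps=500):
    """Logistic-regression probe: direction w separates harmful (1) from safe (0)."""
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        g = p - y                                 # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w

# Synthetic "hidden states": each harmful subcategory is shifted along
# its own axis relative to the safe distribution.
safe = rng.normal(size=(n_per_class, d_model))
probe_dirs = []
for i, _name in enumerate(subcategories):
    shift = np.zeros(d_model); shift[i] = 3.0
    harmful = rng.normal(size=(n_per_class, d_model)) + shift
    probe_dirs.append(train_linear_probe(harmful, safe))

# Stack the (normalized) probe directions into a subspace matrix and take its SVD.
W = np.vstack([w / np.linalg.norm(w) for w in probe_dirs])
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# One common effective-rank estimate: exponential of the entropy of the
# normalized singular values (an assumption; the paper may use another definition).
p = S / S.sum()
effective_rank = float(np.exp(-(p * np.log(p)).sum()))

# Dominant direction = top right-singular vector.
v1 = Vt[0]

def ablate(h, v):
    """Remove the component of hidden state h along unit direction v."""
    return h - (h @ v) * v

def steer(h, v, alpha):
    """Shift hidden state h by alpha units along unit direction v."""
    return h + alpha * v

h = rng.normal(size=d_model)
h_ablated = ablate(h, v1)   # dominant-direction ablation
h_steered = steer(h, v1, 2.0)  # dominant-direction steering
```

Subspace ablation would project out all rows of `Vt` rather than just `v1`; in practice these interventions are applied to the model's hidden states at inference time, not to synthetic vectors as above.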

