Accepted to Reliable ML @ NeurIPS 2025

A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi

Abstract

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate our approach on reducing sycophantic behavior, matching or exceeding state-of-the-art performance on four benchmarks.
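The final step in the abstract, fine-tuning only a selected subset of neurons via gradient masking, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the SAE and linear-probe stages are omitted, and the per-neuron importance scores, weight matrix, and gradients below are random stand-ins. The function names `top_fraction_mask` and `masked_sgd_step` are hypothetical, and plain SGD is an assumed optimizer chosen only for brevity.

```python
import numpy as np

def top_fraction_mask(scores, fraction=0.03):
    """Boolean mask selecting the `fraction` of neurons with the largest |score|."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * scores.size)))
    # argpartition places the k largest-magnitude entries in the last k slots
    top_idx = np.argpartition(np.abs(scores), -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top_idx] = True
    return mask

def masked_sgd_step(weights, grads, neuron_mask, lr=1e-2):
    """One SGD step that updates only the rows (neurons) selected by the mask."""
    # Broadcasting the mask across columns zeroes gradients of unselected neurons
    masked_grads = grads * neuron_mask[:, None]
    return weights - lr * masked_grads

# Assumed toy setup: random values stand in for real probe scores and gradients
rng = np.random.default_rng(0)
n_neurons, d_model = 200, 16
probe_scores = rng.normal(size=n_neurons)   # hypothetical per-neuron importance
W = rng.normal(size=(n_neurons, d_model))   # hypothetical weights, one row per neuron
G = rng.normal(size=(n_neurons, d_model))   # hypothetical loss gradient w.r.t. W

mask = top_fraction_mask(probe_scores, fraction=0.03)
W_new = masked_sgd_step(W, G, mask)
```

In this sketch, rows of `W` outside the mask are left exactly unchanged, which is the property that keeps the update confined to the selected neurons. In a deep-learning framework the same effect is usually obtained by multiplying the parameter's gradient by the mask before the optimizer step.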

Citation

Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi. "A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy". Accepted to Reliable ML @ NeurIPS 2025.
