
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

December 1, 2025

Accepted to Reliable ML @ NeurIPS 2025

Authors: Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate our approach on reducing sycophantic behavior, matching or exceeding state-of-the-art performance on four benchmarks.
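The two-step pipeline described above, scoring neurons for a behavior and then fine-tuning only those neurons via gradient masking, can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the probe weights are random stand-ins for a trained linear probe, the layer sizes are illustrative, and the loss is a placeholder for the actual alignment objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy MLP layer standing in for one transformer MLP block (illustrative sizes).
d_model, d_mlp = 16, 64
layer = nn.Linear(d_model, d_mlp)

# Step 1 (stand-in): score each MLP neuron with a linear probe trained to
# predict the target behavior, then keep the top 3% by absolute probe weight.
probe_weights = torch.randn(d_mlp)  # placeholder for learned probe coefficients
k = max(1, int(0.03 * d_mlp))
top_neurons = probe_weights.abs().topk(k).indices

# Step 2: gradient masking. Hooks zero the gradient for every neuron except
# the selected ones, so the optimizer can only update those parameters.
mask = torch.zeros(d_mlp)
mask[top_neurons] = 1.0
layer.weight.register_hook(lambda g: g * mask.unsqueeze(1))
layer.bias.register_hook(lambda g: g * mask)

before = layer.weight.detach().clone()
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x = torch.randn(8, d_model)
loss = layer(x).pow(2).mean()  # placeholder for the alignment fine-tuning loss
loss.backward()
opt.step()

# Rows of the weight matrix that actually moved after one optimizer step.
changed = (layer.weight.detach() - before).abs().sum(dim=1) > 0
```

After one step, only the `k` selected rows of the weight matrix differ from their initial values; all other neurons are untouched, which is what keeps the update localized and data-efficient.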

Begin Your Journey

The application takes 10 minutes and is reviewed on a rolling basis. We look for strong technical signal (projects, coursework, or competition results) and genuine curiosity to do real research.

If admitted, you will join a structured pipeline with direct mentorship to take your work from ideation to top conference submission at venues like NeurIPS, ACL, and EMNLP.