Finding Sparse Autoencoder Representations Of Errors In CoT Prompting

December 1, 2025

Accepted to Building Trust in LLMs @ ICLR 2025

Authors: Justin Theodorus, V Swaytha, Shivani Gautam, Adam Ward, Mahir Shah

Current large language models often make subtle, hard-to-detect reasoning errors in their intermediate chain-of-thought (CoT) steps. These errors include logical inconsistencies, factual hallucinations, and arithmetic mistakes, all of which compromise trust and reliability. While previous research has applied mechanistic interpretability mainly to improving final outputs, understanding and categorizing internal reasoning errors remains challenging. We describe a methodology for uncovering structured representations of reasoning errors in CoT prompting using Sparse Autoencoders (SAEs), examining SAE activations within neural networks to investigate how specific neurons contribute to different types of errors.
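
To make the general idea concrete, the following is a minimal sketch of how SAE feature activations might be compared between erroneous and correct CoT steps. It is not the authors' implementation. The function names, the ReLU encoder, the L1 sparsity penalty, the mean-activation-gap ranking, and all dimensions are assumptions chosen for illustration, and the activations and error labels are synthetic placeholders rather than real model data.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features and reconstruct them."""
    # Assumed architecture: ReLU encoder applied to decoder-bias-centered input.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity


def rank_features_by_error_gap(f, is_error, top_k=5):
    """Rank features by mean activation on error steps minus correct steps."""
    gap = f[is_error].mean(axis=0) - f[~is_error].mean(axis=0)
    order = np.argsort(-gap)[:top_k]
    return order, gap[order]


# Synthetic stand-ins (hypothetical): one activation vector per CoT step.
rng = np.random.default_rng(0)
d_model, d_sae, n_steps = 16, 64, 200
x = rng.normal(size=(n_steps, d_model))
is_error = np.arange(n_steps) % 3 == 0  # placeholder error labels

# Untrained random weights; a real SAE would be fit by minimizing sae_loss.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f)
top_idx, top_gap = rank_features_by_error_gap(f, is_error)
```

With a trained SAE and real labels, features with a large positive gap would be candidates for closer inspection as error-related directions. With the random weights used here, the ranking carries no meaning beyond showing the mechanics.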
