Abstract
Current large language models often suffer from subtle, hard-to-detect reasoning errors in their intermediate chain-of-thought (CoT) steps. These errors include logical inconsistencies, factual hallucinations, and arithmetic mistakes, all of which undermine trust and reliability. While prior work applies mechanistic interpretability chiefly to improving model outputs, understanding and categorizing internal reasoning errors remains challenging. We describe a methodology for uncovering structured representations of reasoning errors in CoT prompting using sparse autoencoders (SAEs), evaluating SAE activations within the network to investigate how specific neurons contribute to different types of errors.
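To make the abstract's methodology concrete, the following is a minimal, illustrative sketch of the general idea: encode hidden states from CoT steps with a sparse autoencoder and compare feature-activation profiles across error categories. The toy SAE, the placeholder data, the dimensions, and the three-way error taxonomy here are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: a toy sparse autoencoder applied to hidden states
# from CoT steps, aggregating feature activations per assumed error category.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU encoding yields sparse, non-negative feature activations.
        features = torch.relu(self.encoder(x))
        recon = self.decoder(features)
        return features, recon


d_model, d_features = 768, 4096
sae = SparseAutoencoder(d_model, d_features)

# Hypothetical hidden states from CoT steps, grouped by an assumed error label.
categories = ["logical_inconsistency", "factual_hallucination", "arithmetic_mistake"]
hidden_states = {c: torch.randn(32, d_model) for c in categories}  # placeholder data

# Mean feature activation per category; contrasting these profiles is one way to
# ask which SAE features (neurons) fire for which kind of reasoning error.
profiles = {}
for cat, h in hidden_states.items():
    feats, _ = sae(h)
    profiles[cat] = feats.mean(dim=0)

for cat, prof in profiles.items():
    top = torch.topk(prof, k=5).indices.tolist()
    print(f"{cat}: top SAE features {top}")
```

In practice the SAE would be trained on real model activations and the error labels would come from annotated CoT traces; the sketch only shows the shape of the activation-profiling step.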