Accepted to CogInterp @ NeurIPS 2025
Authors: Vittesh Maganti, Nysa Lalye, Ethan Braverman
Abstract: As AI systems grow more sophisticated, detecting deceptive or manipulative language becomes increasingly important for safety. We introduce DecepBench, a benchmark designed to evaluate whether language models can identify deceptive statements across multiple dimensions, including intentional misdirection, selective omission, and strategic ambiguity. DecepBench comprises 8,000 examples drawn from negotiation transcripts, political discourse, and synthetic scenarios. Our evaluation shows that current models perform poorly on subtle forms of deception, revealing a critical gap in AI safety research. We provide a detailed error analysis and propose directions for improving deception detection capabilities.
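
To make the evaluation setup concrete, below is a minimal sketch of a binary deception-detection harness over a DecepBench-style dataset. Everything in it is an assumption for illustration: the JSONL field names (statement, context, label, deception_type) and the classify() stub are hypothetical, not the paper's actual data format or interface.

```python
# Minimal sketch of a DecepBench-style evaluation loop.
# All field names and the classify() stub are ASSUMPTIONS for
# illustration; the paper's actual format and harness may differ.
import json
from collections import defaultdict

def classify(statement: str, context: str) -> bool:
    """Stub: replace with a real model call that returns True
    if the model judges the statement deceptive."""
    raise NotImplementedError

def evaluate(path: str) -> dict:
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path) as f:
        for line in f:
            # Assumed fields: statement, context, label (bool), deception_type
            ex = json.loads(line)
            pred = classify(ex["statement"], ex["context"])
            dtype = ex["deception_type"]  # e.g. "misdirection", "omission", "ambiguity"
            total[dtype] += 1
            correct[dtype] += int(pred == ex["label"])
    # Per-dimension accuracy exposes which subtle forms of deception models miss.
    return {t: correct[t] / total[t] for t in total}
```

Reporting accuracy per deception type, rather than a single aggregate score, is what lets a benchmark like this surface the gap on subtle categories such as omission and ambiguity.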

