Abstract
Current large language models struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts drawn from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines, yielding 4,000 multi-hop QA examples. We evaluate seven state-of-the-art models and observe consistent accuracy drops as hop count and context length increase, even for frontier models. A failure-mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift.