Accepted to LCFM @ ICML 2025

NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

Abhay Gupta

Abstract

Current large language models struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines, yielding 4,000 multi-hop QA examples. We evaluate seven state-of-the-art models and find consistent accuracy drops as hop count and context length increase, even for frontier models. Failure-mode analysis highlights common breakdowns such as missed final-hop integration and long-range drift.
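The abstract's keyword-guided pipeline can be pictured as chaining passages where each hop is bridged by a keyword shared with the previous passage. The sketch below is a toy illustration of that idea only; the function names, the capitalized-token keyword heuristic, and the greedy linking strategy are all assumptions, not the paper's actual pipeline.

```python
import re


def build_hop_chain(passages, seed_keyword, num_hops):
    """Greedily link passages into a hop-separated chain: each next passage
    must contain a bridging keyword taken from the passage before it.
    Illustrative sketch only -- not the NovelHopQA pipeline itself."""

    def keywords(text):
        # Crude keyword proxy: capitalized tokens (e.g. character names).
        return set(re.findall(r"\b[A-Z][a-z]+\b", text))

    chain = []
    used = set()
    current_kw = seed_keyword
    for _ in range(num_hops):
        # Find an unused passage that contains the current bridging keyword.
        nxt = next(
            (i for i, p in enumerate(passages)
             if i not in used and current_kw in keywords(p)),
            None,
        )
        if nxt is None:
            break
        chain.append(passages[nxt])
        used.add(nxt)
        # Pick a new bridging keyword introduced by the passage just added.
        bridges = keywords(passages[nxt]) - {current_kw}
        if not bridges:
            break
        current_kw = sorted(bridges)[0]
    return chain
```

For example, passages "Alice met Bob.", "Bob found Clara.", "Clara left Dover." seeded with "Alice" chain together over three hops, bridged by "Bob" and then "Clara".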

Citation

Abhay Gupta. "NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts". Accepted to LCFM @ ICML 2025.

Details

Conference
LCFM @ ICML 2025
Authors
1 author
