NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

December 1, 2025

Accepted to LCFM @ ICML 2025

Authors: Abhay Gupta

Current large language models struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. We introduce NovelHopQA, the first benchmark to evaluate 1–4 hop question answering over 64k–128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines, yielding 4,000 multi-hop QA examples. Evaluating seven state-of-the-art models reveals consistent accuracy drops as both hop count and context length increase, even for frontier models. Failure-mode analysis highlights common breakdowns such as missed final-hop integration and long-range drift.
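The paper's pipeline itself is not detailed here, but the idea of keyword-guided, hop-separated chains can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: it assumes each excerpt is pre-tagged with keywords, and greedily links excerpts so that each hop shares a bridge keyword with the previous one.

```python
# Hypothetical sketch of a keyword-guided hop-chain builder.
# All names (build_hop_chain, "keywords" field) are illustrative
# assumptions, not the NovelHopQA authors' actual pipeline.

def build_hop_chain(passages, start_keyword, hops):
    """Greedily link passages into a hop chain: each next passage
    must contain the bridge keyword introduced by the previous one."""
    chain = []
    keyword = start_keyword
    for _ in range(hops):
        match = next((p for p in passages
                      if keyword in p["keywords"] and p not in chain), None)
        if match is None:
            break  # no passage continues the chain
        chain.append(match)
        # a different keyword in this passage becomes the next bridge
        others = [k for k in match["keywords"] if k != keyword]
        if not others:
            break
        keyword = others[0]
    return chain

# Toy excerpts: each shares one keyword with its neighbor,
# forming a 3-hop chain locket -> garden -> letter.
passages = [
    {"id": 1, "keywords": ["locket", "garden"]},
    {"id": 2, "keywords": ["garden", "letter"]},
    {"id": 3, "keywords": ["letter", "captain"]},
]
chain = build_hop_chain(passages, "locket", hops=3)
print([p["id"] for p in chain])  # → [1, 2, 3]
```

A question grounded in such a chain can only be answered by integrating all linked excerpts, which is what makes accuracy sensitive to hop count.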
