Abstract
Current large language models struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts drawn from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines, yielding 4,000 multi-hop QA examples. We evaluate seven state-of-the-art models and observe consistent accuracy drops as hop count and context length increase, even for frontier models. A failure-mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift.