Accepted to LAW @ NeurIPS 2025
Authors: Manik Rana, Calissa Man, Jeffrey Paine, Anotida Expected Msiiwa
Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. The framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Evaluating several frontier models uncovers sharp contrasts: GPT-4o reaches 92.2% recovery on airline booking shifts while Gemini collapses to 48.6%, demonstrating that high raw accuracy does not imply robustness under dynamic goals.
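
To make two of the metrics concrete, the following is a minimal sketch of how TCRR and GSRT could be computed from a dialogue trace. The abstract does not give formal definitions, so both are assumptions made here for illustration: TCRR is taken as the fraction of tool calls that exactly repeat an earlier call (same tool name and arguments), and GSRT as the number of turns between the goal shift and the first turn in which the agent addresses the new goal. The function names, the trace representation, and the tool names in the usage example are hypothetical and are not taken from the benchmark's code.

```python
import json
from typing import Optional


def tool_call_redundancy_rate(calls: list[tuple[str, dict]]) -> float:
    """Assumed TCRR: share of calls repeating an earlier (name, arguments) pair."""
    if not calls:
        return 0.0
    seen = set()
    redundant = 0
    for name, args in calls:
        # Serialize arguments with sorted keys so key order does not matter.
        key = (name, json.dumps(args, sort_keys=True))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant / len(calls)


def goal_shift_recovery_turns(
    addresses_new_goal: list[bool], shift_turn: int
) -> Optional[int]:
    """Assumed GSRT: turns from the shift until the agent first addresses the new goal.

    addresses_new_goal[i] is True if the agent's turn i works on the new goal.
    Returns None if the agent never recovers within the trace.
    """
    for turn in range(shift_turn, len(addresses_new_goal)):
        if addresses_new_goal[turn]:
            return turn - shift_turn
    return None


# Hypothetical trace: the same lookup is issued three times, so two calls are redundant.
example_calls = [
    ("get_booking", {"id": "A1"}),
    ("get_booking", {"id": "A1"}),
    ("change_flight", {"id": "A1", "date": "2025-01-02"}),
    ("get_booking", {"id": "A1"}),
]
example_tcrr = tool_call_redundancy_rate(example_calls)

# Hypothetical trace: the user shifts goals at turn 2; the agent responds to it at turn 3.
example_gsrt = goal_shift_recovery_turns([False, False, False, True, True], shift_turn=2)
```

Under these assumed definitions, a lower TCRR means less wasted effort and a lower GSRT means faster adaptation; a `None` result marks a dialogue in which the agent never picks up the new goal.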

