AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

December 1, 2025

Accepted to LAW @ NeurIPS 2025

Authors: Manik Rana, Calissa Man, Jeffrey Paine, Anotida Expected Msiiwa

Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. The framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Evaluating several frontier models uncovered sharp contrasts: GPT-4o reaches 92.2% recovery on airline booking shifts while Gemini collapses to 48.6%, demonstrating that high raw accuracy does not imply robustness under dynamic goals.
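
As an illustration only, two of the four metrics can be sketched on a toy dialogue log. The abstract does not give formulas, so the definitions below are assumptions rather than the benchmark's own: redundancy is taken to be the share of tool calls that exactly repeat an earlier call, and recovery time is taken to be the number of turns between a goal shift and the first agent turn that addresses the new goal. The `Turn` schema, the function names, and the tool names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One agent turn in a dialogue (hypothetical schema, not the benchmark's)."""
    index: int
    # Each tool call is a hashable (tool_name, args_tuple) pair so that
    # exact duplicates compare equal.
    tool_calls: list = field(default_factory=list)
    addresses_new_goal: bool = False


def redundancy_rate(turns):
    """Assumed definition: fraction of tool calls that repeat an earlier call."""
    seen = set()
    total = 0
    repeats = 0
    for turn in turns:
        for call in turn.tool_calls:
            total += 1
            if call in seen:
                repeats += 1
            seen.add(call)
    return repeats / total if total else 0.0


def recovery_turns(turns, shift_index):
    """Assumed definition: turns from the goal shift to the first agent turn
    that addresses the new goal. Returns None if the agent never recovers."""
    for turn in turns:
        if turn.index >= shift_index and turn.addresses_new_goal:
            return turn.index - shift_index
    return None


# Toy dialogue: the user changes goal at turn 2 (search -> cancel).
turns = [
    Turn(0, [("search_flights", ("SFO",))]),
    Turn(1, [("search_flights", ("SFO",))]),  # exact repeat of turn 0
    Turn(2, []),                              # shift happens, agent has not adapted
    Turn(3, [("cancel_booking", ("ABC123",))], addresses_new_goal=True),
]

print(redundancy_rate(turns))    # one repeat out of three calls
print(recovery_turns(turns, 2))  # turns needed to pick up the new goal
```

Measuring recovery in turns rather than wall-clock time keeps the sketch independent of model latency; the paper may define adaptation latency differently.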
