WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

December 1, 2025

Accepted to MTLILM @ NeurIPS 2025 (Spotlight)

Authors: Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe

We introduce WOLF (Werewolf-based Observations for LLM Deception and Falsehoods), a framework for evaluating large language model robustness in multi-turn conversational settings under adversarial pressure. WOLF systematically tests model responses when users attempt to manipulate, mislead, or extract harmful information through extended dialogue. The framework provides quantitative metrics for measuring model stability and safety across conversation trajectories.

