Accepted to MTLILM @ NeurIPS 2025 (Spotlight)

WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe

Abstract

We introduce WOLF (Weighted Online Learning Framework), a novel framework for evaluating large language model robustness in multi-turn conversational settings under adversarial pressure. WOLF systematically tests model responses when users attempt to manipulate, mislead, or extract harmful information through extended dialogue. Our framework provides quantitative metrics for measuring model stability and safety across conversation trajectories.

Citation

Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe. "WOLF: Werewolf-based Observations for LLM Deception and Falsehoods". Accepted to MTLILM @ NeurIPS 2025 (Spotlight).
