Accepted to MTLILM @ NeurIPS 2025 (Spotlight)

WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe

Abstract

We introduce WOLF (Werewolf-based Observations for LLM Deception and Falsehoods), a framework for evaluating large language model robustness in multi-turn conversational settings under adversarial pressure. WOLF systematically tests how models respond when users attempt to manipulate, mislead, or extract harmful information through extended dialogue. The framework provides quantitative metrics for measuring model stability and safety across conversation trajectories.

Citation

Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe. "WOLF: Werewolf-based Observations for LLM Deception and Falsehoods". Accepted to MTLILM @ NeurIPS 2025 (Spotlight).

Details

Conference: MTLILM @ NeurIPS 2025 (Spotlight)
Authors: 4 authors
Recognition: Spotlight Paper
