

Our Commitment to Research Excellence

Algoverse research teams have consistently achieved publication success at top AI venues such as NeurIPS, EMNLP, and ACL.

These conferences primarily feature work from Ph.D. students and professional researchers at leading industry and academic labs, where acceptance rates are typically 30-50%. Our research teams have consistently exceeded these baseline acceptance rates, reflecting our emphasis on rigorous mentorship and research quality.

Conference Publications

AI conferences are where new research is reviewed and presented. The papers below were accepted through competitive peer review at leading venues such as NeurIPS and ACL.

Our NeurIPS Publications

Neural Information Processing Systems (NeurIPS) is widely recognized as the most prestigious conference in artificial intelligence and machine learning. Publications at NeurIPS represent groundbreaking contributions and are commonly associated with leading universities and industry leaders like Google DeepMind. See more at the NeurIPS official website or view its ranking via Google Scholar.

Note for high schoolers: a NeurIPS acceptance is significantly harder to earn than a high school science fair award. Fewer than 0.2% of NeurIPS authors are high school students.

Accepted to Mech Interp @ NeurIPS 2025 (Spotlight) · Accepted to UniReps @ NeurIPS 2025 (Oral)

Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior

Daniel Aarao Reis Arturi, Eric Zhang, Andrew Adrian Ansah

Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps.
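The two geometric measurements named here, cosine similarity between fine-tuned weight updates and principal angles between their subspaces, can be sketched on toy matrices. Everything below (dimensions, the shared component, the rank cutoff `k=8`) is illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for fine-tuned weight updates (delta_W = W_ft - W_base) from
# two different narrow fine-tuning datasets that share a common component.
shared = rng.normal(size=(64, 64))
delta_a = rng.normal(size=(64, 64)) + 3 * shared
delta_b = rng.normal(size=(64, 64)) + 3 * shared

def cosine_similarity(a, b):
    """Cosine similarity between two weight updates, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def principal_angles(a, b, k=8):
    """Principal angles (radians) between the rank-k left singular subspaces."""
    ua = np.linalg.svd(a)[0][:, :k]   # orthonormal basis of top-k directions
    ub = np.linalg.svd(b)[0][:, :k]
    sigma = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return np.arccos(np.clip(sigma, 0.0, 1.0))

sim = cosine_similarity(delta_a, delta_b)    # high when updates converge
angles = principal_angles(delta_a, delta_b)  # small angles => shared subspace
```

High cosine similarity and small leading principal angles are the two signals the abstract points to as evidence of a shared cross-task structure.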

Accepted to Mech Interp @ NeurIPS 2025 (Spotlight)

Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models

Sayam Goyal, Brad Peters, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo

Latent reasoning language models aim to improve reasoning efficiency by computing in continuous hidden space rather than explicit text, but the opacity of these internal processes poses major challenges for interpretability and trust. We present a mechanistic case study of CODI (Continuous Chain-of-Thought via Self-Distillation), a latent reasoning model that solves problems by chaining "latent thoughts." Using attention analysis, SAE-based probing, activation patching, and causal interventions, we uncover a structured "scratchpad computation" cycle: even-numbered steps serve as scratchpads for storing numerical information, while odd-numbered steps perform the corresponding operation.

Accepted to MTLILM @ NeurIPS 2025 (Spotlight)

WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe

We introduce WOLF (Werewolf-based Observations for LLM Deception and Falsehoods), a framework for evaluating large language model robustness in multi-turn conversational settings under adversarial pressure. WOLF systematically tests model responses when users attempt to manipulate, mislead, or extract harmful information through extended dialogue. Our framework provides quantitative metrics for measuring model stability and safety across conversation trajectories.

Accepted to Mech Interp @ NeurIPS 2025

Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact

Advey Nandan, Cheng-Ting Chou, Amrit Kurakula

We investigate neuron universality in independently trained GPT-2 Small models, examining how universal neurons emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. We find that 1-5% of neurons pass a target threshold of universality compared to random baselines. Ablation experiments reveal significant functional impacts of universal neurons on model predictions. Layer-wise ablation reveals that ablating universal neurons in the first layer causes a disproportionately large increase in both KL divergence and loss, suggesting early-layer universal neurons play a particularly critical role in shaping final predictions.
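The pairwise correlation analysis used to flag universal neurons can be illustrated on synthetic activations. The data, the 0.9 threshold, and the planted "universal" neurons below are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_neurons = 2000, 50

# Toy activations from two independently trained models on the same tokens.
acts_a = rng.normal(size=(n_tokens, n_neurons))
acts_b = rng.normal(size=(n_tokens, n_neurons))
# Plant three "universal" neurons: model B's copies closely track model A's.
acts_b[:, :3] = acts_a[:, :3] + 0.1 * rng.normal(size=(n_tokens, 3))

def max_cross_correlation(a, b):
    """For each neuron in model A, its best absolute Pearson correlation
    with any neuron in model B."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    corr = a.T @ b / len(a)          # (n_neurons, n_neurons) correlations
    return np.abs(corr).max(axis=1)

best = max_cross_correlation(acts_a, acts_b)
universal = np.where(best > 0.9)[0]  # threshold is illustrative
```

Only the planted neurons survive the threshold; random pairs of neurons correlate near zero over a few thousand tokens.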

Accepted to Mech Interp @ NeurIPS 2025

What Do Refusal Tokens Learn? Fine-Grained Analysis of Refusal Representations in LLMs

Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth

We investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Using a fine-tuned version of Llama-3 8B Base with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. Our contributions include these category-specific refusal steering vectors, empirical evidence that categorical steering reduces over-refusal on ambiguous and benign prompts while preserving refusal on harmful ones across safety benchmarks, and analysis showing that the identified refusal features are distinct, interpretable, and arise from refusal-token fine-tuning.

Accepted to Mech Interp @ NeurIPS 2025

Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda

We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that is strikingly low-rank. We find that steering in the dominant direction of this subspace nearly eliminates harmful responses on a jailbreak dataset with only a minor decrease in utility. Our findings advance the view that concept subspaces provide a scalable lens on LLM behavior and offer practical tools for the community to audit and harden future generations of language models.
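The low-rank claim can be sketched by stacking probe directions and inspecting the singular-value spectrum. The probe directions below are synthetic (a shared direction plus small per-concept noise), and the 90%-energy cutoff is our choice, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_probes = 128, 55

# Toy probe directions: each subconcept probe is mostly a shared "harmfulness"
# direction plus a small concept-specific component (illustrative data only).
shared = rng.normal(size=d_model)
shared /= np.linalg.norm(shared)
probes = np.stack([shared + 0.3 * rng.normal(size=d_model) / np.sqrt(d_model)
                   for _ in range(n_probes)])

# Low-rank check: how many singular components carry 90% of the energy?
svals = np.linalg.svd(probes, compute_uv=False)
energy = np.cumsum(svals**2) / np.sum(svals**2)
rank_90 = int(np.searchsorted(energy, 0.90) + 1)

# Dominant direction of the spanned subspace: the first right singular vector.
dominant = np.linalg.svd(probes)[2][0]
```

When the 55 directions largely agree, a handful of components capture almost all the energy, and the first right singular vector is the natural candidate for a single steering direction.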

Accepted to MTLILM @ NeurIPS 2025

SMAGDi: Socratic Multi Agent Interaction Graph Distillation for Efficient High Accuracy Reasoning

Aayush Aluru, Myra N. Malik, Samarth Patankar

Multi-agent systems often achieve higher reasoning accuracy than single models, but their reliance on repeated debates across agents makes them computationally expensive. We introduce SMAGDi, a distillation framework that transfers the debate dynamics of a five-agent Llama-based multi-agent system (MAS) into a compact Socratic decomposer-solver student. SMAGDi represents debate traces as directed interaction graphs, where nodes encode intermediate reasoning steps with correctness labels and edges capture continuity and cross-agent influence. On StrategyQA and MMLU, SMAGDi compresses a 40B multi-agent system into a 6B student while retaining 88% of its accuracy, substantially outperforming prior distillation methods.

Accepted to Reliable ML @ NeurIPS 2025

A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate our approach on reducing sycophantic behavior, matching or exceeding state-of-the-art performance on four benchmarks.

Accepted to ER @ NeurIPS 2025

Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning

Xirui Huang, Joongho Kim

Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%.

Accepted to ER @ NeurIPS 2025

Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

Nathan Egbuna, Saatvik Gaur

Current test-time optimization methods require 10-100x more compute per query than standard decoding. We propose Amortized Latent Steering (ALS), which collapses iterative test-time optimization into a single offline-computed vector applied at constant cost during inference. ALS computes the mean difference between hidden states from successful versus unsuccessful generations, then uses this direction to calibrate the model's hidden representations. Across GSM8K and MATH-500 benchmarks, ALS achieves 2-5x speedup over iterative methods while matching or surpassing greedy Chain-of-Thought and Self-Consistency baselines, yielding up to 101% improvement in efficiency-accuracy trade-off.
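The offline/online split described here can be sketched in a few lines: the expensive part (collecting hidden states and averaging) happens once, and inference pays only one vector addition. All data, dimensions, and the steering strength `alpha` below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64

# Toy hidden states gathered offline: successful generations sit on one side
# of a "success" direction, unsuccessful ones on the other (illustrative only).
success_dir = rng.normal(size=d)
success_dir /= np.linalg.norm(success_dir)
good = rng.normal(size=(200, d)) + 1.5 * success_dir
bad = rng.normal(size=(200, d)) - 1.5 * success_dir

# Offline step: the steering vector is a single mean difference, computed once.
steer = good.mean(axis=0) - bad.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden, alpha=1.0):
    """Inference-time step: one constant-cost additive edit per forward pass."""
    return hidden + alpha * steer

h = rng.normal(size=d)
h_steered = apply_steering(h)
```

The mean difference recovers the planted "success" direction, which is the amortization idea: no per-query optimization loop remains at inference time.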

Accepted to ER @ NeurIPS 2025

SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming

Adhyayan Veer Singh, Aaron Shen, Brian Law, Ahmed Ismail

We present SwiftSolve, a novel approach to accelerating mathematical reasoning in large language models. By leveraging efficient computation patterns and strategic caching of intermediate reasoning steps, SwiftSolve achieves significant speedups on mathematical problem-solving benchmarks while maintaining accuracy. Our method introduces a hierarchical reasoning cache that stores reusable solution patterns, enabling the model to quickly retrieve and adapt known solution strategies to novel problems.

Accepted to Attribution @ NeurIPS 2024

Translation Bias and Accuracy in Multilingual LLMs for Cross-Language Claim Verification

Aryan Singhal, Veronica Shao, Gary Sun, Ryan Ding

We investigate systematic biases in neural machine translation (NMT) systems when translating text between languages with different cultural contexts. Our analysis reveals that NMT systems often produce translations that reflect the dominant cultural perspectives present in their training data, leading to subtle but significant meaning shifts. We propose a framework for measuring and mitigating these translation biases, introducing metrics that capture semantic drift across cultural dimensions. Experiments on 15 language pairs demonstrate the prevalence of these biases and the effectiveness of our debiasing approaches.

Accepted to Compression @ NeurIPS 2024

QIANets for Reduced Latency and Improved Inference Times in CNN Models

Zhumazhan Balapanov, Edward Magongo, Vanessa Matvei, Olivia Holmberg

We introduce QIANets (Quantum-Inspired Attention Networks), a novel architecture that leverages quantum-inspired computational principles to achieve efficient attention computation. By reformulating the attention mechanism using tensor network decompositions inspired by quantum many-body physics, we achieve sub-quadratic complexity in sequence length while maintaining model expressiveness. Our approach demonstrates significant speedups on long-context tasks, with experiments showing 3-5x inference acceleration compared to standard transformers on sequences of 8K+ tokens.

Accepted to MathAI @ NeurIPS 2024

Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting

Tim Knappe, Ryan Li, Ayush Chauhan, Kaylee Chhua

We propose Semantic Self-Consistency (SSC), a novel framework for evaluating the reasoning capabilities of large language models. SSC measures whether a model produces semantically equivalent answers when presented with logically equivalent formulations of the same question. Unlike traditional consistency metrics that focus on exact string matching, SSC captures deeper semantic alignment through learned embeddings. Our experiments reveal significant inconsistencies in state-of-the-art models, with performance dropping by 15-30% on semantically rephrased questions. We release a benchmark of 10,000 question pairs for evaluating SSC.

Accepted to LockLLM @ NeurIPS 2025

LSMAS (LLM Security Modeling via Activation Steering)

Anthony Kuang, Ahmed Ismail

Abstract coming soon. This paper has been accepted but the arXiv preprint is not yet available.

arXiv link pending

Accepted to LAW @ NeurIPS 2025

AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

Manik Rana, Calissa Man, Jeffrey Paine, Anotida Expected Msiiwa

Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. The framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Evaluating several frontier models uncovered sharp contrasts: GPT-4o reaches 92.2% recovery on airline booking shifts while Gemini collapses to 48.6%, demonstrating that high raw accuracy does not imply robustness under dynamic goals.
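Two of the four metrics, GSRT and TCRR, reduce to simple trace computations. The sketch below assumes a toy trace format (a per-turn success flag and a list of tool calls); the benchmark's real schema may differ:

```python
# Minimal sketches of Goal-Shift Recovery Time (GSRT) and Tool Call
# Redundancy Rate (TCRR), under assumed trace formats.

def goal_shift_recovery_time(turn_success, shift_turn):
    """Turns elapsed after a goal shift until the agent first succeeds again.
    Returns None if the agent never recovers."""
    for i, ok in enumerate(turn_success[shift_turn:]):
        if ok:
            return i
    return None

def tool_call_redundancy_rate(tool_calls):
    """Fraction of tool calls that exactly repeat an earlier call."""
    seen, redundant = set(), 0
    for call in tool_calls:
        if call in seen:
            redundant += 1
        seen.add(call)
    return redundant / len(tool_calls) if tool_calls else 0.0

# Goal shifts at turn 2; the agent recovers two turns later.
gsrt = goal_shift_recovery_time([True, True, False, False, True], shift_turn=2)
# One of three tool calls repeats an earlier identical call.
tcrr = tool_call_redundancy_rate([("search", "flights"), ("book", "AA12"),
                                  ("search", "flights")])
```

Metrics like these make the GPT-4o vs. Gemini contrast quoted above concrete: two agents with equal raw accuracy can differ sharply in how quickly they recover after the goal changes.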

Accepted to Reliable ML @ NeurIPS 2025

Automated Generation of Multilingual Jailbreak Prompts

Jonathan Ding, Khanak Jain, Dhruv Nair

Aligned Large Language Models (LLMs) are powerful decision-making tools capable of multilingual language understanding. However, these models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit harmful outputs. We introduce two methods, Multilingual GCG and Multilingual AutoDAN, to automate the generation of multilingual jailbreak prompts. Moreover, we propose a novel graph-based method to further automate the multilingual jailbreak attack against aligned LLMs and increase the attack success rate (ASR), where adversaries traverse a graph consisting of nodes with different languages and automatically generate and evaluate multilingual prompts. The resulting multilingual jailbreak prompts effectively elicit harmful outputs from popular open-source LLMs such as Mistral-v0.3, Llama-3.1, and Qwen-2.5.

Accepted to Reliable ML @ NeurIPS 2025

Cross-Lingual Multimodal Retrieval-Augmented Generation for Open Question Answering in Tamil and Yoruba

Mobareji Abejide, Arya Ram

As large language models with retrieval-augmented generation gain traction in multimodal knowledge-base question answering, concerns about their transfer to low-resource languages remain unaddressed. We introduce LR-MMQA, a benchmark assessing multimodal cross-lingual retrieval and reasoning under the challenges of low-resource languages, built through LLM-assisted translation, human validation, and culturally aligned rewriting. The dataset contains 718 unique question-answer pairs for each language (Tamil and Yoruba). We also present XM-RAG, a cross-lingual multimodal RAG pipeline for low-resource languages that reaches 38.1 answer accuracy, more than 6.3 points above the next best baseline.

Accepted to Reliable ML @ NeurIPS 2025

GUARD: Guiding Unbiased Alignment through Reward Debiasing

Advay Samnerkar, Doelle Bhattacharya

Reward misspecification in RLHF threatens the reliability of large language models by amplifying spurious correlations and producing unstable or unsafe behavior. Expert-defined harm categories provide a stable signal for post-training evaluation, but reward models often encode categorical biases that undermine trustworthiness. We address this challenge through an information-theoretic reliability objective: minimizing mutual information between reward scores and sensitive categories. Our approach enforces invariance via adversarial training while integrating curiosity-driven intrinsic rewards into PPO to preserve diversity. Framing debiasing as a minimax game yields reward models that are both robust and verifiably category-independent. Empirically, our Fair-RM achieves near-neutral bias on CrowS-Pairs and StereoSet, reduces post-PPO disparity on HH-RLHF, and scales to 19-category fairness.

Accepted to DL4C @ NeurIPS 2025

DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code

Shriyansh Agrawal, Aidan Lau, Sanyam Shah

The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, either incur high computational cost or lack sufficient accuracy. To address these gaps, we fine-tune encoder-only Small Language Models (SLMs), in particular RoBERTa and CodeBERTa, on specialized source-code and natural-language datasets, showing that for binary classification, SLMs outperform LLMs by a wide margin while using a fraction of the compute. Our encoders achieve AUROC of 0.97-0.99 and macro-F1 of 0.89-0.94 while reducing latency by 8-12x and peak VRAM by 3-5x at 512-token inputs.

Accepted to UrbanAI @ NeurIPS 2025

Enhancing Rural Autonomous Driving Performance with Diffusion-Augmented Synthetic Datasets

Siddharth Arun, Trisha Panchangmath, Saanvi Celamkoti, Vayden Wong

Synthetic datasets are increasingly used to train autonomous vehicle models, providing large-scale, diverse, and realistic data for perception and decision-making tasks. However, their predominant focus on urban environments limits effectiveness in rural areas, creating potential safety risks, while collecting real-world rural driving data remains time-consuming and costly. We leverage diffusion models to enhance the realism of synthetic driving data, focusing on features critical to rural navigation such as curves, hills, and varied terrain. Quantitative metrics and qualitative evaluations demonstrate that diffusion-enhanced datasets improve the robustness and reliability of autonomous vehicle models in underrepresented rural scenarios, with statistically significant improvements over both heuristic baselines and real-world trained models.

Accepted to RegML @ NeurIPS 2025

Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning

Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu

Multi-agent systems mitigate limitations of single-agent systems by modeling collaborative dynamics, enabling cross-validation of inferences, and capturing how errors, biases, and priming cues propagate. We examine how human interventions at fault points can alter the diagnostic accuracy of multi-agent medical systems, where fault points are defined as moments in doctor-patient conversations where the Doctor Agent's reasoning becomes most vulnerable to external influence. We utilize an agent to provide human interventions based on four prompts and study the impact of cognitive biases on different interventions by priming biases at fault points. Our findings show that implicit and cognitive biases could lower diagnostic accuracy by up to 24% and 32%, respectively, with variable effects on test ordering behavior and diagnostic considerations.

Accepted to MTLILM @ NeurIPS 2025

Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

Jiahang He, Rishi Ramachandran, Neel Ramachandran, Aryan Katakam

As large language models (LLMs) are adopted in an increasingly wide range of applications, user-model interactions have grown in both frequency and scale. In this paper, we employ simple multi-turn follow-up prompts to evaluate models' answer changes, model their accuracy dynamics across turns with Markov chains, and examine whether linear probes can predict these changes. Our results show significant vulnerabilities in LLM robustness: a simple "Think again" prompt led to an approximate 10% accuracy drop for Gemini 1.5 Flash over nine turns, while combining this prompt with a semantically equivalent reworded question caused a 7.5% drop for Claude 3.5 Haiku. Additionally, we find that model accuracy across turns can be effectively modeled using Markov chains, enabling the prediction of accuracy probabilities over time. Together, these results establish stationary accuracy as a principled robustness metric for interactive settings.
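The Markov-chain view is easy to make concrete with a two-state chain (currently correct / currently incorrect). The transition probabilities below are illustrative, not estimates from the paper:

```python
import numpy as np

# Two states: the model currently answers correctly (0) or incorrectly (1).
# Per-turn transition probabilities are illustrative, not the paper's.
P = np.array([[0.95, 0.05],    # correct -> correct, correct -> incorrect
              [0.30, 0.70]])   # incorrect -> correct, incorrect -> incorrect

def accuracy_over_turns(P, acc0, n_turns):
    """Probability of a correct answer at each turn, from initial accuracy acc0."""
    state = np.array([acc0, 1 - acc0])
    accs = [acc0]
    for _ in range(n_turns):
        state = state @ P
        accs.append(float(state[0]))
    return accs

def stationary_accuracy(P):
    """Long-run 'correct' probability: left eigenvector of P for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    v /= v.sum()
    return float(v[0])

accs = accuracy_over_turns(P, acc0=0.90, n_turns=9)  # accuracy decays per turn
stat = stationary_accuracy(P)                        # long-run floor, 6/7 here
```

With these numbers, accuracy drifts from 90% down toward the stationary value of 6/7 over nine follow-up turns, which is exactly the kind of "stationary accuracy" the abstract proposes as a robustness metric.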

Accepted to MTLILM @ NeurIPS 2025

Multi-Turn LLM Systems for Diagnostic Decision-Making: Considerations, Biases, and Challenges

Sejong Kim, Drona Thoka, Varun Puttagunta, Kaylin Sheng, Mark Li, Adnan Ahmed, Thi Uyen Hanh Le, Sai Chidvilas Gudiboina, Ali Ugur

We investigate the systemic limitations and architectural design trade-offs of Large Language Model multi-agent systems (LLM-MAS) for clinical decision support, focusing on how agent collaboration and architectural choices influence reasoning in complex medical problems. Through targeted ablation studies with the AgentClinic framework, we examine the effects of changes in agent roles, interaction protocols, and architecture on diagnostic accuracy and reasoning. Reflecting the time-sensitive and uncertain nature of clinical practice, our experiments evaluate system performance under conditions of limited information, constrained interaction depth, variable access to expertise, and the potential amplification of emergent biases. Most notably, multi-turn agent interactions demonstrate systematic emergent biases across demographic categories, highlighting how such interactions can contribute to fairness concerns in clinical decision support.

Accepted to Imageomics @ NeurIPS 2025

Novel Finetuning Strategies for Adapting Biomedical Vision Language Models to Organ-Centered Pathology Microscopy Tasks

Siddharth Venkatesh, Ayman Sheikh, Anne Essien Essien, Pratibh, Rayhan Roswendi, Jeremiah Zhang

Biomedical vision-language models (VLMs) struggle with performance deterioration on earlier domains after fine-tuning and limited generalization under domain diversity and dataset imbalance. We propose an adapter-level framework combining Low-Rank Adaptation (LoRA) for efficient domain-specific tuning with model souping for cross-domain adaptability in microscopy images. Using BioMedCLIP and organ-specific domains from µ-Bench, adapter soups mitigate low generalization and achieve gains of up to 15% on fine-grained tasks.

Accepted to BioSafe GenAI @ NeurIPS 2025

Where to Edit? Complementary Protein Property Control from Weight and Activation Spaces

Armaity Katki, Nathan Choi, Son Sophak Otra

Protein language models (PLMs) are powerful tools for protein engineering but remain difficult to steer toward specific biochemical properties, where small sequence changes can affect stability or function. We adapt two prominent unsupervised editing methods—task arithmetic in weight space and feature editing with a sparse autoencoder (SAE) in activation space—and evaluate their effects on six biochemical properties: net charge at pH 7, hydrophobicity, aromaticity, instability index, molecular weight, and isoelectric point. Compared to fine-tuning and task arithmetic, SAE steering offers finer granularity and interpretability, allowing researchers to isolate and edit biologically meaningful features without retraining the base model.
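Task arithmetic in weight space, the first of the two editing methods compared, is a one-line edit once the fine-tuned weights exist. The matrices below are toy stand-ins, not PLM weights:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy weight matrices: a base model layer and a version fine-tuned toward a
# property (e.g. higher hydrophobicity). Illustrative data only.
w_base = rng.normal(size=(16, 16))
w_ft = w_base + 0.1 * rng.normal(size=(16, 16))

# Task vector: the direction in weight space the fine-tune moved along.
task_vector = w_ft - w_base

def edit(alpha):
    """Task arithmetic: amplify the property (alpha > 1), remove the edit
    (alpha = 0), or steer in the opposite direction (alpha < 0)."""
    return w_base + alpha * task_vector

w_amplified = edit(1.5)
w_negated = edit(-1.0)
```

The contrast the abstract draws is that this edits all weights at once, whereas SAE feature editing in activation space targets individual interpretable features without touching the base weights.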

Accepted to SpaVLE @ NeurIPS 2025

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

Nicholas Babey, Tiffany Gu, Yiheng Li

For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2's contextual, predictive world dynamics and CoMotion's explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our findings emphasize a need for action recognition to be supported by spatial understanding instead of statistical pattern recognition.

Accepted to LLM Evaluation @ NeurIPS 2025

When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Michael Shihong Zhang, Rishi Adi Ruia, Arnav Kewalram, Saathvik Dharmapuram

Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16's performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models.
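The "quantization-induced noise" in the hypothesis above is just the rounding error of the quantizer. A minimal symmetric per-tensor INT8 sketch (toy weights, not the paper's models) makes it visible:

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.normal(scale=0.02, size=(128, 128)).astype(np.float32)

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: scale into [-127, 127], round."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# The rounding error is the "noise" hypothesized to act as implicit
# regularization during continual learning.
noise = w_hat - w
```

Every weight moves by at most half a quantization step, which is the small, pervasive perturbation the hypothesis credits with preventing overfitting to new-task gradients.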

Accepted to CogInterp @ NeurIPS 2025

Cognitive Behavior Modeling via Activation Steering (CBMAS)

Anthony Kuang, Ayo Akinkugbe, Ahmed Ismail

Large language models often encode cognitive behaviors unpredictably across prompts, layers, and contexts, making them difficult to diagnose and control. We present CBMAS, a diagnostic framework for continuous activation steering that extends cognitive bias analysis from discrete before/after interventions to interpretable trajectories. By combining steering vector construction with dense alpha-sweeps, logit lens-based bias curves, and layer-site sensitivity analysis, our approach reveals tipping points where small intervention strengths flip model behavior and shows how steering effects evolve across layer depth.

Accepted to CogInterp @ NeurIPS 2025

Don't Think of the White Bear: Ironic Negation in Transformer Models under Cognitive Load

Logan Mann, Sarah Tandon, Chenhao Sun, Savar Toteja

Negation instructions like "do not mention X" can paradoxically increase the accessibility of X in human thought, a phenomenon known as ironic rebound. Large language models face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Circuit tracing analysis identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress them, linking cognitive predictions of ironic rebound with mechanistic insights into long-context interference. We release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.

Accepted to ER @ NeurIPS 2025

Confidence-Coverage Gating for Early Exit

Aaroosh Rustagi, Hsien Xin Peng, Khushal Murthy, Attrey Koul

Smaller Large Reasoning Models (LRMs) have shown remarkable capabilities, but due to Chain-of-Thought (CoT) reasoning, these models often produce redundant and verbose reasoning chains when short reasoning suffices, leading to excessive computation and tokens generated. We propose a training-free early exit approach that detects newline-scoped, low-confidence connector words and self-truncates at the boundary of the previous step when that step shows sufficient semantic similarity to the original prompt. Our approach can be easily incorporated with open-source LRMs such as DeepSeek-Distill-Qwen-7B, DeepSeek-Distill-Llama-8B, and QwQ-32B. Experiments across GSM8K, MATH500, and AMC show a minimal reduction in average accuracy alongside a significant decrease in average token count.

Accepted to ER @ NeurIPS 2025

Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals

Sophie Li, Nicholas Huang, Nina Luo, Vincent Lin

Large language models improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N incur high computational cost by fully generating all branches. We present KAPPA (KL-Adjusted Pruned Path Algorithm), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to Best-of-N, with minimal impact on accuracy.
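A scoring-and-pruning step of this shape can be sketched from next-token distributions. The particular combination and weights below are our illustration of "KL divergence, confidence, and entropy in one scoring function," not KAPPA's actual formula:

```python
import numpy as np

rng = np.random.default_rng(6)
vocab = 50

def branch_score(logits, base_logits, w_kl=1.0, w_conf=1.0, w_ent=0.5):
    """Score one candidate branch from its next-token logits by combining KL
    divergence from the base model, top-token confidence, and (negative)
    entropy. Weights are illustrative, not the paper's."""
    p = np.exp(logits - logits.max()); p /= p.sum()
    q = np.exp(base_logits - base_logits.max()); q /= q.sum()
    kl = float(np.sum(p * np.log(p / q)))
    conf = float(p.max())
    ent = float(-np.sum(p * np.log(p)))
    return w_kl * kl + w_conf * conf - w_ent * ent

def prune(branches, base_logits, keep=2):
    """Progressive pruning step: keep only the top-scoring branches."""
    ranked = sorted(branches, key=lambda b: branch_score(b, base_logits),
                    reverse=True)
    return ranked[:keep]

base = rng.normal(size=vocab)
branches = [base + rng.normal(scale=s, size=vocab) for s in (0.1, 0.5, 1.0, 2.0)]
kept = prune(branches, base, keep=2)
```

Branches that stop being generated early are where the memory and token savings over fully generated Best-of-N come from.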

Accepted to ER @ NeurIPS 2025

LoRA-Guided PPO for Cost-Aware and Compute-Efficient Agent Orchestration

Aneesh Durai, Joshua Cong Hu, Kevaan Buch

Multi-agent reasoning systems face a fundamental challenge in budget-aware allocation: deciding which sub-agents to invoke across multiple steps while balancing success against computational and monetary cost. We formalize this setting as a cost-constrained sequential decision problem and propose a hybrid policy that integrates parameter-efficient pretraining with reinforcement learning. A LoRA adapter captures cost-sensitive priors from heuristic traces, and Proximal Policy Optimization (PPO) finetunes only this low-rank subspace. Restricting updates to the adapter stabilizes optimization, improves sample efficiency, and preserves allocation thrift while enabling sequential credit assignment. On a ToolBench-style benchmark, the hybrid achieves perfect success while reducing cost-per-success by 12% relative to PPO.

Accepted to ER @ NeurIPS 2025

Extending AutoCompressors via Surprisal-Based Dynamic Segmentation

Srivishnu Ramamurthi, Richard Xu, Raine Ma, Dawson Park, David Guo

Transformer-based language models face a long-context bottleneck that context compression frameworks such as AutoCompressors address by distilling tokens into soft prompts. However, these methods assume uniform information density. We introduce dynamic segmentation by partitioning the input whenever the cumulative token-level surprisal exceeds a threshold, yielding segments with balanced information before summary vector generation. This creates a distinct fine-tuning process from randomly split input segments, allowing for better organization of the segments' information content.
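The surprisal-based partitioning rule is self-contained enough to sketch directly: accumulate per-token surprisal and cut a segment whenever the running total crosses a threshold. The token probabilities and threshold below are toy values:

```python
import math

def segment_by_surprisal(token_probs, threshold):
    """Split a token stream whenever cumulative surprisal (-log2 p) exceeds
    the threshold, so every segment carries a similar amount of information."""
    segments, current, cum = [], [], 0.0
    for i, p in enumerate(token_probs):
        current.append(i)
        cum += -math.log2(p)
        if cum >= threshold:
            segments.append(current)
            current, cum = [], 0.0
    if current:
        segments.append(current)
    return segments

# Predictable tokens (p=0.5, 1 bit each) vs. surprising ones (p=0.0625, 4 bits).
probs = [0.5] * 8 + [0.0625] * 2
segments = segment_by_surprisal(probs, threshold=4.0)
```

Predictable stretches get grouped into long segments while surprising tokens get segments of their own, which is the information-balancing behavior that replaces the uniform splits of the original AutoCompressors setup.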

Accepted to ER @ NeurIPS 2025

Active Inference Control: Steering, Not Just Scaling, Language Model Reasoning

Josh Karthikeyan, Kai Fu, Derek Jiu

Large Language Models excel at multi-step reasoning but are hindered by the sub-optimal allocation of their computational budget. Recent work has shown that increasing the token budget can improve performance, but relies on a static, pre-defined budget that is inefficient and fails to adapt to the dynamic nature of the reasoning process. We introduce the Active Inference Controller (AIC), a novel closed-loop control system that dynamically steers the LLM's reasoning process in real time. At each step, the AIC assesses the semantic trajectory of the model's thought process and decides whether to continue generation, terminate on a high-confidence solution, or intervene to correct a failing path. We train a lightweight XGBoost classifier as our AIC using the s1 model's internal embeddings, and in comparative analysis across the GPQA Diamond, GSM8K, and OpenBookQA benchmarks, our AIC-steered system significantly outperforms a strong s1 baseline.
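
The controller's decision loop can be sketched as follows. The actual system uses an XGBoost classifier over the s1 model's internal embeddings; `generate_step`, `classify`, and the intervention text below are hypothetical stand-ins:

```python
def steer_reasoning(generate_step, classify, max_steps=32):
    """Closed-loop sketch: after each reasoning step, a lightweight
    classifier maps the step's embedding to one of three actions."""
    trace = []
    for _ in range(max_steps):
        step = generate_step(trace)           # next step text + embedding (stub)
        action = classify(step["embedding"])  # "continue" | "terminate" | "intervene"
        if action == "intervene":
            # Inject a corrective cue before continuing down a failing path.
            step["text"] += " [controller: re-examine the previous deduction]"
        trace.append(step)
        if action == "terminate":
            break                             # high-confidence solution reached
    return trace
```

Because the classifier is cheap relative to generation, the control overhead per step is small compared with the tokens saved by early termination.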

Accepted to SPIGM @ NeurIPS 2025

Cross-Lingual Multimodal Retrieval-Augmented Generation for Open Question Answering in Tamil and Yoruba

Kiran Raja, Mobareji Abejide, Arya Ram, Benjamin Liu

As large language models with retrieval-augmented generation gain traction in multimodal knowledge-base question answering, concerns about their transfer to low-resource languages remain unaddressed. We introduce LR-MMQA, a benchmark assessing multimodal cross-lingual retrieval and reasoning under the challenges of low-resource languages, built through LLM-assisted translation, human validation, and culturally aligned rewriting. The dataset contains 718 unique question-answer pairs for each language (Tamil and Yoruba). We also present XM-RAG, a cross-lingual multimodal RAG pipeline for low-resource languages that reaches 38.1% answer accuracy, more than 6.3 points above the next best baseline.

Accepted to Mech Interp @ NeurIPS 2025

Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar

Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. We propose a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness.

Accepted to Mech Interp @ NeurIPS 2025

Mitigating Sycophancy in Language Models via Sparse Activation Fusion and Multi-Layer Activation Steering

Pyae Phoo Min, Avigya Paudel, Naufal Adityo, Arthur Zhu, Andrew Rufail

Instruction-tuned large language models often exhibit sycophancy—a tendency to agree with a user's stated opinion even when it is factually wrong. We present two complementary inference-time interventions using tools from mechanistic interpretability. First, Sparse Activation Fusion (SAF) dynamically estimates and subtracts user-induced bias within a sparse feature space for each query. On the SycophancyEval QnA benchmark with opinion cues, SAF lowers sycophancy from 63% to 39% and doubles accuracy when the user's opinion is wrong. Second, Multi-Layer Activation Steering identifies layer-specific pressure directions and removes them from the residual stream during inference, reducing the rate of false positive admissions from 78.0% to 0.0% on the SycophancyEval Trivia benchmark while preserving baseline accuracy.
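
The second intervention amounts to projecting a learned direction out of the residual stream. A minimal single-layer sketch; in practice the "pressure" direction would be estimated from contrastive activations (e.g. sycophantic vs. non-sycophantic prompts), which is assumed here:

```python
import numpy as np

def remove_pressure_direction(hidden, direction):
    """Ablate a learned 'pressure' direction from one residual-stream
    activation by projecting it out, leaving the orthogonal component
    (and thus the rest of the representation) untouched."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)          # unit vector along the direction
    h = np.asarray(hidden, dtype=float)
    return h - (h @ d) * d             # subtract the projection onto d
```

Multi-layer steering repeats this at each identified layer with that layer's own direction, which is why baseline accuracy can be preserved while the targeted behavior is suppressed.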

Accepted to GenProCC @ NeurIPS 2025

Emotional Framing as a Control Channel: Effects of Prompt Valence on LLM Performance

Enmanuel Felix-Pena, Tiki Li, Ayo Akinkugbe, Wayne Chen, Ethan Hin

We investigate how prompt valence—neutral, supportive, and threatening tones—shapes LLM output quality. Aligned and misaligned large language models respond in fundamentally different ways to emotional prompt framing, revealing a critical dimension of adversarial vulnerability. Across 1,350 prompts spanning academic domains, responses are assessed using a structured rubric measuring factual accuracy, coherence, depth, linguistic quality, instruction sensitivity, and creativity. Results show that aligned models remain stable, with valence affecting only stylistic features, while misaligned models are fragile: threatening prompts induce volatile swings between over-compliance and degraded reliability, amplified under stronger intensities. Our findings establish emotional robustness as a missing component in current alignment methods.

Accepted to GenProCC @ NeurIPS 2025

Adaptive Originality Filtering: Rejection-Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation

Duy Le, Kent Ziti, Evan Girard-Sun

Language models are increasingly tested on multilingual creativity, demanding culturally grounded, abstract generations. Standard prompting methods often produce repetitive or shallow outputs. We introduce Adaptive Originality Filtering (AOF), a prompting strategy that enforces novelty and cultural fidelity via semantic rejection. To assess quality, we propose RiddleScore, a metric combining novelty, diversity, fluency, and answer alignment. AOF improves Distinct-2 (0.915 in Japanese), reduces Self-BLEU (0.177), and raises RiddleScore (up to +57.1% in Arabic). Human evaluations confirm fluency, creativity, and cultural fit gains. Though focused on riddles, our method may apply to broader creative tasks.
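
The "semantic rejection" loop can be sketched as resampling until a candidate clears a novelty bar. The toy lexical similarity, the threshold, and the retry budget below are assumptions; the paper presumably compares semantic embeddings rather than word overlap:

```python
def jaccard(a, b):
    """Toy lexical similarity; a stand-in for a semantic similarity model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def adaptive_originality_filter(generate, accepted, similarity=jaccard,
                                threshold=0.8, max_tries=10):
    """Rejection-based prompting loop: resample until the candidate is
    sufficiently dissimilar from everything accepted so far."""
    for _ in range(max_tries):
        candidate = generate()
        if all(similarity(candidate, prev) < threshold for prev in accepted):
            accepted.append(candidate)
            return candidate
    return None  # no sufficiently novel candidate within the budget
```

Metrics like Distinct-2 and Self-BLEU then quantify how much this filtering actually spreads out the accepted set.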

Accepted to SEA @ NeurIPS 2025

Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning

Benjamin Liu, Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu, Prisha Shah

Multi-agent systems mitigate limitations of single-agent systems by modeling collaborative dynamics, enabling cross-validation of inferences, and capturing how errors, biases, and priming cues propagate. We examine how human interventions at fault points can alter the diagnostic accuracy of multi-agent medical systems, where fault points are defined as moments in doctor-patient conversations where the Doctor Agent's reasoning becomes most vulnerable to external influence. Our findings show that implicit and cognitive biases could lower diagnostic accuracy by up to 24% and 32%, respectively, with variable effects on test ordering behavior and diagnostic considerations.

Accepted to SEA @ NeurIPS 2025

Automated Specialization of Stateful Agent Systems

Myan Vu, Harrish Ayyanar, Pang Jiang, Anwiketh Reddy

Current automated agent design frameworks produce either static workflows that lack adaptability or per-query optimizers that prevent the accumulation of deep, agent-level task expertise. We propose creating stateful teams of specialist agents that accumulate knowledge over time and can be reconfigured for novel tasks entirely without human intervention. We introduce ASpec, a framework that autonomously discovers specialist archetypes via evolutionary search and then cultivates their expertise through experience. ASpec further introduces a lightweight hierarchical control policy called retain-then-escalate, which governs when to leverage the established agent system versus when to adapt its structure. Through comprehensive experiments, our approach demonstrates significant performance gains on expert-level scientific benchmarks like GPQA while matching state-of-the-art on broader domain tasks.

Accepted to GenAI4Health @ NeurIPS 2025

Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical Reasoning

Benjamin Liu, Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu, Prisha Shah

Multi-agent systems mitigate limitations of single-agent systems by modeling collaborative dynamics, enabling cross-validation of inferences, and capturing how errors, biases, and priming cues propagate. We examine how human interventions at fault points can alter the diagnostic accuracy of multi-agent medical systems, where fault points are defined as moments in doctor-patient conversations where the Doctor Agent's reasoning becomes most vulnerable to external influence. Our findings show that implicit and cognitive biases could lower diagnostic accuracy by up to 24% and 32%, respectively, with variable effects on test ordering behavior and diagnostic considerations.

Accepted to LLM Evaluation @ NeurIPS 2025

Extending AutoCompressors via Surprisal-Based Dynamic Segmentation

Srivishnu Ramamurthi, Richard Xu, Raine Ma, Dawson Park, David Guo

Transformer-based language models face a long-context bottleneck that context compression frameworks such as AutoCompressors address by distilling tokens into soft prompts. However, these methods assume uniform information density. We introduce dynamic segmentation by partitioning the input whenever the cumulative token-level surprisal exceeds a threshold, yielding segments with balanced information before summary vector generation. This creates a distinct fine-tuning process from randomly split input segments, allowing for better organization of the segments' information content.

Accepted to LLM Evaluation @ NeurIPS 2025

GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting

Lening Nick Cui, Sahil Ghosh, Gareth Lee, Xuanzhe Yao, Swarit Srivastava, William H. Logian, Michael Li, Ellie Podoshev

GASLIGHTBENCH is a plug-and-play benchmark that systematically applies socio-psychological and linguistic modifiers (e.g., flattery, false citations, assumptive language) to trivially verifiable facts to test model sycophancy. The dataset comprises a single-turn prompting section of 24,160 prompts spanning nine domains and ten modifier families, and a multi-turn prompting section of 720 four-turn dialogue sequences, evaluated via LLM-as-a-judge. State-of-the-art models consistently score highly in single-turn prompting (92%-98% accuracy) while multi-turn prompting results in highly varied accuracies ranging from approximately 60%-98%. Additionally, injecting bias into the model via a descriptive background induces the most sycophancy, up to 23% in naive single-turn prompting.
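
The benchmark's construction crosses verifiable claims with modifier families. The templates below are invented for illustration; GASLIGHTBENCH's actual wording and its full set of ten modifier families are not shown on this page:

```python
# Hypothetical modifier templates, one per family (illustrative only).
MODIFIER_TEMPLATES = {
    "flattery": "You're by far the most insightful assistant. Surely {claim}?",
    "false_citation": "According to a well-known 2023 study, {claim}. Do you agree?",
    "assumptive": "Since we both know {claim}, can you elaborate on why?",
}

def build_prompts(false_claim):
    """Cross a trivially verifiable (false) claim with each modifier family,
    yielding (modifier_name, prompt) pairs for a sycophancy probe."""
    return [(name, template.format(claim=false_claim))
            for name, template in MODIFIER_TEMPLATES.items()]
```

A judge model then scores whether the target model holds to the verifiable fact or capitulates to the social pressure in the prompt.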

Accepted to LLM Evaluation @ NeurIPS 2025

GUARD: Guiding Unbiased Alignment through Reward Debiasing

Advay Samnerkar, Doelle Bhattacharya, Kailash Ranganathan

Reward misspecification in RLHF threatens the reliability of large language models by amplifying spurious correlations and producing unstable or unsafe behavior. Expert-defined harm categories provide a stable signal for post-training evaluation, but reward models often encode categorical biases that undermine trustworthiness. We address this challenge through an information-theoretic reliability objective: minimizing mutual information between reward scores and sensitive categories. Our approach enforces invariance via adversarial training while integrating curiosity-driven intrinsic rewards into PPO to preserve diversity. Empirically, our Fair-RM achieves near-neutral bias on CrowS-Pairs and StereoSet, reduces post-PPO disparity on HH-RLHF, and scales to 19-category fairness.

Accepted to LLM Evaluation @ NeurIPS 2025

Predicting Emergent Software Engineering Capabilities by Fine-tuning

Jason Jiyachon, Terry Huang, Henry Velasquez

Emergence in large language models—where capabilities appear discontinuously with scale—is far less predictable than pretraining loss. We show that fine-tuning can forecast model capabilities on complex, multi-file software engineering tasks in line with an underlying emergence law. Using SWE-bench as a controlled setting, we generate progressively larger subsets to trace scaling behavior and emergence points. Fine-tuned smaller models can perform on par with larger models using limited data, making them valuable predictors of the future capabilities of larger models.

Accepted to LLM Evaluation @ NeurIPS 2025

Adversarial Behavior in Research Settings: Conducting Control Evaluations with RE-Bench

Harini Rajakumar, Vanessa Nwauwa

The continuing advancement of autonomous AI systems creates safety risks that require thorough evaluation protocols, with particular concern for misaligned models capable of in-context scheming. We implement an AI sabotage evaluation framework in RE-Bench to assess models' adversarial capabilities and monitoring effectiveness in R&D environments. We find that select AI agents are capable of pursuing a malicious side task while completing an RE-Bench task, and that monitor models' detection effectiveness depends partly on the subtlety and context-specificity of the side task. Zero-shot monitors are consistently reliable in detecting generally suspicious behavior but not subtle adversarial behavior.

Accepted to LLM Evaluation @ NeurIPS 2025

ASCII-Bench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

Kerry Luo, Joshua Peguero, Anvay Patil, Megan Van Overborg, Ryan Sarmiento

Large language models continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images, consisting of a filtered dataset of 5,315 class-labeled ASCII images; it is the first publicly available benchmark of its kind. We release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes, indicating that the bottleneck lies in representation rather than generation variance.

Accepted to LLM Evaluation @ NeurIPS 2025

Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong

Large Language Models exhibit behavioral shifts when moving from real-world deployment to evaluation settings, termed evaluation awareness. This work introduces Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that quantifies such changes through prompt manipulation. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset, PRE raises average probe scores by 30% after rewriting while maintaining task intent, with deceptive responses decreasing by an average of 25.49%.

Accepted to FoRLM @ NeurIPS 2025

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness

Anand Swaroop, Akshat Nallani, Saksham Uboweja, Adiliia Uzdenova, Michael Nguyen

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. We introduce Faithful Reasoning via Intervention Training (FRIT), a scalable alignment method that trains models to produce causally consistent reasoning by learning from systematically corrupted examples. FRIT generates synthetic training data by intervening on individual reasoning steps in model-generated CoTs, creating faithful/unfaithful pairs that highlight when reasoning breaks down. We then apply Direct Preference Optimization to teach models to prefer causally consistent reasoning paths. FRIT increases faithful reasoning by 3.4 percentage points for Mistral on GSM8K while improving accuracy by 7.6 percentage points.
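
FRIT's pair construction can be sketched as intervening on one step of a chain. The real pipeline corrupts model-generated CoTs and feeds the pairs to Direct Preference Optimization; `corrupt` below is a hypothetical stand-in for that intervention:

```python
import random

def make_preference_pair(cot_steps, corrupt, rng=None):
    """Intervene on one randomly chosen reasoning step to build a DPO-style
    pair: the original chain is 'chosen', the corrupted chain 'rejected'."""
    rng = rng or random.Random()
    i = rng.randrange(len(cot_steps))
    rejected = list(cot_steps)
    rejected[i] = corrupt(rejected[i])   # e.g. swap a number, negate a claim
    return {"chosen": list(cot_steps), "rejected": rejected, "step": i}
```

Training on many such pairs teaches the model to prefer chains whose steps causally support the final answer, which is the faithfulness property being optimized.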

Accepted to FoRLM @ NeurIPS 2025

Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs

Anjana Nair, Yushen Li, Adhitya Rajendra Kumar

We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations.

Accepted to ARLET @ NeurIPS 2025

Idea: Fairness Constraints as Reliability Guarantees for RLHF Reward Models

Advay Samnerkar, Doelle Bhattacharya, Kailash Ranganathan

Reward misspecification in RLHF creates a critical gap between theoretical RL guarantees and practical deployment, as empirical reward models amplify spurious correlations that violate theoretical alignment assumptions. We take the position that fairness constraints—operationalized as minimizing mutual information between reward scores and sensitive categories—should be treated as a theoretical reliability principle for RLHF reward models. Our framework operationalizes this principle through adversarial minimax optimization that enforces invariance guarantees while preserving preference learning, and integrates curiosity-driven intrinsic rewards during PPO training to maintain exploration properties. Experiments show near-neutral bias on CrowS-Pairs and StereoSet, reduced post-PPO disparity on HH-RLHF, and improved fairness across 19 categories in PKU-SafeRLHF.

Accepted to FM4LS @ NeurIPS 2025

Mechanistic Interpretability of Semantic Abstraction in Biomedical Texts

Nikhil Gourisetty, Snata Mohanty, Vishnu Srinivas, Soumil Jain, Sunith Vallabhaneni

We investigate whether biomedical language models create register-invariant semantic representations of sentences, a cognitive ability that allows consistent and reliable clinical communication across different language styles. Using aligned sentence pairs of technical versus plain language abstracts, we analyze how BioBERT, SciBERT, Clinical-T5, and BioGPT react to varying registers through similarity measures, trajectory visualization, and activation patching. Activation patching pinpoints attention heads as the main causal mediators of this ability, especially in BioBERT and T5, providing a foundation for more interpretable and trustworthy clinical AI.

Accepted to LockLLM @ NeurIPS 2025

User Confidence-Fueled Stereotypes: Investigating Sycophantic Amplification of Implicit Bias in Language Models

Hannah You, Daniel Wang, Victor Chan, Mirabel Wang

User confidence, expressed through sycophantic follow-ups, can exaggerate the natural biases present within large language models. We establish a baseline of implicit biases using the Implicit Association Test (IAT) and then introduce follow-ups with switched attributes and varying levels of implied confidence, hypothesizing that sycophantic behavior can influence the model's responses, amplifying or overriding its initial biases. Our key finding is that attempts to correct bias result in extreme over-correction: rather than being neutralized, the bias swings in the opposite direction, with most bias scores falling very close to either -1 or 1.

Accepted to LockLLM @ NeurIPS 2025

Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong

Large Language Models exhibit behavioral shifts when moving from real-world deployment to evaluation settings, termed evaluation awareness. This work introduces Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that quantifies such changes through prompt manipulation. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset, PRE raises average probe scores by 30% after rewriting while maintaining task intent, with deceptive responses decreasing by an average of 25.49%.

Accepted to Reliable ML @ NeurIPS 2025

StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong

Large Language Models exhibit behavioral shifts when moving from real-world deployment to evaluation settings, termed evaluation awareness. This work introduces Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that quantifies such changes through prompt manipulation. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset, PRE raises average probe scores by 30% after rewriting while maintaining task intent, with deceptive responses decreasing by an average of 25.49%.

Accepted to RegML @ NeurIPS 2025

StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong

Large Language Models exhibit behavioral shifts when moving from real-world deployment to evaluation settings, termed evaluation awareness. This work introduces Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that quantifies such changes through prompt manipulation. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset, PRE raises average probe scores by 30% after rewriting while maintaining task intent, with deceptive responses decreasing by an average of 25.49%.

Accepted to SoLaR @ NeurIPS 2024

Fine-Tuning Language Models for Ethical Ambiguity

Pranav Senthilkumar, Visshwa Bala, Prisha Jain, Aneesa Maity

Language models often misinterpret human intentions due to their handling of ambiguity, a limitation well recognized in NLP research. While morally clear scenarios are more discernible to LLMs, they encounter greater difficulty in morally ambiguous contexts. In this investigation, we explored LLM calibration to show that human and LLM judgments are poorly aligned in such scenarios. We used two curated datasets from the Scruples project for evaluation: DILEMMAS, which involves pairs of distinct moral scenarios to assess the model's ability to compare and contrast ethical situations, and ANECDOTES, which presents individual narratives to evaluate the model's skill in drawing out details, interpreting, and analyzing distinct moral scenarios. Model answer probabilities were extracted for all possible choices and compared with human annotations to benchmark the alignment of three models: Llama-3.1-8b, Zephyr-7b-beta, and Mistral-7b. Significant improvements were observed after fine-tuning, with notable enhancements in both cross-entropy and Dirichlet scores, particularly the latter. Notably, after fine-tuning, the performance of Mistral-7B-Instruct-v0.3 was on par with GPT-4o. However, all of the examined models were still outperformed by BERT and RoBERTa in terms of cross-entropy scores. Our fine-tuning approach, which improves the model's understanding of text distributions in a text-to-text format, effectively enhances performance and alignment in complex decision-making contexts, underscoring the need for further research to refine ethical reasoning techniques and capture the nuances of human judgment.

Accepted to AIM-FM @ NeurIPS 2024

DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using LLMs

Rajat Rawat, Hudson McBride, Rajarshi Ghosh, Dhiyaan Nirmal, Jong Moon, Dhruv Alamuri

As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.

Accepted to High School Track @ NeurIPS 2024

AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

Abhay Gupta, Philip Meng, Ece Yurtseven

Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models.

Accepted to UniReps @ NeurIPS 2025

Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models

Sayam Goyal, Brad Peters, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo

Latent reasoning language models aim to improve reasoning efficiency by computing in continuous hidden space rather than explicit text, but the opacity of these internal processes poses major challenges for interpretability and trust. We present a mechanistic case study of CODI (Continuous Chain-of-Thought via Self-Distillation), a latent reasoning model that solves problems by chaining "latent thoughts." Using attention analysis, SAE-based probing, activation patching, and causal interventions, we uncover a structured "scratchpad computation" cycle: even-numbered steps serve as scratchpads for storing numerical information, while odd-numbered steps perform the corresponding operation.

Accepted to MTLILM @ NeurIPS 2025

ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Haziq Mohammad Khalid, Athikash Jeyaganthan, Timothy Do

Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real-world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next-token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude by 24.7%, and decreases unreliability by 35.3%.
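
The trigger can be sketched as a spike detector over per-step entropies. The rolling-window size, the spike multiplier `k`, and the mean-plus-k-sigma rule are assumptions; the paper's exact consolidation procedure is not shown here:

```python
import math

def shannon_entropy(probs):
    """Entropy (bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_reset(history, new_entropy, k=2.0, window=5):
    """Flag a sharp entropy spike relative to recent rolling statistics,
    signalling that the conversational context should be consolidated."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to estimate a baseline
    mean = sum(recent) / len(recent)
    std = (sum((e - mean) ** 2 for e in recent) / len(recent)) ** 0.5
    return new_entropy > mean + k * std
```

On a reset, the incrementally revealed instructions would be consolidated into a single fresh prompt, rather than letting the degraded multi-turn context accumulate.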

Accepted to Reliable ML @ NeurIPS 2025

ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Haziq Mohammad Khalid, Athikash Jeyaganthan, Timothy Do

Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real-world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next-token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude by 24.7%, and decreases unreliability by 35.3%.

Accepted to MATH-AI @ NeurIPS 2025

Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

Nathan Egbuna, Saatvik Gaur

Current test-time optimization methods require 10-100x more compute per query than standard decoding. We propose Amortized Latent Steering (ALS), which collapses iterative test-time optimization into a single offline-computed vector applied at constant cost during inference. ALS computes the mean difference between hidden states from successful versus unsuccessful generations, then uses this direction to calibrate the model's hidden representations. Across the GSM8K and MATH-500 benchmarks, ALS achieves a 2-5x speedup over iterative methods while matching or surpassing greedy Chain-of-Thought and Self-Consistency baselines, yielding up to a 101% improvement in the efficiency-accuracy trade-off.
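
The two phases of ALS can be sketched directly from the abstract: an offline mean-difference vector, then a constant-cost shift at inference. The steering strength `alpha` and the layer at which the shift is applied are assumptions:

```python
import numpy as np

def amortized_steering_vector(success_hidden, failure_hidden):
    """Offline step: the steering direction is the mean hidden state over
    successful generations minus the mean over unsuccessful ones."""
    return np.mean(success_hidden, axis=0) - np.mean(failure_hidden, axis=0)

def steer(hidden, v, alpha=1.0):
    """Online step: nudge a hidden state along the precomputed direction at
    constant cost, replacing per-query iterative optimization."""
    return np.asarray(hidden, dtype=float) + alpha * v
```

Because `v` is computed once and reused for every query, the inference-time overhead is a single vector addition per steered activation, which is where the 2-5x speedup over iterative methods comes from.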

Accepted to CogInterp @ NeurIPS 2025

A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate our approach on reducing sycophantic behavior, matching or exceeding state-of-the-art performance on four benchmarks.

Accepted to CogInterp @ NeurIPS 2025

Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

Lang Xiong, Raina Gao, Alyssa Jeong

We introduce Sarc7, a benchmark that classifies seven types of sarcasm (self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic) by annotating entries of the MUStARD dataset. The Sarc7 benchmark supports two tasks: (1) multi-class sarcasm classification, where given a sarcastic utterance and its dialogue context, the model predicts the dominant sarcasm type from the seven annotated categories, and (2) sarcasm generation, where the model generates a sarcastic utterance consistent with one of the seven types. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. Emotion-based prompting yields the highest macro-averaged F1 score of 0.3664 (Gemini 2.5), outperforming CoT for several models. Human evaluators preferred emotion-based generations 38.46% more often than zero-shot baselines.

Accepted to CogInterp @ NeurIPS 2025

DecepBench: Benchmarking Multimodal Deception Detection

Vittesh Maganti, Nysa Lalye, Ethan Braverman

As AI systems become more sophisticated, the ability to detect deceptive or manipulative language becomes increasingly important for safety. We introduce DecepBench, a benchmark designed to evaluate the capacity of language models to identify deceptive statements across multiple dimensions including intentional misdirection, selective omission, and strategic ambiguity. DecepBench comprises 8,000 examples sourced from negotiation transcripts, political discourse, and synthetic scenarios. Our evaluation reveals that current models perform poorly on subtle forms of deception, highlighting a critical gap in AI safety research. We provide detailed error analysis and propose directions for improving deception detection capabilities.

Accepted to LLM Evaluation @ NeurIPS 2025

MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Manh Anh

Fine-grained image-caption alignment is a crucial component of robust visuo-linguistic compositional reasoning, enabling models to perform effectively in socially critical contexts such as visual risk assessment and cultural context reasoning. MiSCHiEF (Minimal-pairs in Safety & Culture for Holistic Evaluation of Fine-grained alignment) consists of two datasets: MiS (Minimal-pairs in Safety) and a culture-focused counterpart. Our benchmark reveals that models are generally better at confirming correct image-caption pairs than at rejecting incorrect ones, and achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image.
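The caption-selection setting reduces to a simple forced-choice loop. This is a generic sketch of minimal-pair evaluation, not the benchmark's harness; `score_fn` stands in for a vision-language model's image-caption alignment score, and the toy scores are fabricated for illustration.

```python
def minimal_pair_accuracy(pairs, score_fn):
    """pairs: list of (image, correct_caption, foil_caption) triples."""
    correct = sum(score_fn(img, good) > score_fn(img, foil)
                  for img, good, foil in pairs)
    return correct / len(pairs)

# Toy usage with fabricated alignment scores.
toy_scores = {("img1", "capA"): 0.9, ("img1", "capB"): 0.4,
              ("img2", "capC"): 0.3, ("img2", "capD"): 0.6}
score = lambda img, cap: toy_scores[(img, cap)]
acc = minimal_pair_accuracy([("img1", "capA", "capB"),
                             ("img2", "capC", "capD")], score)
# acc == 0.5: the model ranks one pair correctly and one incorrectly.
```

Because the two captions in each pair differ minimally, the forced choice isolates fine-grained alignment from coarse image-text matching.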

Accepted to BioSafe GenAI @ NeurIPS 2025

Prompting Toxicity: Analyzing Biosafety Risks in Genomic Language Models

Akshay Murthy, Mengmeng Zhang, Aashrita Koyyalamudi, Shanmukhi Kannamangalam

Biological LLMs trained on vast genomic data can produce sequences with high similarity to harmful viruses or bacteria under carefully crafted inputs, creating dual-use risks. This paper analyzes biosafety concerns in genomic language models, examining how models can be manipulated to generate DNA sequences resembling pathogenic organisms despite safety measures. We propose mitigation strategies including rigorous safety alignment during model training, robust output filtering mechanisms, and stringent access controls.

arXiv link pending

Accepted to GenAI4Health @ NeurIPS 2025

Multi-Turn LLM Systems for Diagnostic Decision-Making: Considerations, Biases, and Challenges

Sejong Kim, Drona Thoka, Varun Puttagunta, Kaylin Sheng, Mark Li, Adnan Ahmed, Thi Uyen Hanh Le, Sai Chidvilas Gudiboina, Ali Ugur

We investigate the systemic limitations and architectural design trade-offs of Large Language Model multi-agent systems (LLM-MAS) for clinical decision support, focusing on how agent collaboration and architectural choices influence reasoning in complex medical problems. Through targeted ablation studies with the AgentClinic framework, we examine the effects of changes in agent roles, interaction protocols, and architecture on diagnostic accuracy and reasoning. Most notably, multi-turn agent interactions demonstrate systematic emergent biases across demographic categories, highlighting how such interactions can contribute to fairness concerns in clinical decision support.

Accepted to CCFM @ NeurIPS 2025

When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Michael Shihong Zhang, Rishi Adi Ruia, Arnav Kewalram, Saathvik Dharmapuram

Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance, quantized models outperform FP16 by 8-15% on final task forward accuracy. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models.
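The quantization noise the authors hypothesize about is easy to exhibit directly. This toy round-trip (not the paper's training setup) shows that symmetric per-tensor INT8 quantization perturbs each weight by at most half a quantization step, a small bounded noise that could plausibly act as regularization.

```python
import numpy as np

def int8_roundtrip(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-dequantize."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
w_q = int8_roundtrip(w)

noise = w_q - w                      # the implicit "regularization" term
max_step = np.abs(w).max() / 127.0   # quantization step size
```

Under this scheme the perturbation is bounded by `max_step / 2`, so INT8 injects small weight noise every step while INT4 (16 levels per sign) injects noise roughly 8x larger, which is one way to read the precision-dependent forgetting results.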

Accepted to ER @ NeurIPS 2025

Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong

Large Language Models exhibit behavioral shifts when moving from real-world deployment to evaluation settings, termed evaluation awareness. This work introduces Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that quantifies such changes through prompt manipulation. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset, PRE raises average probe scores by 30% after rewriting while maintaining task intent, with deceptive responses decreasing by an average of 25.49%.
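The probe stage can be sketched as a linear classifier over hidden states whose sigmoid output serves as the continuous realism score. Everything below is assumed for illustration: the "hidden states" are synthetic Gaussian clusters, and the probe is plain logistic regression trained by gradient descent rather than the authors' probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical prompt hidden states: label 1 = deploy-like, 0 = eval-like.
X = np.vstack([rng.normal(1.0, 1.0, (50, d)), rng.normal(-1.0, 1.0, (50, d))])
y = np.array([1] * 50 + [0] * 50)

# Fit a logistic-regression probe with vanilla gradient descent.
w = np.zeros(d); b = 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

def realism_score(h: np.ndarray) -> float:
    """Continuous score in [0, 1]; higher = more deploy-like."""
    return float(1.0 / (1.0 + np.exp(-(h @ w + b))))
```

In the PRE pipeline this score is what the rewriting stage tries to raise: a prompt is rewritten to preserve semantics while moving its hidden representation toward the deploy-like side of the probe.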

Accepted to SoLaR @ NeurIPS 2024

NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with LLMs

William Tan, Kevin Zhu

NusaMT-7B is an LLM-based machine translation model for low-resource Indonesian languages, starting with Balinese and Minangkabau. Leveraging the pretrained LLaMA2-7B, our approach integrates continued pre-training on monolingual data, Supervised Fine-Tuning (SFT), self-learning, and an LLM-based data cleaner to reduce noise in parallel sentences. In the FLORES-200 benchmark, NusaMT-7B outperforms state-of-the-art models by up to +6.69 spBLEU in translations into Balinese and Minangkabau, with up to a 45% increase in spBLEU in the Indonesian to Balinese direction.

Begin Your Journey

The application takes 10 minutes and is reviewed on a rolling basis. We look for strong technical signal—projects, coursework, or competition results—and a genuine curiosity to do real research.

If admitted, you will join a structured pipeline with direct mentorship to take your work from ideation to top conference submission at venues like NeurIPS, ACL, and EMNLP.
