Our NeurIPS Publications
Note for high schoolers: NeurIPS acceptance is significantly more difficult than a high school science fair. Fewer than 0.2% of authors at NeurIPS are high school students.
Spring Deadline: Sunday, March 1 @ 11:59pm PT. Click to apply.

Algoverse research teams have consistently achieved publication success at top AI venues such as NeurIPS, EMNLP, and ACL.
These conferences primarily feature work from Ph.D. students and professional researchers at leading industry and academic labs, where acceptance rates are typically 30-50%. Our research teams' outcomes have consistently exceeded these baseline rates, reflecting our emphasis on rigorous mentorship and research quality.
AI conferences are where new research is reviewed and presented. The papers below were accepted through competitive peer review at leading venues such as NeurIPS and ACL.
Daniel Aarao Reis Arturi, Eric Zhang, Andrew Adrian Ansah
Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps.
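The two measurements named above can be sketched numerically. This is a minimal illustration with synthetic weight updates (real deltas would come from fine-tuned checkpoints); the shapes, mixing coefficients, and subspace dimension are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fine-tuned weight updates (delta_W = W_finetuned - W_base)
# from two tasks, flattened to vectors; delta_b partially shares a direction
# with delta_a to mimic cross-task convergence.
delta_a = rng.normal(size=1000)
delta_b = 0.8 * delta_a + 0.2 * rng.normal(size=1000)

# Cosine similarity between the two update vectors.
cos_sim = float(delta_a @ delta_b /
                (np.linalg.norm(delta_a) * np.linalg.norm(delta_b)))

# Principal angles between the top-5 left-singular subspaces of two
# (synthetic) update matrices: the cosines of the principal angles are
# the singular values of Ua^T Ub.
A = rng.normal(size=(50, 20))
B = rng.normal(size=(50, 20))
Ua = np.linalg.svd(A, full_matrices=False)[0][:, :5]
Ub = np.linalg.svd(B, full_matrices=False)[0][:, :5]
cosines = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
angles = np.arccos(np.clip(cosines, -1.0, 1.0))  # in radians
```

Small principal angles and high cosine similarity would both indicate that the updates share structure; the projection-overlap metric mentioned in the abstract can be derived from the same singular values.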
Sayam Goyal, Brad Peters, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo
Latent reasoning language models aim to improve reasoning efficiency by computing in continuous hidden space rather than explicit text, but the opacity of these internal processes poses major challenges for interpretability and trust. We present a mechanistic case study of CODI (Continuous Chain-of-Thought via Self-Distillation), a latent reasoning model that solves problems by chaining "latent thoughts." Using attention analysis, SAE-based probing, activation patching, and causal interventions, we uncover a structured "scratchpad computation" cycle: even-numbered steps serve as scratchpads for storing numerical information, while odd-numbered steps perform the corresponding operation.
Advey Nandan, Cheng-Ting Chou, Amrit Kurakula
We investigate neuron universality in independently trained GPT-2 Small models, examining how universal neurons emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. We find that 1-5% of neurons pass a target threshold of universality compared to random baselines. Ablation experiments reveal significant functional impacts of universal neurons on model predictions. Layer-wise ablation reveals that ablating universal neurons in the first layer causes a disproportionately large increase in both KL divergence and loss, suggesting early-layer universal neurons play a particularly critical role in shaping final predictions.
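The pairwise correlation analysis described above can be sketched as follows. This is a toy version with synthetic activations (real ones would be recorded from the MLP layers of each GPT-2 model over the same token stream); the sizes, the planted "universal" pair, and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_neurons = 2000, 64

# Hypothetical activations from two independently trained models on the
# same tokens. Neuron 0 in model B is made to track neuron 3 in model A,
# simulating a universal neuron.
acts_a = rng.normal(size=(n_tokens, n_neurons))
acts_b = rng.normal(size=(n_tokens, n_neurons))
acts_b[:, 0] = acts_a[:, 3] + 0.1 * rng.normal(size=n_tokens)

# Pearson correlation between every neuron in A and every neuron in B,
# computed from standardized activations.
za = (acts_a - acts_a.mean(0)) / acts_a.std(0)
zb = (acts_b - acts_b.mean(0)) / acts_b.std(0)
corr = za.T @ zb / n_tokens  # shape (n_neurons, n_neurons)

# A neuron in A counts as universal if its best match in B exceeds a
# threshold chosen well above the random-baseline correlation level.
best_match = np.abs(corr).max(axis=1)
universal = best_match > 0.5
```

With purely random activations the maximum pairwise correlation stays near zero, so thresholding against a random baseline, as the paper does, separates planted structure from chance.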
Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth
We investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Using a fine-tuned version of Llama-3 8B Base with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. Our contributions include extracting category-specific refusal steering vectors, providing empirical evidence that categorical steering reduces over-refusal on ambiguous and benign prompts while preserving refusal on harmful ones across safety benchmarks, and analysis showing that the identified refusal features are distinct, interpretable, and arise from refusal-token fine-tuning.
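A category-specific steering vector of this kind is commonly computed as a difference of mean activations. The sketch below uses synthetic residual-stream activations; the category names, dimensions, and steering coefficient are assumptions, and real activations would be captured with forward hooks on the fine-tuned model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

# Hypothetical mean residual-stream activations per category: one batch of
# prompts the model refuses, one batch it answers normally.
categories = ["violence", "self_harm", "privacy"]
steering = {}
for cat in categories:
    refuse_acts = rng.normal(size=(100, d_model)) + rng.normal(size=d_model)
    comply_acts = rng.normal(size=(100, d_model))
    v = refuse_acts.mean(0) - comply_acts.mean(0)  # difference-of-means direction
    steering[cat] = v / np.linalg.norm(v)          # unit-norm steering vector

# Subtracting a category's direction from the residual stream weakens that
# category's refusal (reducing over-refusal); adding it strengthens refusal.
alpha = 4.0
h = rng.normal(size=d_model)
h_less_refusal = h - alpha * steering["privacy"]
```

Keeping one unit vector per category is what makes the behavior controllable: each direction can be scaled or disabled independently.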
Saleena Angeline Sartawita, McNair Shah, Adhitya Rajendra Kumar, Naitik Chheda
We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that is strikingly low-rank. We find that steering in the dominant direction of this subspace allows for near elimination of harmful responses on a jailbreak dataset with a minor decrease in utility. Our findings advance the view that concept subspaces provide a scalable lens on LLM behavior and offer practical tools for the community to audit and harden future generations of language models.
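The low-rank structure of a stack of probe directions can be checked with an SVD. Below is a synthetic sketch: the 55 "probe" vectors are generated to share three underlying directions plus noise, an assumption chosen to mimic the paper's finding, not its actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_concepts = 256, 55

# Hypothetical probe directions for 55 harmfulness subconcepts, built to
# lie (approximately) in a shared 3-dimensional subspace.
basis = rng.normal(size=(3, d_model))
coeffs = rng.normal(size=(n_concepts, 3))
probes = coeffs @ basis + 0.05 * rng.normal(size=(n_concepts, d_model))

# SVD of the stacked probes: the singular-value spectrum reveals the
# effective rank of the harmfulness subspace.
U, S, Vt = np.linalg.svd(probes, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)
effective_rank = int(np.searchsorted(energy, 0.95)) + 1

# The top right-singular vector is the dominant direction of the subspace,
# the natural candidate for steering.
dominant = Vt[0]
```

An effective rank far below 55 is what "strikingly low-rank" means operationally: most subconcept probes are linear combinations of a few shared directions.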
Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe
We introduce WOLF (Weighted Online Learning Framework), a novel framework for evaluating large language model robustness in multi-turn conversational settings under adversarial pressure. WOLF systematically tests model responses when users attempt to manipulate, mislead, or extract harmful information through extended dialogue. Our framework provides quantitative metrics for measuring model stability and safety across conversation trajectories.
Aayush Aluru, Myra N. Malik, Samarth Patankar
Multi-agent systems often achieve higher reasoning accuracy than single models, but their reliance on repeated debates across agents makes them computationally expensive. We introduce SMAGDi, a distillation framework that transfers the debate dynamics of a five-agent Llama-based multi-agent system (MAS) into a compact Socratic decomposer-solver student. SMAGDi represents debate traces as directed interaction graphs, where nodes encode intermediate reasoning steps with correctness labels and edges capture continuity and cross-agent influence. On StrategyQA and MMLU, SMAGDi compresses a 40B multi-agent system into a 6B student while retaining 88% of its accuracy, substantially outperforming prior distillation methods.
Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi
Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate our approach on reducing sycophantic behavior, matching or exceeding state-of-the-art performance on four benchmarks.
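The gradient-masking step described above can be sketched with a plain array update. This toy version selects the top 3% of neurons by a (synthetic) probe score and zeroes all other gradients; the scores, learning rate, and flat parameter layout are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 1000

# Hypothetical probe scores: how predictive each MLP neuron is of the
# target behavior. Keep only the top 3% trainable.
probe_scores = rng.random(n_neurons)
k = int(0.03 * n_neurons)
top = np.argsort(probe_scores)[-k:]
mask = np.zeros(n_neurons)
mask[top] = 1.0

# One masked gradient step: multiplying the gradient by the mask means
# only the selected neurons' parameters change.
weights = rng.normal(size=n_neurons)
grad = rng.normal(size=n_neurons)
lr = 0.1
new_weights = weights - lr * mask * grad

changed = np.flatnonzero(new_weights != weights)
```

In a deep-learning framework the same effect is usually achieved by registering a hook that multiplies the parameter's gradient by the mask before the optimizer step.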
Xirui Huang, Joongho Kim
Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%.
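The online semantic merging at the core of this idea can be sketched as a greedy similarity filter over step embeddings. The embeddings below are synthetic stand-ins (a real system would use a sentence encoder), and the 0.9 threshold and greedy keep-first policy are assumptions, not necessarily SSDP's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_redundant(embeddings, threshold=0.9):
    """Greedy online merge: keep a candidate reasoning step only if its
    cosine similarity to every already-kept step is below the threshold."""
    kept = []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        if all(float(e @ k) < threshold for k in kept):
            kept.append(e)
    return kept

# Six hypothetical step embeddings; step 3 is a near-duplicate of step 0
# and should be merged away.
emb = rng.normal(size=(6, 32))
emb[3] = emb[0] + 0.01 * rng.normal(size=32)
kept = prune_redundant(emb)
```

Because each new step is compared only against the already-kept set, the check runs online during tree expansion rather than as a post-hoc deduplication pass.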
Nathan Egbuna, Saatvik Gaur
Current test-time optimization methods require 10-100x more compute per query than standard decoding. We propose Amortized Latent Steering (ALS), which collapses iterative test-time optimization into a single offline-computed vector applied at constant cost during inference. ALS computes the mean difference between hidden states from successful versus unsuccessful generations, then uses this direction to calibrate the model's hidden representations. Across GSM8K and MATH-500 benchmarks, ALS achieves 2-5x speedup over iterative methods while matching or surpassing greedy Chain-of-Thought and Self-Consistency baselines, yielding up to 101% improvement in efficiency-accuracy trade-off.
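The offline vector computation and constant-cost application can be sketched in a few lines. The hidden states here are synthetic (real ones would be collected from successful vs. unsuccessful generations of the target model), and the steering coefficient is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical hidden states gathered offline: one set from generations
# that reached a correct answer, one from generations that did not. The
# success set is shifted to give the two groups distinct means.
good = rng.normal(size=(200, d_model)) + 0.5
bad = rng.normal(size=(200, d_model))

# The ALS direction is the mean difference, computed once, offline.
als = good.mean(0) - bad.mean(0)
als = als / np.linalg.norm(als)

# At inference time the vector is simply added to the hidden state:
# one vector addition per token, i.e. constant cost.
def steer(hidden, vec, alpha=2.0):
    return hidden + alpha * vec

h = rng.normal(size=d_model)
h_steered = steer(h, als)
```

The amortization is the point: the expensive part (collecting and contrasting generations) happens once ahead of time, while decoding pays only a vector addition.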
Adhyayan Veer Singh, Aaron Shen, Brian Law, Ahmed Ismail
We present SwiftSolve, a novel approach to accelerating mathematical reasoning in large language models. By leveraging efficient computation patterns and strategic caching of intermediate reasoning steps, SwiftSolve achieves significant speedups on mathematical problem-solving benchmarks while maintaining accuracy. Our method introduces a hierarchical reasoning cache that stores reusable solution patterns, enabling the model to quickly retrieve and adapt known solution strategies to novel problems.
Aryan Singhal, Veronica Shao, Gary Sun, Ryan Ding
We investigate systematic biases in neural machine translation (NMT) systems when translating text between languages with different cultural contexts. Our analysis reveals that NMT systems often produce translations that reflect the dominant cultural perspectives present in their training data, leading to subtle but significant meaning shifts. We propose a framework for measuring and mitigating these translation biases, introducing metrics that capture semantic drift across cultural dimensions. Experiments on 15 language pairs demonstrate the prevalence of these biases and the effectiveness of our debiasing approaches.
Zhumazhan Balapanov, Edward Magongo, Vanessa Matvei, Olivia Holmberg
We introduce QIANets (Quantum-Inspired Attention Networks), a novel architecture that leverages quantum-inspired computational principles to achieve efficient attention computation. By reformulating the attention mechanism using tensor network decompositions inspired by quantum many-body physics, we achieve sub-quadratic complexity in sequence length while maintaining model expressiveness. Our approach demonstrates significant speedups on long-context tasks, with experiments showing 3-5x inference acceleration compared to standard transformers on sequences of 8K+ tokens.
Tim Knappe, Ryan Li, Ayush Chauhan, Kaylee Chhua
We propose Semantic Self-Consistency (SSC), a novel framework for evaluating the reasoning capabilities of large language models. SSC measures whether a model produces semantically equivalent answers when presented with logically equivalent formulations of the same question. Unlike traditional consistency metrics that focus on exact string matching, SSC captures deeper semantic alignment through learned embeddings. Our experiments reveal significant inconsistencies in state-of-the-art models, with performance dropping by 15-30% on semantically rephrased questions. We release a benchmark of 10,000 question pairs for evaluating SSC.
Pranav Senthilkumar, Visshwa Bala, Prisha Jain, Aneesa Maity
Current approaches to aligning large language models (LLMs) with human values often treat ethical decisions as binary classifications. We argue that this approach fails to capture the nuanced nature of real-world ethical dilemmas. We introduce an ethical ambiguity fine-tuning framework that teaches LLMs to recognize, articulate, and reason about situations where multiple valid ethical perspectives exist. Our method leverages a curated dataset of morally ambiguous scenarios annotated with diverse stakeholder perspectives. Experiments show that models fine-tuned with our approach demonstrate improved nuance in ethical reasoning while maintaining safety guardrails.
William Tan
We present NusaMT-7B, a 7-billion parameter multilingual machine translation model specifically designed for Southeast Asian languages. Despite the region being home to over 1,200 languages, existing translation systems provide limited support for most of them. NusaMT-7B covers 23 Southeast Asian languages, including many low-resource languages like Javanese, Sundanese, and Khmer. We introduce novel training techniques for handling low-resource language pairs and demonstrate state-of-the-art performance on the FLORES benchmark for covered languages, with particular gains for underrepresented language pairs.
Rajat Rawat, Hudson McBride, Rajarshi Ghosh, Dhiyaan Nirmal, Jong Moon, Dhruv Alamuri
As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. To ensure that our perturbations did not alter the clinical outcomes, we implemented a filtering strategy to validate each perturbation, so that any performance discrepancies would be indicative of bias. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
Abhay Gupta, Philip Meng, Ece Yurtseven
Large language models (LLMs) frequently generate plausible-sounding but factually incorrect outputs, known as hallucinations. We introduce AAVENUE (Activation Analysis for Verifying Extensive Neural Unit Explanations), a novel approach that detects hallucinations by analyzing internal model activations during generation. Our method identifies characteristic activation patterns associated with hallucinated content, enabling real-time detection without requiring external knowledge bases. AAVENUE achieves 87% accuracy on hallucination detection across diverse domains, significantly outperforming baseline approaches. We release our trained detection models and a benchmark dataset of labeled hallucinations.
The application takes 10 minutes and is reviewed on a rolling basis. We look for strong technical signal (projects, coursework, or competition results) and a genuine curiosity to do real research.
If admitted, you will join a structured pipeline with direct mentorship to take your work from ideation to top conference submission at venues like NeurIPS, ACL, and EMNLP.
