

AI Safety Fellowship

The AI Safety Research Fellowship spans 12 weeks and immerses you in cutting-edge safety research through expert-led sessions and hands-on projects.


Funded by Open Philanthropy

A 501(c)(3) nonprofit

Overview

Make an impact in the field of AI safety & alignment by surveying different research agendas, learning the technical skills to contribute to your chosen track, and working in a team to publish a novel research paper.

Important Dates

Application Deadline

January 4, 2026

Trial Week

January 19-23, 2026

Fellowship Duration

January 26 - May 1, 2026

Program Schedule

Duration

12 weeks of intensive research

Time Commitment

25+ hours per week

Team Structure & Mentorship

Collaborate in teams of three with expert mentors

Receive dedicated guidance from AI safety researchers at leading organizations throughout all research phases

Research Focus

Investigate alignment, interpretability, and robustness

Contribute to cutting-edge safety research through hands-on projects that address critical challenges in AI development

Cost

Free of charge, thanks to a grant from Open Philanthropy

Beyond tuition coverage, this program provides access to high-performance computing infrastructure, expert mentorship from AI safety researchers at leading organizations, and limited funding available for conference registration and travel costs in cases of demonstrated financial need.

Eligibility

Prerequisites

Our program is open to university students and industry professionals worldwide who are looking to break into technical AI safety research. This is a highly competitive program, and we typically accept only applicants with a strong background in their domain and at least undergraduate-level education. Prior research experience is preferred.

  • Well-versed in ML fundamentals
  • Strong software engineering skills
  • Passion for AI safety and alignment research

Selection Process

60 participants are admitted as Algoverse AI Safety Foundations Participants for a one-week trial period focused on foundational learning.

30 participants are selected as AI Safety Research Fellows at the end of Week 1, based on demonstrated effort and alignment with mentor research areas.

If our admissions committee is interested in your application, you will be invited to complete a take-home coding challenge, which should take 1-2 hours. This will help us assess your ability to use modern AI systems and analyze results from experiments.

We anticipate that this fellowship will be highly selective. Applications are reviewed on a rolling basis; due to limited capacity and high demand, we encourage applicants to submit as soon as possible.

Featured AI Safety Research

Explore key papers and research directions in AI safety, alignment, interpretability, and robustness.

EACL SRW Spotlight Paper

You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases

Isaia Gisler, Zhonghao He, Tianyi Qiu

When language models are trained on synthetic data, a student model can covertly acquire behavioral traits from the data-generating teacher model. Subliminal learning refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces, including transmission of misaligned behaviors. We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher's preference can block it. We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student's preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike. The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity. This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.

Accepted to IASEAI

A Decision-Theoretic Approach for Managing Misalignment

Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein

When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent's value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff, precisely accounting for a principal's uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent's superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.

Under Review at ICLR

Why Do Language Model Agents Whistleblow?

Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland

The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.

Accepted to AAAI XAI4Science Workshop

Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Matthew Bozoukov, Matthew Nguyen, Shubkarman Singh, Bart Bussmann, Patrick Leask

Recent studies have revealed that LLMs can exhibit behavioral self-awareness — the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled fine-tuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune's behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.

Program Timeline

Phase 1

Foundations Trial Week & Team Matching

Week 1

Attend lectures & coding assignments

60 selected participants begin as Algoverse AI Safety Foundations Participants, attending daily lectures and exercises on RLHF, interpretability, SAEs, scalable oversight, evaluation, and adversarial robustness. This week builds foundational knowledge and allows participants to demonstrate effort and engagement.

Week 2

Selection & team proposal

30 participants are selected as AI Safety Research Fellows based on their Week 1 performance and alignment with mentor interests. Fellows are matched into teams and begin developing research proposals with feedback from the PI.

Phase 2

Implementation & Analysis

Weeks 3-7

Implementation phase

Build and test your experiment pipeline in collaboration with your mentor.

Weeks 8-10

Analysis phase

Analyze results, draw insights, and plan any follow-up experiments.

Phase 3

Write & Submit

Weeks 11-12

Paper writing

Draft your manuscript, incorporate mentor and PI feedback, and finalize for submission.

AI Safety Research Faculty

Principal Investigators

Directors

Student Spotlights

Hear from students who have conducted groundbreaking AI safety research through our fellowship.

Zili Shen


Hired as Intern at p1.ai

The Algoverse Research Fellowship was pivotal for my transition from academia to AI evaluation work. I had access to not only great mentors and teammates but also new connections and opportunities in the field.

Zili was hired as an intern at p1.ai through a connection she made with a mentor at Algoverse.

Manas Khatore


AAAI Gov AI Workshop Acceptance

While I've always been interested in AI policy, I had little to no experience with technical AI safety prior to Algoverse. Through the fellowship, I've gained a newfound interest and passion for AI evaluations and worked with an amazing team to create conference-level research.

Manas and his team had a paper accepted to the AAAI Gov AI workshop.

Aditya Singh


First Research Publication & MATS Scholarship

Algoverse AI Safety was the perfect environment for completing my first research publication: great mentors, a supportive PI, and responsive staff whenever I needed them.

Aditya was accepted to Neel Nanda's highly competitive MATS stream.

Mentor Spotlights

Our mentors bring expertise from leading AI safety organizations and research labs.

Kellin Pelrine


Member of Technical Staff, FAR.AI

Algoverse mentees helped me explore a new research direction on unprompted persuasion risks. The work we did will be presented as an oral at the AIGOV workshop at AAAI, and my team is now building on it further!

Kellin is a Member of Technical Staff at FAR.AI who has guided Algoverse mentees to explore new research directions in AI safety.

Diogo Cruz


AI Evaluations Researcher, Independent

It was rewarding to see my team go from learning the basics to producing original research on agent evaluations. Algoverse makes that progression possible in a short time.

Diogo is an independent AI Evaluations Researcher who has mentored Algoverse students from fundamentals to original research.

Daniel Herrmann


Assistant Professor, UNC-Chapel Hill

What makes Algoverse special is the emphasis on both mentorship quality and student agency in choosing research directions — a combination you don't often see at this fellowship level.

Daniel is an Assistant Professor at UNC-Chapel Hill who has co-mentored Algoverse students to publish at prestigious AI safety conferences.


Ready to Apply?

AI Safety Application Deadline

Sunday, January 4th, 11:59 pm PT

Estimated Completion Time

20-30 minutes

Program Dates

January 19th - April 24th

Upcoming AI Conferences

NeurIPS & EMNLP main conference, ICML & ACL workshops

Questions

Email our program director, Dev: dev@algoverseairesearch.org

Applications Closed for Spring 2026