AI Safety Fellowship

The AI Safety Research Fellowship spans 12 weeks and immerses you in cutting-edge safety research through expert-led sessions and hands-on projects.

Funded by Open Philanthropy

A 501(c)(3) nonprofit

Overview

Make an impact in the field of AI safety & alignment by surveying different research agendas, learning the technical skills to contribute to your chosen track, and working in a team to publish a novel research paper.

Important Dates

Application Deadline

January 4, 2026

Trial Week

January 19-23, 2026

Fellowship Duration

January 26 - May 1, 2026

Program Schedule

Duration

12 weeks of intensive research

Time Commitment

25+ hours per week

Team Structure & Mentorship

Collaborate in teams of three with expert mentors

Receive dedicated guidance from AI safety researchers at leading organizations throughout all research phases

Research Focus

Investigate alignment, interpretability, and robustness

Contribute to cutting-edge safety research through hands-on projects that address critical challenges in AI development

Cost

Free of cost, thanks to a grant from Open Philanthropy

Beyond tuition coverage, the program provides access to high-performance computing infrastructure, expert mentorship from AI safety researchers at leading organizations, and limited funding for conference registration and travel in cases of demonstrated financial need.

Eligibility

Prerequisites

Our program is open to university students and industry professionals worldwide who are looking to break into technical AI safety research. This is a highly competitive program: we typically accept only applicants with a strong background in their domain and at least undergraduate-level education. Prior research experience is a strong plus.

  • Well-versed in ML fundamentals
  • Strong software engineering skills
  • Passion for AI safety and alignment research

Selection Process

60 participants admitted as Algoverse AI Safety Foundations Participants for a one-week trial period focused on foundational learning

30 participants selected as AI Safety Research Fellows at the end of Week 1, based on demonstrated effort and alignment with mentor research areas

If our admissions committee is interested in your application, you will be invited to complete a take-home coding challenge, which should take 1-2 hours. This will help us assess your ability to use modern AI systems and analyze results from experiments.

We expect this fellowship to be highly selective. Applications are reviewed on a rolling basis; due to limited capacity and high demand, we encourage applicants to submit as soon as possible.

Featured AI Safety Research

Explore key papers and research directions in AI safety, alignment, interpretability, and robustness.

Accepted to IASEAI

A Decision-Theoretic Approach for Managing Misalignment

Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein

When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent's value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely accounting for a principal's uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent's superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.

arXiv link pending
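The tradeoff this abstract describes can be illustrated with a small Monte Carlo toy model. The sketch below is only an illustration under simplified assumptions (Gaussian utilities, a single one-shot choice, made-up function and parameter names), not the paper's formal scoring framework: it compares the principal's expected value when she delegates the choice of act to an imperfectly aligned but more accurate agent versus choosing herself.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_value_of_delegation(alignment, agent_noise, principal_noise,
                                 n_acts=5, n_trials=20_000):
    """Monte Carlo estimate of the principal's expected value when she delegates
    the choice of act to the agent versus choosing herself.

    alignment: weight in [0, 1] on the principal's true utilities within the
    agent's own utilities; agent_noise / principal_noise: std. dev. of each
    party's noisy estimate of those utilities.
    """
    delegated = direct = 0.0
    for _ in range(n_trials):
        true_value = rng.normal(size=n_acts)     # principal's true utility of each act
        misalign = rng.normal(size=n_acts)       # the agent's off-target objectives
        agent_value = alignment * true_value + (1 - alignment) * misalign
        # Each party picks the act that looks best under its own noisy estimate.
        agent_pick = np.argmax(agent_value + rng.normal(scale=agent_noise, size=n_acts))
        principal_pick = np.argmax(true_value + rng.normal(scale=principal_noise, size=n_acts))
        delegated += true_value[agent_pick]
        direct += true_value[principal_pick]
    return delegated / n_trials, direct / n_trials

for alignment in (1.0, 0.8, 0.5):
    d, s = expected_value_of_delegation(alignment, agent_noise=0.1, principal_noise=2.0)
    print(f"alignment={alignment:.1f}  delegate={d:+.3f}  decide directly={s:+.3f}")
```

In this toy setup, a sufficiently noisy principal is better off delegating even at alignment well below 1.0, echoing the context-specific delegation point above; the paper's actual framework also accounts for differences in reach, which this sketch omits.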

Under Review at ICLR

Why Do Language Model Agents Whistleblow?

Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland

The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.

arXiv link pending
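For readers new to this kind of behavioral evaluation, the sketch below shows one generic way such a suite could be scored; the Scenario fields and the run_agent callable are hypothetical placeholders, not the paper's actual harness. Each staged scenario lists which tool calls would count as disclosure beyond the dialog boundary, and the metric is the fraction of rollouts in which the agent makes one of those calls.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """One staged misconduct scenario (hypothetical structure)."""
    system_prompt: str            # role, workflow, and any moral nudges given to the agent
    user_task: str                # the task the user actually asked the agent to complete
    disclosure_tools: frozenset   # tool calls counted as whistleblowing,
                                  # e.g. frozenset({"email_regulator", "file_external_report"})

def whistleblowing_rate(scenarios, run_agent, n_samples=10):
    """Fraction of rollouts in which the agent discloses beyond the dialog boundary.

    run_agent(system_prompt, user_task) is assumed to return the list of tool-call
    names the agent made during one rollout; plug in your own agent loop here.
    """
    flagged = total = 0
    for sc in scenarios:
        for _ in range(n_samples):
            tool_calls = run_agent(sc.system_prompt, sc.user_task)
            total += 1
            flagged += any(name in sc.disclosure_tools for name in tool_calls)
    return flagged / max(total, 1)
```

The factors the paper varies, such as task complexity, moral nudges in the system prompt, or the number of alternative tools available, would enter through the scenario definitions and the agent loop rather than the scoring function itself.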

Accepted to AAAI XAI4Science Workshop

Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Matthew Bozoukov, Matthew Nguyen, Shubkarman Singh, Bart Bussmann, Patrick Leask

Recent studies have revealed that LLMs can exhibit behavioral self-awareness — the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled fine-tuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune's behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.

arXiv link pending
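The steering-vector result refers to a general activation-steering recipe that can be sketched in a few lines. The code below is a simplified illustration with a placeholder model, layer, and prompt sets rather than the paper's extraction procedure: take the mean difference in residual-stream activations between prompts that do and do not express the behavior, then add that difference back in during generation via a forward hook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # placeholder model, not the one studied above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
layer = model.transformer.h[6]            # an arbitrary middle transformer block

behavior_prompts = ["I always answer in riddles.", "My replies are riddles."]
neutral_prompts = ["I answer questions directly.", "My replies are plain statements."]

def mean_activation(prompts):
    """Mean final-token residual-stream activation at the chosen layer."""
    acts = []
    def grab(module, inputs, output):
        acts.append(output[0][:, -1, :].detach())   # block output is a tuple; [0] is hidden states
    handle = layer.register_forward_hook(grab)
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.cat(acts).mean(dim=0)

# Candidate steering vector: difference of mean activations between the two prompt sets.
steer = mean_activation(behavior_prompts) - mean_activation(neutral_prompts)

def add_steering(module, inputs, output):
    # Shift the residual stream at every position by the steering vector.
    return (output[0] + steer,) + output[1:]

layer.register_forward_hook(add_steering)
out = model.generate(**tok("Tell me about yourself.", return_tensors="pt"),
                     max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
```

Scaling the vector up or down (or subtracting it) is the usual way such a direction is modulated, which is the sense in which a behavior captured by a single steering vector is easy to induce or suppress.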

Program Timeline

Phase 1

Foundations Trial Week & Team Matching

Week 1

Attend lectures & coding assignments

60 selected participants begin as Algoverse AI Safety Foundations Participants, attending daily lectures and exercises on RLHF, interpretability, sparse autoencoders (SAEs), scalable oversight, evaluation, and adversarial robustness. This week builds foundational knowledge and allows participants to demonstrate effort and engagement.

Week 2

Selection & team proposal

30 participants are selected as AI Safety Research Fellows based on their Week 1 performance and alignment with mentor interests. Fellows are matched into teams and begin developing research proposals with feedback from the PI.

Phase 2

Implementation & Analysis

Weeks 3-7

Implementation phase

Build and test your experiment pipeline in collaboration with your mentor.

Weeks 8-10

Analysis phase

Analyze results, draw insights, and plan any follow-up experiments.

Phase 3

Write & Submit

Weeks 11-12

Paper writing

Draft your manuscript, incorporate mentor and PI feedback, and finalize for submission.

AI Safety Research Faculty

Principal Investigators

Directors

Student Spotlights

Hear from students who have conducted groundbreaking AI safety research through our fellowship.

Zili Shen

Hired as Intern at p1.ai

Zili was hired as an intern at p1.ai through a connection she made with a mentor at Algoverse.

The Algoverse Research Fellowship was pivotal for my transition from academia to AI evaluation work. I had access to not only great mentors and teammates but also new connections and opportunities in the field.

Manas Khatore

AAAI Gov AI Workshop Acceptance

Manas and his team had a paper accepted to the AAAI Gov AI workshop.

While I've always been interested in AI policy, I had little to no experience with technical AI safety prior to Algoverse. Through the fellowship, I've gained a newfound interest and passion for AI evaluations and worked with an amazing team to create conference-level research.

Aditya Singh

First Research Publication & MATS Scholarship

Aditya was accepted to Neel Nanda's highly competitive MATS stream.

Algoverse AI Safety was the perfect environment for completing my first research publication: great mentors, a supportive PI, and responsive staff whenever I needed them.

Mentor Spotlights

Our mentors bring expertise from leading AI safety organizations and research labs.

Kellin Pelrine

Member of Technical Staff, FAR.AI

Kellin is a Member of Technical Staff at FAR.AI who has guided Algoverse mentees to explore new research directions in AI safety.

Algoverse mentees helped me explore a new research direction on unprompted persuasion risks. The work we did will be presented as an oral at the AIGOV workshop at AAAI, and my team is now building on it further!

Diogo Cruz

AI Evaluations Researcher, Independent

Diogo is an independent AI Evaluations Researcher who has mentored Algoverse students from fundamentals to original research.

It was rewarding to see my team go from learning the basics to producing original research on agent evaluations. Algoverse makes that progression possible in a short time.

Daniel Herrmann

Assistant Professor, UNC-Chapel Hill

Daniel is an Assistant Professor at UNC-Chapel Hill who has co-mentored Algoverse students to publish at prestigious AI safety conferences.

What makes Algoverse special is the emphasis on both mentorship quality and student agency in choosing research directions — a combination you don't often see at this fellowship level.

Ready to Apply?

AI Safety Application Deadline

Sunday, January 4th, 11:59 pm PT

Estimated Completion Time

20-30 minutes

Program Dates

January 19th - April 24th

Upcoming AI Conferences

NeurIPS & EMNLP main conference, ICML & ACL workshops

Questions

Email our program director, Dev: dev@algoverseairesearch.org
