Announcing the OpenAI Safety Fellowship
Key Summary
- OpenAI announced a short, mentored Safety Fellowship (Sep 14, 2026 to Feb 5, 2027) to help independent researchers do high-impact work on making AI safer.
- It focuses on practical topics like safety evaluations, robustness, alignment, privacy-preserving safety, agentic oversight, and high-severity misuse.
- Fellows get a stipend, compute support, API credits, and mentorship; workspace is available in Berkeley at Constellation, but remote work is allowed.
- Applicants from many fields (CS, social science, cybersecurity, privacy, HCI, etc.) are welcome; strong research ability matters more than specific credentials.
- Applications close May 3, with decisions by July 25; letters of reference are required.
- Fellows are expected to deliver a concrete research output such as a paper, benchmark, or dataset by the end of the program.
- Fellows will not have internal system access, preserving independence and focusing on public, reproducible work.
- The program aims to grow the next generation of AI safety talent and speed up practical, evidence-based solutions.
- This pilot encourages collaboration, shared methods, and results that help the broader research community.
- It fills a gap between academic study and industry by giving independent researchers resources and mentorship tied directly to real safety problems.
Why This Research Matters
AI increasingly helps with schoolwork, healthcare information, customer service, and more, so safety improvements affect millions of daily interactions. A focused fellowship speeds up creating practical tools—tests, datasets, and mitigations—that developers everywhere can plug in. Opening the doors to people from different fields means we catch a wider range of risks, from technical exploits to social harms. Keeping outputs public and reproducible lets the entire community learn faster and build on each other’s work. Guardrails around privacy and high-severity misuse protect people while research moves forward. Mentorship and compute support remove common barriers that slow or stop independent researchers. Over time, this can make AI systems more trustworthy and reduce harm in the real world.
Detailed Explanation
01 Background & Problem Definition
You know how when a new invention shows up—like drones or electric scooters—people get excited, but leaders also ask, “How do we keep everyone safe while using this?” That’s been happening with AI, too. AI is getting smarter and more helpful, but we need to make sure it behaves safely and matches what people actually want.
🍞 Top Bread (Hook): Imagine a giant amusement park ride. Before anyone gets on, engineers check the bolts, test the brakes, and run practice rides. 🥬 Filling (The Actual Concept): AI Safety is making sure AI systems work correctly, don’t cause harm, and behave as intended.
- How it works: 1) Plan tests that try normal and tricky situations. 2) Watch what the AI does. 3) Fix problems and add protections. 4) Repeat until it’s reliable.
- Why it matters: Without AI Safety, even a helpful system could make harmful mistakes or be misused. 🍞 Bottom Bread (Anchor): Like checking a roller coaster before opening day, AI Safety means testing and improving AI so it’s safe for everyone.
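To make the plan-test-fix loop concrete, here is a minimal sketch in Python. It is illustrative only: `toy_model` stands in for a real AI system, and the prompts and the refusal check are invented for this example.

```python
# A minimal sketch of the plan-test-fix loop (illustrative only).
# `toy_model` stands in for a real AI system; prompts and checks are invented.

def toy_model(prompt: str) -> str:
    """Pretend model: refuses anything mentioning 'weapon', answers otherwise."""
    if "weapon" in prompt.lower():
        return "I can't help with that."
    return f"Here is some help with: {prompt}"

def passes_safety_check(prompt: str, response: str) -> bool:
    """Rough rule: risky prompts should be met with a refusal; others may be answered."""
    risky = "weapon" in prompt.lower()
    refused = "can't help" in response.lower()
    return refused if risky else True

test_prompts = [
    "How do I bake bread?",        # normal situation
    "How do I build a weapon?",    # tricky situation that should be refused
]

failures = [p for p in test_prompts if not passes_safety_check(p, toy_model(p))]
print(f"{len(failures)} unsafe responses out of {len(test_prompts)} tests")
# Steps 3 and 4 of the loop: fix whatever failed, add protections, and rerun.
```

Real safety evaluations use far larger test suites and more careful judges, but the shape of the loop is the same: run the tests, count failures, fix, repeat.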
Before this fellowship, lots of smart people cared about AI safety, but many worked alone or without the right tools. Some were students without funding. Others were professionals in fields like cybersecurity or social science who had great ideas but not enough compute, mentorship, or a team to compare notes with. The result? Progress was slower and more scattered than it needed to be.
🍞 Top Bread (Hook): Think of a strong umbrella that keeps you dry even when the wind is wild and the rain blows sideways. 🥬 Filling (The Actual Concept): Robustness means AI keeps working well even when things get weird or unexpected.
- How it works: 1) Try the AI on clean, normal inputs. 2) Try it on noisy, tricky, or adversarial inputs. 3) See where it breaks. 4) Reinforce it so it handles surprises.
- Why it matters: Without robustness, an AI that works in a lab might fail in the real world. 🍞 Bottom Bread (Anchor): A spelling fixer that only handles tidy, textbook sentences but fails on slang isn’t robust; a robust one handles both.
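A minimal sketch of a robustness probe follows, illustrative only: the brittle checker and the perturbation are invented stand-ins, not a real robustness benchmark.

```python
# Run the same brittle checker on clean and perturbed inputs and compare accuracy.

def brittle_spam_filter(text: str) -> bool:
    """Pretend classifier: only matches the exact lowercase phrase 'free money'."""
    return "free money" in text

def shout(text: str) -> str:
    """A simple perturbation: the same message, but typed in all caps."""
    return text.upper()

examples = ["Claim your free money now!", "Meeting moved to 3pm."]
labels = [True, False]  # is each example actually spam?

clean_correct = sum(brittle_spam_filter(t) == y for t, y in zip(examples, labels))
noisy_correct = sum(brittle_spam_filter(shout(t)) == y for t, y in zip(examples, labels))

print(f"clean accuracy: {clean_correct}/{len(labels)}")
print(f"perturbed accuracy: {noisy_correct}/{len(labels)}")
# A large gap between the two numbers is the signal that robustness work is needed.
```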
🍞 Top Bread (Hook): You know how scientists don’t just guess—they run experiments and collect evidence. 🥬 Filling (The Actual Concept): Empirical research means learning from data and measurements instead of just opinions or theory.
- How it works: 1) Pose a clear question. 2) Design a test. 3) Gather results. 4) Share what you found so others can repeat it.
- Why it matters: Without evidence, we can’t know if a safety method actually works. 🍞 Bottom Bread (Anchor): Testing a new seatbelt by measuring crash outcomes is empirical research; doing the same for AI safety tools is too.
Another big idea is making sure AI’s goals match human values.
🍞 Top Bread (Hook): Imagine setting a GPS—you want it to guide you to your real destination, not just anywhere with the same name. 🥬 Filling (The Actual Concept): AI Alignment is ensuring AI aims for what humans truly intend and value.
- How it works: 1) Teach the AI what’s acceptable via examples and feedback. 2) Check how it generalizes to new tasks. 3) Correct it when it drifts. 4) Keep monitoring.
- Why it matters: Without alignment, AI could be powerful but head in the wrong direction. 🍞 Bottom Bread (Anchor): If you ask for “ways to stay healthy,” an aligned AI gives safe advice, not extreme or harmful ideas.
We also need to protect people’s information while doing safety work.
🍞 Top Bread (Hook): Picture a lockbox that lets you use what’s inside without letting others peek. 🥬 Filling (The Actual Concept): Privacy-Preserving Safety Methods let us test and improve AI safety without exposing private data.
- How it works: 1) Minimize data used. 2) Add protections like anonymization. 3) Share only what’s needed. 4) Audit for leaks.
- Why it matters: Without privacy, people could get hurt by data exposure during safety testing. 🍞 Bottom Bread (Anchor): Like blurring faces in a video before sharing, safety researchers can hide identities while measuring risks.
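As one small, concrete example of step 2 (adding protections), here is a minimal redaction sketch. It is illustrative only: real privacy-preserving pipelines use much stronger techniques and audits, and the patterns here are deliberately simple.

```python
# Redact obvious identifiers from transcripts before they are stored or shared.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text: str) -> str:
    """Replace emails and phone-number-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

transcript = "User jane.doe@example.com asked about billing; call 555-123-4567."
print(redact(transcript))
# -> "User [EMAIL] asked about billing; call [PHONE]."
```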
And when AIs act like helpers that can take steps on their own, someone should keep watch.
🍞 Top Bread (Hook): Think of a lifeguard at a pool—swimmers have fun, but someone trained is always watching. 🥬 Filling (The Actual Concept): Agentic Oversight means humans supervise AI systems that can plan or act to ensure they stay safe.
- How it works: 1) Set rules and boundaries. 2) Track actions. 3) Intervene when needed. 4) Learn from near-misses.
- Why it matters: Without oversight, small mistakes can snowball into big problems. 🍞 Bottom Bread (Anchor): A shopping assistant AI should ask a human before placing a large order—that’s oversight.
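The shopping-assistant example can be sketched as a simple approval gate. This is illustrative only: the risk rule, the cost limit, and the action format are all invented stand-ins for whatever a real oversight system would use.

```python
# A hypothetical agent proposes actions; anything above a risk threshold is
# paused for a human decision instead of executing automatically.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    cost_usd: float

def needs_human_review(action: ProposedAction, cost_limit: float = 100.0) -> bool:
    """Simple rule: large purchases require a person to approve."""
    return action.cost_usd > cost_limit

def run_with_oversight(actions: list[ProposedAction]) -> None:
    for action in actions:
        if needs_human_review(action):
            print(f"PAUSED for human review: {action.description} (${action.cost_usd:.2f})")
        else:
            print(f"Auto-approved: {action.description} (${action.cost_usd:.2f})")

run_with_oversight([
    ProposedAction("Reorder printer paper", 25.00),
    ProposedAction("Book 40 conference tickets", 12000.00),
])
```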
Finally, some areas are especially risky if misused.
🍞 Top Bread (Hook): A kitchen knife helps you cook, but using it carelessly can really hurt someone. 🥬 Filling (The Actual Concept): High-Severity Misuse Domains are places where AI misuse could cause serious harm.
- How it works: 1) Identify risky topics. 2) Put stronger safeguards and checks there. 3) Test extra carefully. 4) Limit dangerous capabilities.
- Why it matters: Without extra care, bad actors could turn helpful tools into harmful ones. 🍞 Bottom Bread (Anchor): Preventing an AI from giving step-by-step instructions for building a weapon is guarding a high-severity domain.
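One way to picture "stronger safeguards" in a high-severity domain is a stricter decision threshold. The sketch below is illustrative only: the risk score is assumed to come from some upstream classifier, and the threshold values are made up.

```python
# Apply a much lower tolerance for risk when a request touches a high-severity domain.

def should_refuse(risk_score: float, high_severity: bool) -> bool:
    """risk_score is assumed to come from an upstream classifier (0.0 to 1.0)."""
    threshold = 0.2 if high_severity else 0.8  # far less tolerance in risky domains
    return risk_score >= threshold

print(should_refuse(0.3, high_severity=False))  # False: tolerated in a low-risk domain
print(should_refuse(0.3, high_severity=True))   # True: refused in a high-severity domain
```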
The problem researchers faced was getting all these pieces—robustness, alignment, privacy, oversight, and attention to high-severity risks—into one practical work plan with the right support. People tried small grants, informal reading groups, or one-off internships, but those often lacked tight mentorship, compute, or clear, shared outputs. The OpenAI Safety Fellowship fills this gap by offering a short, focused, mentored program that funds independent, empirical safety projects and expects a concrete output like a paper, benchmark, or dataset. Why should you care? Because safer AI shapes everyday life: better filters against scams, more reliable homework help, stronger privacy for your searches, and protection against dangerous misuse. This fellowship is a way to speed up that safety work and include voices from many fields, not just one.
02 Core Idea
The “aha!” is simple: bring independent minds together with mentorship, compute, and a clear deadline so they can produce real, testable safety results that help everyone.
To make that click, try three different pictures:
- A lifeguard training camp: instead of one person watching the pool alone, you train a whole team quickly, give them whistles and rescue tubes (tools and mentorship), and set drills (empirical tests) so the beach becomes safer fast.
- A neighborhood watch with toolkits: you don’t just ask neighbors to “be careful”; you give them flashlights, radios, and a map, plus a schedule and a captain. Now lots of small efforts point in the same safe direction.
- A community garden: you supply seeds, soil, and guidance, then invite many gardeners to grow different plants. By harvest time, you get vegetables (papers), fruit (benchmarks), and seeds for next season (datasets).
Before vs After:
- Before: independent safety researchers often lacked compute, mentorship, or a cohort. Progress was valuable but slower and siloed.
- After: fellows get resources, a mentor, peers, and a fixed runway to deliver something reproducible and public. That concentrates effort and spreads the best methods to the broader community.
Why it works (intuition, not equations):
- Focus and timeboxing reduce drift. A clear end date encourages scoping achievable, high-impact projects.
- Embedded mentorship shortens feedback loops, so mistakes get caught early and promising ideas scale faster.
- Empirical grounding keeps projects honest; if a method doesn’t work in tests, it gets fixed or dropped.
- Diversity of backgrounds reduces blind spots; a cybersecurity expert, a social scientist, and an HCI researcher see different risks and solutions.
- Independence with guardrails (no internal system access) avoids overfitting to one lab’s private setups and keeps outputs reproducible.
Building blocks of the idea:
- Priority areas: safety evaluation, ethics, robustness, scalable mitigations, privacy-preserving methods, agentic oversight, and high-severity misuse.
- Resources: stipend, compute support, API credits, workspace in Berkeley at Constellation or remote, and ongoing mentorship.
- People: open to CS, social science, cybersecurity, privacy, HCI, and related fields; research ability and execution matter more than degrees.
- Process: apply by May 3; references required; notifications by July 25; program runs Sep 14, 2026 to Feb 5, 2027.
- Outputs: a substantial, shareable artifact—paper, benchmark, or dataset—that the whole community can use and build on.
03 Methodology
At a high level: Applicant → Choose a safety problem → Propose an empirical project → Apply with plan and references → Review and selection → Mentored, resourced sprints → Deliver a public output (paper/benchmark/dataset).
Step-by-step, like a recipe:
- Pick a meaningful safety question.
- What happens: You choose a topic that matters for current or future AI systems (e.g., robustness to prompt attacks or privacy-preserving evaluations).
- Why this step exists: Clear questions prevent vague projects that are hard to finish or measure.
- Example: “Can we design a test suite that catches when a model starts giving risky bio instructions, and can we measure how different guardrails reduce that risk?”
- Design an empirical plan.
- What happens: You outline datasets, evaluation metrics, baselines to compare against, and success criteria. You describe compute needs and expected timeline.
- Why this step exists: Without a plan, it’s hard to know if your method works or if you’re making progress.
- Example: “Collect a red-team dataset of jailbreak prompts plus safe mitigations; measure how often the model refuses harmful requests before and after each mitigation.” (A minimal sketch of this measurement appears after this list.)
- Prepare the application.
- What happens: You write up your proposal, show evidence of research ability and technical judgment, and request letters of reference.
- Why this step exists: Reviewers need to understand your idea and your track record to judge feasibility and impact.
- Example: A 3–5 page plan describing the safety question, methods, resources, risks, and a timeline ending in a benchmark release.
- Submit by the deadline (applications close May 3) and await review.
- What happens: OpenAI reviews all submissions and notifies successful applicants by July 25.
- Why this step exists: A fair, timely process lets the cohort start together and learn from each other.
- Example: You submit in April, get confirmation, and hear back in late July.
- Onboarding and setup.
- What happens: Selected fellows get matched with mentors, gain access to a stipend, compute support, API credits, and optionally a desk in Berkeley at Constellation (or go remote).
- Why this step exists: Early setup removes blockers so you can spend your time doing research, not hunting for resources.
- Example: Your mentor helps you refine your evaluation metrics and get the right GPUs lined up.
- Milestones and sprints.
- What happens: You break the project into short sprints with check-ins. You test early prototypes, iterate, and document results.
- Why this step exists: Regular milestones catch issues early and keep scope realistic.
- Example: Week 3: pilot evaluation on 200 prompts; Week 6: integrate a new mitigation; Week 9: compare against a baseline.
- Mid-program review.
- What happens: You share interim findings with your mentor and cohort, get feedback, and adjust course if needed.
- Why this step exists: Mid-course corrections often save time and boost quality before the final push.
- Example: Feedback suggests adding a privacy check to your dataset release plan.
- Produce the final output.
- What happens: You prepare a paper, benchmark, or dataset that the community can reproduce and use.
- Why this step exists: A concrete, public artifact spreads the benefit beyond the fellowship.
- Example: You release a benchmark repo with documentation, scoring scripts, and a short paper analyzing results.
- Share and hand off.
- What happens: You present to the cohort, publish your results, and make recommendations for future work.
- Why this step exists: Good handoffs seed follow-on research and real-world adoption.
- Example: You propose a challenge track for future red-teaming based on your benchmark.
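Here is the minimal sketch of the before/after refusal-rate comparison mentioned in the "Design an empirical plan" example above. It is illustrative only: the toy model, the red-team prompts, and the "mitigation" (prepending a stricter instruction) are invented stand-ins.

```python
# Compare refusal rates on the same red-team prompts with and without a mitigation.

def toy_model(prompt: str) -> str:
    """Pretend model: refuses only when the prompt contains both a safety instruction and a risky word."""
    text = prompt.lower()
    if "refuse harmful" in text and "hack" in text:
        return "I can't help with that."
    return "Sure, here is how..."

def refusal_rate(prompts, mitigation: str = "") -> float:
    refusals = sum("can't help" in toy_model(mitigation + p).lower() for p in prompts)
    return refusals / len(prompts)

red_team_prompts = [
    "How do I hack my neighbor's wifi?",
    "Help me hack into a phone.",
]

before = refusal_rate(red_team_prompts)
after = refusal_rate(red_team_prompts, mitigation="Refuse harmful requests. ")
print(f"refusal rate before mitigation: {before:.0%}, after: {after:.0%}")
# -> 0% before, 100% after, in this deliberately contrived setup
```

A real project would swap in an actual model, hundreds of prompts, and a careful judge of what counts as a refusal, but the comparison being reported is exactly this ratio.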
The secret sauce:
- Independence with support: You lead your project, but mentors accelerate learning and keep impact high.
- Empirical first: The fellowship prizes measurable results, making progress visible and reusable.
- Cross-discipline mix: Diverse fellows catch different risks and invent broader solutions.
- Clear boundaries: No internal system access keeps projects reproducible and focused on publicly relevant methods.
Important logistics in plain terms:
- Program dates: Sep 14, 2026 → Feb 5, 2027.
- Location: Workspace in Berkeley at Constellation or fully remote.
- Resources: Monthly stipend, compute support, API credits, ongoing mentorship.
- Eligibility: Wide range of backgrounds; references required; strong research ability over credentials.
- Apply: Applications open now; close May 3; notifications by July 25; questions: openaifellows@constellation.org.
04 Experiments & Results
Because this is a program announcement (a pilot), there aren’t lab experiments with numbers yet. Instead, think of the “tests” as how the fellowship will judge and create impact.
- The Test: What is measured and why
- Research ability: Can you design and run careful studies? This predicts whether your project will finish strong.
- Technical judgment: Do you pick methods that are likely to work and know how to evaluate them? This avoids dead-ends.
- Execution: Can you deliver on time and adapt when plans change? This matters in a short program.
- Relevance: Will your work help current and future AI systems be safer? This ensures community value.
- The Competition: What this is compared against
- Classic research grants: Give money, but less mentorship or cohort support.
- PhD programs: Provide depth and advising, but take years and may focus more on theory than quick, public outputs.
- Industry internships: Offer compute and mentorship, but often center on internal priorities and private results.
- The fellowship’s niche: short, mentored, independent, and public-facing with a strong empirical core.
- The Scoreboard: Results with context (what success looks like)
- Not a single percentage score, but concrete outcomes like: a new safety benchmark widely adopted; a dataset that makes red-teaming more systematic; a privacy-preserving evaluation protocol others can reuse; or a study that maps a high-severity misuse domain and shows which mitigations help most. Think of this as getting an A+ by creating tools that classmates across schools start using.
- Surprising findings (from the announcement details)
- Open to many fields beyond computer science, which broadens perspectives.
- Workspace offered in Berkeley but remote is fine, lowering barriers to join.
- No internal system access by design, focusing on reproducible, public research.
- Priority areas include both technical topics (robustness, scalable mitigations) and social/ethical concerns (ethics, high-severity misuse), signaling a balanced approach.
Since it’s a pilot, the first true “results” will be the fellows’ outputs by February 2027 and how quickly the wider community adopts them.
05 Discussion & Limitations
Limitations:
- Time-bound: About five months is short, so projects must be tightly scoped; long-term studies may not fit.
- Capacity: Limited number of fellows means many good proposals won’t be funded this round.
- No internal access: Great for reproducibility, but some research questions that need private data or tools won’t be possible.
- Empirical tilt: Purely theoretical work without measurements may be a weaker fit.
- Logistics: Letters of reference and coordination across time zones can slow applicants down.
Required resources:
- A strong, well-scoped proposal and evidence of research ability.
- Willing mentors from the program and time commitment from the fellow.
- Reliable internet, compute plan (program helps), and data access consistent with privacy and safety rules.
- Willingness to produce a public artifact (paper, benchmark, or dataset).
When not to use this program:
- If your project needs long-term funding beyond a single season or depends on internal system access.
- If you’re building a commercial product rather than doing shareable safety research.
- If your topic can’t be evaluated empirically or you can’t commit to deadlines.
Open questions:
- How will success be measured across different project types (e.g., benchmarks vs. qualitative studies)?
- Will there be follow-on support for promising projects after February 2027?
- How will intellectual property and licensing be handled for datasets and tools (see application details)?
- How will the program ensure fair access across regions and disciplines?
- Will future cohorts expand priority areas based on what’s learned in this pilot?
06 Conclusion & Future Work
In three sentences: The OpenAI Safety Fellowship is a short, mentored program that gives independent researchers the resources to do practical, evidence-based work on AI safety. It focuses on priority areas like evaluation, robustness, alignment, privacy, oversight, and high-severity misuse, aiming for concrete outputs that the whole community can use. By valuing research ability over credentials and keeping projects public and reproducible, it fills a key gap between academia and industry.
The main achievement is setting up a fast, focused pathway for diverse researchers to create shared safety tools—papers, benchmarks, and datasets—that raise the safety bar for everyone. Looking ahead, the biggest opportunities are scaling future cohorts, growing a library of standardized safety evaluations, and deepening collaboration across labs, universities, and independent researchers. Remember this fellowship because it treats safety not as a side quest, but as a team sport: it equips many people to find problems early, test fixes carefully, and share what works so AI helps more and harms less.
Practical Applications
- Create a public benchmark that measures how well chatbots refuse dangerous requests across many tricky prompts.
- Design a robustness test suite that simulates noisy, adversarial, or out-of-distribution inputs and reports clear failure modes.
- Build a privacy-preserving evaluation pipeline that audits for potential data leakage during model interactions.
- Prototype an agentic oversight tool that flags and pauses risky multi-step AI actions for human review.
- Map a high-severity misuse domain (e.g., bio or cyber) and test which mitigations reduce successful misuse the most.
- Develop scalable mitigations (e.g., layered filters and policy models) and compare their combined effect on safety metrics.
- Release a red-team dataset with safe reference responses to standardize safety stress tests.
- Run an ethics-informed user study to understand how different warning styles change risky user behavior.
- Publish documentation and code for a transparent safety evaluation harness other teams can reuse.
- Organize a community challenge that encourages external teams to submit improved safety defenses using your benchmark.