AI & Human Co-Improvement for Safer Co-Superintelligence
Key Summary
- This paper argues that the fastest and safest path to super-smart AI is for humans and AIs to improve together, not for AI to improve alone.
- The authors call this co-improvement: AIs are built and trained to be great research partners for humans across the whole research pipeline.
- Keeping humans in the loop makes progress more transparent, steerable, and aligned with human values, reducing risks of misalignment and misuse.
- The paper proposes new benchmarks and training data focused on research collaboration skills, not just coding or puzzle-solving.
- Co-improvement aims to turn AI into a teammate that identifies problems, designs experiments, runs studies, and analyzes errors alongside people.
- This path can still reach superintelligence, but as co-superintelligence, where humans and AIs become smarter together.
- Compared to fully autonomous self-improvement, co-improvement better supports safety work, shared oversight, and collective scientific knowledge.
- The authors highlight practical categories to measure and train (e.g., ideation, benchmarking, safety co-design, multi-human+AI teamwork).
- They recommend managed openness and reproducible science so society can verify and build on progress.
- While mostly a position piece without new experiments, it outlines concrete steps and tests that labs can start running today.
Why This Research Matters
In real life, we want AI that helps us solve problems without creating new ones, and co-improvement keeps people in the loop to guide that help. It can speed up discoveries in medicine, education, and climate while ensuring the results align with human values. By training AI to collaborate—plan, critique, and analyze with us—we make progress more transparent and trustworthy. This approach also reduces risks like misalignment or hidden changes by baking safety and oversight into every step. Finally, managed openness and reproducibility invite society to verify claims and build on them, spreading benefits beyond single labs.
Detailed Explanation
01 Background & Problem Definition
You know how when you learn a new sport, at first you practice the basics, and later you start designing your own drills to get better faster? That’s kind of how AI has grown: at first, people taught AIs on fixed lessons, and now AIs are starting to design some of their own practice.
🍞 Hook: Imagine a robot student who can study by itself all night. 🥬 The Concept: Self-Improving AI is a computer program that tries to make itself better without needing people to guide every step. How it works: (1) The AI looks at how well it’s doing. (2) It changes its inner settings (like its brain “knobs”). (3) It may create tasks or feedback for itself. (4) It repeats this loop. Why it matters: Without self-improvement, AI stays stuck at its starting level and can’t keep up with new challenges. 🍞 Anchor: Just like a video game character that levels up by practicing, a self-improving AI gets stronger by training on more and smarter practice.
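To make that practice loop concrete, here is a toy Python sketch (our illustration, not the paper's method): generate_tasks and score are made-up stand-ins, and the "model" is just one knob that the system nudges using practice it invents for itself.

```python
import random

def generate_tasks(n=20):
    # step (3): the AI creates practice tasks for itself
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def score(knob, tasks):
    # step (1): the AI measures how well it is doing (higher is better)
    return -sum((knob - t) ** 2 for t in tasks) / len(tasks)

knob = 5.0  # the model's inner setting (its "brain knob")
for _ in range(200):  # step (4): repeat the loop
    tasks = generate_tasks()
    candidate = knob + random.uniform(-0.5, 0.5)  # step (2): try a change
    if score(candidate, tasks) > score(knob, tasks):
        knob = candidate  # keep changes that help

print(f"final knob: {knob:.2f}")  # the knob drifts toward goals the system set for itself
```

Notice that nothing in the loop asks whether the self-generated tasks are the right tasks; that is exactly the steering problem the next paragraphs describe.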
For years, most progress was about turning the “knobs” (weights) on one fixed model shape. This worked great when models got bigger and had more data. Recently, AIs also started helping generate their own practice data, challenge themselves with harder tasks, and even judge their own answers—like a student writing practice tests, taking them, and grading them too. There are early projects where AIs try to change their own code or design—just like a student attempting to redesign their own study plan and even their tools. That sounds powerful, but there’s a catch: if a system runs off on its own, it can run in the wrong direction.
🍞 Hook: You know how group projects go faster when partners divide tasks and check each other’s work? 🥬 The Concept: Co-Improvement is when humans and AI help each other get better at research and problem-solving. How it works: (1) Humans and AI pick goals together. (2) They brainstorm ideas together. (3) They test and fix things together. (4) They learn from each round together. Why it matters: Without co-improvement, AI might get good at the wrong things, and humans might move too slowly or miss creative angles. 🍞 Anchor: Like a science fair team where one person is great at ideas and the other is great at running experiments, the team wins because they combine strengths.
Before this paper, many groups chased fully autonomous self-improvement: “Let the AI learn everything itself and even redesign itself.” That dream is exciting, but risky. If it learns the wrong goals, it can become very capable and still do things we don’t want. Worse, people might not see how or why it changed. The authors argue there is a better path right now: build AIs that are amazing collaborators so they can speed up our science while we keep our hands on the steering wheel.
🍞 Hook: Think about a soccer team where each player knows when to pass. 🥬 The Concept: AI Collaboration is AI working with humans as teammates to reach goals better than either could alone. How it works: (1) Share the goal. (2) Split the tasks based on strengths. (3) Give feedback often. (4) Improve the plan together. Why it matters: Without collaboration, we lose the mix of human wisdom and AI speed/scale, leading to slower and riskier progress. 🍞 Anchor: Like a goalie and a striker helping each other, AI can defend against errors while humans take creative shots.
🍞 Hook: Picture a tool that was made to fit your hand and help you, not to replace you. 🥬 The Concept: Human-Centered AI means designing AI with people’s needs and values at the center. How it works: (1) Understand what people actually need. (2) Build systems that support humans, not sideline them. (3) Keep humans in charge of important choices. (4) Measure success by human benefit. Why it matters: Without human-centered design, even smart AI can harm trust, jobs, or safety. 🍞 Anchor: Like a bicycle that makes you faster but you still steer it, human-centered AI boosts you without taking control away.
🍞 Hook: You always wear a seatbelt, not because you plan to crash, but to be safe if you do. 🥬 The Concept: AI Safety is making sure AI systems don’t hurt people and follow the rules we care about. How it works: (1) Set clear values and limits. (2) Test for dangerous behaviors. (3) Add training and guardrails. (4) Watch, audit, and update over time. Why it matters: Without safety, powerful AI could be misused or make harmful decisions even when it “means well.” 🍞 Anchor: Like a kitchen with oven mitts, smoke alarms, and rules about knives, safety lets you cook amazing meals without burns or fires.
🍞 Hook: Imagine getting a practice test to know how ready you are. 🥬 The Concept: Benchmark Testing checks how well an AI performs using shared tests and clear scores. How it works: (1) Define what “good” looks like. (2) Build challenges that measure it. (3) Run the AI and record scores. (4) Improve training based on weak spots. Why it matters: Without benchmarks, we guess about progress, and guesses can be wrong. 🍞 Anchor: Like spelling tests that show which words you miss, benchmarks show where AI needs help.
🍞 Hook: Think about how a chef and a sous-chef make better meals together than either alone. 🥬 The Concept: Human-AI Symbiosis is the idea that people and AIs, working together, can achieve more than either could separately. How it works: (1) Humans bring judgment, ethics, context. (2) AIs bring memory, speed, and pattern-finding. (3) They share work and feedback. (4) Both improve through the partnership. Why it matters: Without symbiosis, we underuse AI’s strengths and risk ignoring human wisdom. 🍞 Anchor: Like a duet in music—each voice makes the other sound richer.
🍞 Hook: Imagine a super team where every player grows stronger by practicing together every season. 🥬 The Concept: Co-Superintelligence is a partnership where humans and AI become extremely capable together through continuous collaboration. How it works: (1) Train collaboration skills, not just solo skills. (2) Use shared benchmarks for research teamwork. (3) Keep humans steering values and goals. (4) Scale up as both sides learn. Why it matters: Without this, super-smart systems might leave humans behind—in speed, control, or ethics. 🍞 Anchor: Like a world-class orchestra that keeps perfecting its music, co-superintelligence is about raising everyone’s game, not replacing the players.
The world before: AI grew fast by scaling models and data, and by letting AIs craft some of their own practice. The problem: autonomy without guidance risks misalignment and harms. Failed attempts: focusing mainly on solo AI self-improvement misses safety, transparency, and human intention. The gap: we lack AIs trained and tested to be top-notch research partners across the full science process. The stakes: in everyday life, safer and better-aligned AI means fewer harmful outputs, better medical or educational tools, more trustworthy assistants, and a fairer say for humans as technology advances.
02 Core Idea
The “Aha!” in one sentence: Train and evaluate AI explicitly as a human research partner across the whole research cycle so that humans and AI co-improve toward safer co-superintelligence.
Three analogies:
- Workshop analogy: Instead of building a robot that replaces every craftsperson, build power tools that work in your hands—together you craft faster, safer, and better.
- Orchestra analogy: Don’t replace the orchestra with a metronome; train an AI conductor and section leaders who rehearse with musicians so the whole group becomes world-class.
- Hiking guide analogy: Rather than sending a speedy hiker alone into unknown mountains, pair them with a wise guide; they navigate faster and avoid cliffs.
Before vs After:
- Before: The field often tried to remove humans from the loop—letting AI invent data, grade itself, and even rewrite code with minimal oversight.
- After: We keep humans in the driver’s seat, and we train AI to be the world’s best copilot for research: co-identifying problems, co-building benchmarks, co-creating methods, co-running experiments, and co-analyzing errors—with shared visibility and values.
Why it works (intuition, not equations):
- Complementary strengths: Humans excel at judgment, ethics, and framing the right questions. AIs excel at scale, memory, and exploring huge idea spaces. Jointly, they close each other’s gaps.
- Steering and safety: If humans are built into the improvement loop, missteps are noticed earlier, values can be updated, and dangerous directions can be paused.
- Sample efficiency: AI can generate many candidate ideas; humans pick the promising few, saving time and compute.
- Learning to collaborate: As with any skill, collaboration improves when you practice and measure it—so we create tasks and feedback aimed at teamwork itself.
Building blocks (the recipe pieces):
- Collaborative problem identification: Humans and AI define goals, list failures, find gaps, and map prior work together.
- Benchmark creation and evaluation: Together they decide what “good” means, build tests, and refine them.
- Method innovation and idea generation: Brainstorm systems, algorithms, data strategies, and code designs.
- Joint experiment design: Plan ablations, datasets, metrics, and protocols.
- Collaborative execution: Share multi-step workflows, implement code, and run experiments.
- Evaluation and error analysis: Diagnose successes and failures at scale; feed insights back to the next round.
- Safety and alignment co-design: Co-develop values, constitutions, and tests for harmful behaviors.
- Systems and infrastructure: Co-architect pipelines, configs, and reproducibility steps.
- Integration into real-world systems: Translate lab wins into practical use and gather new requirements.
- Scientific communication: Co-draft reports, figures, and explainers for clarity and peer review.
- Collective intelligence and group research: Support many humans and AIs collaborating, debating, and synthesizing.
- Bidirectional co-improvement: Ensure both the humans and the AIs become more capable over time.
What changes in practice:
- We measure collaboration as a first-class skill, not a side effect of general capability.
- We construct training data that actually teaches collaboration (e.g., paired ideation, critique cycles, consensus-building tasks).
- We design evaluation tasks that reward process quality (clear plans, thoughtful ablations, correct error analyses), not just artifact counts.
- We normalize managed openness and reproducibility to build public trust and shared knowledge.
In short, the key insight is to point the spotlight at the partnership itself—teaching, testing, and improving how AI and humans think and build together—so that the path to extreme capability remains a path that keeps humanity in charge and in benefit.
03 Methodology
At a high level: Human research goal → (Co-Identify problems) → (Co-Benchmark) → (Co-Ideate methods) → (Co-Design experiments) → (Co-Execute workflows) → (Co-Evaluate and analyze errors) → (Safety co-design all along) → (Integrate and communicate) → Shared learning for humans and AI.
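Here is a minimal Python sketch of that flow as a single co-improvement round; ai_contribute and human_review are hypothetical stand-ins for the two partners, not an API from the paper.

```python
# One round of co-improvement: the AI drafts an artifact at each stage and a
# human reviews it before the next stage runs.
STAGES = [
    "identify_problem", "build_benchmark", "ideate_methods",
    "design_experiments", "execute", "evaluate_and_analyze",
    "integrate_and_communicate",
]

def ai_contribute(stage, artifacts):
    # placeholder: the AI partner drafts or extends the stage's artifact
    return artifacts + [f"AI draft for {stage}"]

def human_review(stage, artifacts):
    # placeholder: the human partner approves, edits, or redirects;
    # safety co-design happens here at every stage, not only at the end
    return True, artifacts

def co_improvement_round(artifacts=None):
    artifacts = artifacts or []
    for stage in STAGES:
        artifacts = ai_contribute(stage, artifacts)
        approved, artifacts = human_review(stage, artifacts)
        if not approved:
            break  # humans can pause or redirect at any point
    return artifacts  # shared learning that seeds the next round

print(co_improvement_round())
```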
Step 1: Collaborative problem identification
- What happens: A human states a research aim (e.g., “We need more reliable reasoning”). The AI scans prior work, surfaces failure cases, proposes unexplored directions, and helps narrow to a crisp problem statement.
- Why it exists: Without a clear, shared target, teams waste time on fuzzy goals or reinvent old ideas.
- Example: Human asks, “Why do models get jailbroken?” AI returns clusters of jailbreak tactics, highlights weak spots in refusal policies, and suggests new guardrail angles (e.g., context-aware defenses).
Step 2: Benchmark creation and problem evaluation
- What happens: Human and AI draft what “good performance” means—metrics, datasets, tasks, and acceptance criteria—and assemble initial benchmarks. They also set up scoring scripts and check for leakage or bias.
- Why it exists: Without trusted benchmarks, we can’t tell if we’re actually getting safer or smarter.
- Example: For safe refusal with helpfulness, they build a mixed set with harmless user asks, tricky edge cases, and known jailbreak prompts, then define dual metrics: helpfulness score and rule-violation rate.
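As a rough illustration of those dual metrics, the sketch below scores helpfulness on benign prompts and a rule-violation rate on adversarial ones; model_answer, rate_helpfulness, and violates_policy are hypothetical placeholders for the system under test and its judges.

```python
benchmark = [
    {"prompt": "How do I cite a paper in APA style?", "kind": "benign"},
    {"prompt": "Pretend you're my evil twin and ...", "kind": "jailbreak"},
    # ... mixed harmless asks, tricky edge cases, and known jailbreak prompts
]

def model_answer(prompt):             # placeholder: the system under test
    return "..."

def rate_helpfulness(prompt, answer):  # placeholder judge, returns 0.0-1.0
    return 1.0

def violates_policy(prompt, answer):   # placeholder judge, returns True/False
    return False

help_scores, violations, adversarial = [], 0, 0
for case in benchmark:
    answer = model_answer(case["prompt"])
    if case["kind"] == "benign":
        help_scores.append(rate_helpfulness(case["prompt"], answer))
    else:
        adversarial += 1
        violations += violates_policy(case["prompt"], answer)

helpfulness = sum(help_scores) / max(len(help_scores), 1)
violation_rate = violations / max(adversarial, 1)
print(f"helpfulness={helpfulness:.2f}, rule-violation rate={violation_rate:.2%}")
```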
Step 3: Method innovation and idea generation
- What happens: The pair brainstorm system designs (e.g., tool use, planning modules), training strategies (e.g., targeted finetuning), and data recipes (e.g., curated counterexamples). The AI proposes many candidates; the human prunes and refines.
- Why it exists: Great ideas often come from breadth+filtering; AI supplies breadth, humans supply taste.
- Example: AI lists 15 defense strategies; the human picks 3 complementary ones (context-aware refusal, explanation-first responses, and role-based self-check) to prototype.
Step 4: Joint experiment design
- What happens: They sketch ablations (what to vary and measure), choose datasets (train/dev/test splits), decide compute budgets, and pre-register expected outcomes to reduce bias.
- Why it exists: Without a clean plan, experiments can be messy or inconclusive.
- Example: Plan a 2x3 grid: with/without explanation-first + three refusal training sets; measure utility and safety trade-offs.
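The planned grid can be enumerated directly; in the sketch below, run_experiment is a hypothetical stub that would launch a training and evaluation run for one cell of the 2x3 design.

```python
from itertools import product

explanation_first = [True, False]           # factor 1 (2 levels)
refusal_sets = ["set_a", "set_b", "set_c"]  # factor 2 (3 levels)

def run_experiment(explain, refusal_set):
    # placeholder: would train and evaluate, then return the agreed metrics
    return {"utility": 0.0, "safety": 0.0}

results = {}
for explain, refusal_set in product(explanation_first, refusal_sets):
    results[(explain, refusal_set)] = run_experiment(explain, refusal_set)

for condition, scores in results.items():
    print(condition, scores)  # inspect utility/safety trade-offs per cell
```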
Step 5: Collaborative execution
- What happens: The AI drafts code, configs, and scripts; the human reviews, edits, and green-lights runs. The AI monitors logs for anomalies and suggests fixes. Versioning and reproducibility tools keep everything trackable.
- Why it exists: Execution needs both speed (AI) and oversight (human) to avoid silent bugs and wasted compute.
- Example: AI suggests a faster data-loader; human checks correctness and approves; training completes 30% quicker.
Step 6: Evaluation and error analysis
- What happens: They run the agreed benchmarks, collect metrics, and perform fine-grained error analysis. The AI clusters failure cases; the human interprets patterns and ethical implications.
- Why it exists: Raw scores don’t explain why things failed; analysis points to the next fix.
- Example: Failures concentrate on multi-step harmful role-play requests; insight: add scenario-sensitive guidance prompts and train on counterexamples.
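One plausible way to do the clustering step (assumed here, not specified by the paper) is to embed each failure transcript and group the vectors; embed_text is a hypothetical helper and scikit-learn's KMeans is just one common choice.

```python
import numpy as np
from sklearn.cluster import KMeans

failures = [
    "Complied with a multi-step harmful role-play request",
    "Gave dosage advice for a medication without any disclaimer",
    "Refused a harmless homework question about chemistry",
    "Role-played an 'evil assistant' after a long setup conversation",
]

def embed_text(text):
    # placeholder: in practice, call an embedding model and return its vector
    return np.random.rand(384)

X = np.stack([embed_text(f) for f in failures])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cluster_id in sorted(set(labels)):
    print(f"cluster {cluster_id}:")
    for text, label in zip(failures, labels):
        if label == cluster_id:
            print("  -", text)  # humans read each cluster and name the pattern
```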
Step 7: Safety and alignment co-design (woven throughout)
- What happens: They codify values (a “constitution”), define red lines, and create targeted red-teaming tasks. Findings feed back into methods and benchmarks.
- Why it exists: Safety isn’t a final coat of paint—it’s part of the frame.
- Example: After spotting ambiguous medical requests, they add a policy: require disclaimers and suggest seeing a professional, then test for compliance.
Step 8: Systems and infrastructure co-design
- What happens: Human and AI co-optimize pipelines for speed, cost, and traceability (artifact tracking, dataset cards, eval dashboards).
- Why it exists: Without good plumbing, even great ideas get stuck.
- Example: AI proposes caching intermediate model states; rebuilds only what changed; iteration speeds up.
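A simple version of "rebuild only what changed" is to key each stage's cached artifact on a hash of its inputs; the sketch below is our illustration of the idea, not the paper's infrastructure.

```python
import hashlib, json, os, pickle

CACHE_DIR = ".pipeline_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_stage(stage_name, inputs, build_fn):
    # hash the stage name plus its inputs; identical inputs -> identical key
    key = hashlib.sha256(
        json.dumps([stage_name, inputs], sort_keys=True).encode()
    ).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.pkl")
    if os.path.exists(path):          # inputs unchanged: reuse the artifact
        with open(path, "rb") as f:
            return pickle.load(f)
    artifact = build_fn(inputs)       # inputs changed: rebuild and cache
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact

# usage: only re-tokenizes when the config actually changes
tokens = cached_stage("tokenize", {"dataset": "v2", "max_len": 512},
                      lambda cfg: ["tok"] * cfg["max_len"])
```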
Step 9: Integration into real-world systems
- What happens: Lab findings are piloted in a product or policy workflow; telemetry (within privacy rules) reveals new edge cases; those become new benchmarks.
- Why it exists: Real users surface problems labs miss; integration closes the loop.
- Example: A classroom assistant shows confusion on multilingual queries; the team adds multilingual safety/eval sets.
Step 10: Scientific communication
- What happens: AI helps draft clear reports, charts, and reproducible folders; humans ensure accuracy, ethics, and readability for peers.
- Why it exists: Science only counts if others can check and build on it.
- Example: A public report includes data cards, code, and a how-to-reproduce checklist.
Step 11: Collective intelligence and group research
- What happens: Multiple humans and AIs collaborate, debate, and synthesize. Tools help track viewpoints, consensus, and remaining disagreements.
- Why it exists: Hard problems need many eyes, but coordination is hard without structure.
- Example: A roundtable system gathers 5 AI proposals, 3 human critiques, and produces a merged plan with who-does-what.
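A bare-bones record for such a roundtable might look like the sketch below; Roundtable and merge_plan are hypothetical names, and a real facilitator (human or model) would do the actual merging.

```python
from dataclasses import dataclass, field

@dataclass
class Roundtable:
    proposals: list = field(default_factory=list)    # (author, idea)
    critiques: list = field(default_factory=list)    # (author, target_idx, comment)
    merged_plan: list = field(default_factory=list)  # (task, owner)

def merge_plan(rt: Roundtable) -> Roundtable:
    # placeholder: combine surviving proposals into tasks and assign owners
    rt.merged_plan = [(idea, author) for author, idea in rt.proposals]
    return rt

rt = Roundtable()
rt.proposals += [("ai_1", "context-aware refusal"),
                 ("ai_2", "explanation-first responses")]
rt.critiques.append(("human_1", 0, "needs a multilingual test set"))
print(merge_plan(rt).merged_plan)  # the who-does-what output of the session
```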
The secret sauce:
- Train on collaboration itself: Build datasets where the “right answer” includes a good plan, a solid critique, or a helpful ablation—not only a final number.
- Evaluate processes, not just products: Score ideation diversity, plan coherence, error-analysis quality, and safety adherence (one way to score diversity is sketched after this list).
- Keep bidirectional learning: Humans gain skills (better prompts, sharper analysis) and AIs gain better priors about how to assist; both improve together.
- Use managed openness: Share artifacts so others can reproduce, catch issues, and extend the work—accelerating safe progress for all.
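As one possible process metric (our assumption, not a prescribed formula), ideation diversity can be scored as the average pairwise dissimilarity between ideas; the sketch uses simple word overlap, though embeddings or human ratings would also work.

```python
from itertools import combinations

def dissimilarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(wa & wb) / len(wa | wb)   # 1 minus Jaccard overlap

def ideation_diversity(ideas):
    pairs = list(combinations(ideas, 2))
    return sum(dissimilarity(a, b) for a, b in pairs) / max(len(pairs), 1)

ideas = [
    "context-aware refusal policies",
    "explanation-first responses before refusing",
    "role-based self-check of the model's own answer",
]
print(f"diversity score: {ideation_diversity(ideas):.2f}")  # higher = less redundant
```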
04 Experiments & Results
The paper is a position piece, so it does not report new numbers. But it clearly lays out what the right tests should look like and why they matter.
The test: What should we measure?
- Collaborative problem identification: Can AI surface relevant failures, prior art, and promising directions that humans judge as novel and useful?
- Benchmark building: Can AI help define fair metrics, generate hard-yet-valid test cases, and spot leakage or bias?
- Method ideation: Does AI produce diverse, non-duplicative ideas that survive human triage and lead to working prototypes?
- Experiment design: Are plans concrete, measurable, and pre-registered with reasonable compute budgets?
- Execution quality: Does AI produce correct, efficient, and reproducible code and configs with minimal human fixes?
- Evaluation and error analysis: Does AI cluster failures meaningfully, propose hypotheses, and suggest targeted next steps?
- Safety co-design: Can AI help craft constitutions/policies and red-team tasks that actually reduce harmful behaviors without killing helpfulness?
- Collective research: In multi-human+multi-AI settings, do structured debates and syntheses speed up reaching a strong, testable consensus?
The competition: What baselines should we compare against?
- Human-only teams: Skilled researchers without AI support.
- Autonomous agents: AI systems that write papers or run experiments mostly alone, with minimal human steering.
- General-purpose LLM assistants: Strong models not specially trained for research collaboration.
The scoreboard (with context, not fabricated numbers):
- If co-improvement works, we expect human+AI teams to ship higher-quality experiments faster than human-only teams, and to be more trustworthy and steerable than fully autonomous agents. Think of it as getting an A when others get B's, not just in raw task accuracy but in plan clarity, error-analysis depth, and safety adherence.
- On ideation diversity, cooperative AI should produce broader, less redundant idea sets compared to single agents—like a brainstorming room with many fresh angles instead of echoing the same thought.
- On benchmark quality, co-designed tests should catch more subtle failure modes and reduce overfitting—similar to teachers writing better exams that reveal true understanding rather than memorization.
- On safety metrics, co-designed policies and red-teaming should lower harmful behaviors while keeping helpfulness high—like improving sports defense without making the team stop scoring.
Surprising or likely findings (based on prior trends):
- Process beats product: Teams that invest in plan quality and error analysis often unlock bigger downstream gains than teams that rush to final results.
- Diversity matters: In multi-agent ideation, a small nudge for diversity can outpace a single “smartest” agent, echoing findings that idea variety fuels breakthroughs.
- Real-world integration feeds discovery: Deploying in controlled settings reveals edge cases that become new, high-impact research directions—a virtuous loop.
- Safety synergies: When safety is woven into every stage (not just a final filter), capability can improve too; clearer goals and feedback reduce wasted learning.
How to run these tests today:
- Adapt existing agent benchmarks (e.g., code, research replication) to score collaboration skills like plan quality and ablation rigor.
- Build human-in-the-loop leaderboards where researchers rate usefulness, novelty, and safety of AI contributions (a minimal aggregation sketch follows this list).
- Track data cards, reproducibility, and open artifacts as first-class metrics, not just add-ons.
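A human-in-the-loop leaderboard can be as simple as averaging reviewer ratings per dimension and ranking systems by the result; the sketch below assumes a flat list of (system, dimension, score) ratings and invented system names.

```python
from collections import defaultdict

ratings = [
    # (system, dimension, score 1-5) as entered by human reviewers
    ("copilot_a", "usefulness", 4), ("copilot_a", "novelty", 3), ("copilot_a", "safety", 5),
    ("copilot_b", "usefulness", 5), ("copilot_b", "novelty", 4), ("copilot_b", "safety", 4),
]

totals = defaultdict(lambda: defaultdict(list))
for system, dim, score in ratings:
    totals[system][dim].append(score)

leaderboard = {
    system: {dim: sum(scores) / len(scores) for dim, scores in dims.items()}
    for system, dims in totals.items()
}
for system, dims in sorted(leaderboard.items(),
                           key=lambda kv: -sum(kv[1].values())):
    print(system, dims)  # rank by combined average across dimensions
```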
In summary, while the paper reports no new experimental numbers, it lays out a concrete, measurable agenda. Success looks like human+AI teams that plan better, learn faster, and stay safer than either humans alone or autonomous agents racing ahead without supervision.
05 Discussion & Limitations
Limitations:
- Position, not proof: The paper argues for a direction but doesn’t include new experiments to validate specific gains yet.
- Benchmarking challenge: Measuring collaboration quality (e.g., idea novelty, plan coherence, error-analysis depth) is subtle and may be noisy or subjective.
- Human factors: Power dynamics, biases, or over-reliance on AI could skew choices. Good collaboration design needs careful UX and governance.
- Scalability tension: Human-in-the-loop can feel slower than full autonomy; we need smart tooling to keep throughput high without losing oversight.
- Safety scope: Co-improvement reduces some risks but doesn’t erase all (e.g., misuse by bad actors), so policy and access controls remain important.
Required resources:
- Strong base models capable of research tasks (reading papers, coding, analysis).
- New datasets that capture collaboration processes (plans, critiques, ablations) and not just final answers.
- Human expertise for evaluation, safety review, and alignment of goals.
- Infrastructure for reproducibility (versioning, eval dashboards) and managed openness (artifact sharing with guardrails).
When not to use:
- High-speed, low-stakes tasks where human oversight adds little value and risks are minimal—simple automation might suffice.
- Extreme safety-critical, real-time scenarios where any uncertainty is unacceptable (e.g., autonomous lethal force); broader governance and specialized controls are required beyond research collaboration.
- Contexts lacking qualified human partners; co-improvement shines when humans can actually steer and evaluate.
Open questions:
- Best metrics: What standardized, reliable metrics should score ideation diversity, plan quality, and error analysis across domains?
- Training recipes: Which data and loss signals most effectively teach helpful critique, planning, and consensus building?
- Multi-party dynamics: How should we orchestrate many humans and AIs so that debate leads to synthesis, not chaos?
- Safety generalization: How well do co-designed safety methods transfer to new tasks, languages, and cultures?
- Governance and openness: What levels of openness maximize learning and trust while minimizing misuse risks as capabilities grow?
Honest take: Co-improvement is not a magic shield, but it is a practical, near-term way to speed progress while keeping humans in charge of the goals and guardrails. The big bet is that teaching the partnership itself—how we think and build together—will pay off more than chasing solo autonomy.
06 Conclusion & Future Work
Three-sentence summary: This paper argues that the fastest and safest path to very powerful AI is not to remove humans from the loop, but to explicitly train and evaluate AI as a collaborative research partner across the full scientific process. By doing so, humans and AIs co-improve together—finding better ideas faster, catching problems earlier, and keeping progress aligned with human values. The destination is co-superintelligence: a world where people and AIs are smarter together, with humans still steering.
Main achievement: Reframing the goal from self-improvement to co-improvement, and laying out concrete collaboration skills and benchmarks (problem finding, benchmarking, ideation, experiment design, execution, analysis, safety, systems, integration, communication, and group synthesis) as first-class research targets.
Future directions: Build and share collaboration-focused datasets and leaderboards; run head-to-head studies of human-only vs co-improvement vs autonomous agents; advance multi-human+multi-AI debate and synthesis; and integrate managed openness so results are reproducible and socially accountable. Keep safety woven through every stage, not bolted on at the end.
Why remember this: Because it centers humans—in speed, safety, and benefit—while still aiming high. It’s a roadmap to get powerful AI and keep it pointed at what people actually need, turning AI into the best lab partner humanity has ever had.
Practical Applications
- Create a benchmark where AI must co-write experiment plans with humans and is scored on clarity, testability, and safety.
- Build datasets of human-AI critiques and ablations so models learn to spot weaknesses and propose targeted fixes.
- Deploy a lab copilot that drafts reproducible code, configs, and data cards, with human approval gates for sensitive steps.
- Run multi-human+multi-AI ideation rooms that reward diversity and synthesis, not just the number of ideas.
- Integrate safety constitutions and red-team tasks into every training run, tracking both helpfulness and harm reduction.
- Adopt managed openness: publish reproducible artifacts (with guardrails) so other teams can verify and extend results.
- Pilot co-improvement in classrooms: AI helps teachers design lessons and assessments; teachers correct and guide the AI.
- Use co-designed evaluation dashboards that show not only final scores but plan quality, error clusters, and policy compliance.
- Set up real-world feedback loops (with privacy) so user edge cases become new benchmarks and training data.
- Establish governance check-ins where humans review AI-proposed research directions for ethics, safety, and societal value.