
Training AI Co-Scientists Using Rubric Rewards

Intermediate
Shashwat Goel, Rishi Hazra, Dulhan Jayalath et al. · 12/29/2025
arXiv · PDF

Key Summary

  • The paper teaches AI to write strong research plans by letting it grade its own work using checklists (rubrics) pulled from real scientific papers.
  • A frozen copy of the AI acts like a teacher with an answer key: it sees special rubrics and scores plans, while the plan-writing AI does not see those rubrics.
  • This creates a helpful gap between the writer and the grader, so the AI learns to satisfy real requirements instead of guessing.
  • They train with reinforcement learning (learning-by-reward) and control length so plans are complete but not super long.
  • Experts preferred plans from the trained model 70% of the time on machine learning research goals.
  • 84% of the automatically extracted rubric items were judged necessary by human experts.
  • The method works beyond ML: it improved plans on medical papers and new arXiv preprints by a relative 12–22%.
  • Supervised fine-tuning (just copying reference solutions) actually made plans worse for this open-ended task.
  • The approach scales because it uses published papers to make goals and rubrics automatically, needing far less human effort.
  • This is a practical step toward general-purpose AI co-scientists that help humans plan better research.

Why This Research Matters

This work turns the world’s scientific papers into a training gym for planning, not just for memorizing facts. By grading plans against goal-specific checklists, AI learns what experts actually need, saving time, money, and effort. It helps teams avoid risky or wasteful steps by checking ethics, feasibility, and clarity before experiments begin. The method scales across fields—even where running tests is slow or impossible—so it can boost progress in areas like medicine and economics. As these co-scientists improve, students and researchers can brainstorm more rigorously and move from ideas to action faster.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re planning a school science fair project. You have a big idea, a deadline, a budget, and safety rules. Writing a good plan means listing steps, tools, tests, and how you’ll decide if it worked.

🥬 The World Before: For years, language models were good at clear-cut things—solving math problems, writing short code, or finding facts—because those answers are easy to check. But real science isn’t just one answer; it’s choosing a smart plan before you even run experiments. That’s much harder to judge quickly, because you can’t always press a button and see if the plan was correct.

  • What AI could do: crunch numbers, look up facts, write small programs, summarize papers.
  • What AI struggled with: creating research plans that follow many constraints (resources, ethics, methods) and also include the hidden must-haves experts expect.

🍞 Anchor: It’s like AI could solve math worksheets, but got stuck when asked to design the whole science project.

🍞 Hook: You know how planning a trip means more than picking a city? You must choose flights, hotels, costs, and a backup plan if something goes wrong.

🥬 The Problem: Scientists need AI “co-scientists” that can propose plans matching a goal with real constraints. But checking whether a plan is good by actually running all the experiments is slow, costly, and sometimes impossible (especially in medicine). Without quick, reliable feedback, training the AI to improve is tough.

  • If we try to judge plans by executing them, it takes too long and can be unsafe.
  • If we try to judge plans with generic rules, the AI may miss the special, goal-specific details that really matter.

🍞 Anchor: It’s like teaching travel planning by making the student actually take every trip. That’s too slow and expensive.

🍞 Hook: Think of a teacher’s grading checklist for a project presentation—clear intro, correct facts, good visuals, and time management.

🥬 Failed Attempts: Prior approaches built special digital sandboxes where AI could try many times and get exact scores (great for narrow tasks like games or protein folding). But most science topics don’t fit neatly into a simulator. Another idea was to fine-tune AI to imitate reference plans from papers. That looked promising but backfired: the AI copied style more than substance and missed key requirements.

  • End-to-end simulators: powerful but not general; many fields (like medicine) can’t be simulated faithfully.
  • Supervised fine-tuning (copying): made outputs look right but often failed to meet explicit requirements.

🍞 Anchor: Copying someone’s homework handwriting doesn’t mean you solved the problems.

🍞 Hook: Imagine you had a giant library of past science projects and each came with a custom “what a good plan must include” checklist.

🥬 The Gap: We lacked a scalable way to give the AI specific, reliable feedback without running experiments or hiring lots of experts. The missing ingredient was goal-specific checklists (rubrics) plus a way for AI to use them for learning, at scale.

🍞 Anchor: If every past project gave you a tailored grading sheet, you could learn what makes a great plan for many different kinds of projects.

🍞 Hook: Think of saving time and money by planning well before you build.

🥬 Real Stakes: Better research plans mean fewer wasted experiments, safer choices in sensitive areas (like medicine), and faster progress on new ideas. Students and scientists can brainstorm smarter, spot weaknesses early, and coordinate efforts more efficiently.

  • Saves resources: fewer dead-end trials.
  • Improves safety and ethics: fewer risky or impractical steps.
  • Boosts learning: clearer steps for junior researchers.

🍞 Anchor: A solid plan is like a good recipe—you shop once, cook confidently, and eat well, instead of making five messy, costly tries.

02Core Idea

🍞 Hook: You know how a spelling test is easier to grade with an answer key? The teacher sees the key, but the student doesn’t during the test.

🥬 The “Aha!” Moment (one sentence): Train an AI to write research plans by letting a frozen copy of itself grade those plans with goal-specific rubrics extracted from real papers—so the grader has privileged checklists while the writer learns to meet them.

  • Why this matters: It’s much easier to check a plan against a checklist than to actually run the plan in the real world.
  • Key pieces: automatic rubric extraction, self-grading with privileged information, and reinforcement learning that rewards satisfying rubric items.

🍞 Anchor: Like a coach with scorecards improving a gymnast’s routine without entering a full competition every time.

Multiple analogies:

  1. Teacher & Answer Key: The grader has the answer key (rubric), the student (plan-writer) doesn’t. The student learns from scores, not from seeing the key.
  2. Baking Judge: A judge knows the secret checklist (proper bake, crumb, flavor balance). The baker only sees feedback like “underdone center,” then improves the next cake.
  3. Airport Safety Checks: Inspectors have a full checklist. Passengers don’t see the list, but flights get safer because airplanes are repeatedly checked against it.

Before vs After:

  • Before: AI produced plans that were often vague, missed constraints, or over-focused on style.
  • After: AI writes plans that better satisfy specific, mission-critical requirements and hidden must-haves from the paper’s context.

Why it works (intuition, no equations):

  • Verification is easier than generation: Checking if a plan includes required parts is simpler than inventing the plan.
  • Privileged information: The grader sees the rubric; the writer doesn’t. This creates a constructive challenge that pushes the writer to truly cover requirements.
  • Structured feedback: Scoring per rubric item (with noted guideline violations) gives clear signals about where to improve.

Building blocks (each explained with the Sandwich pattern; a minimal data sketch follows this list):

  1. 🍞 AI Co-Scientist 🥬 What: An AI helper that suggests research plans for human scientists. How: Reads a goal with constraints, proposes steps, methods, evaluations, and safeguards. Why: Without it, humans spend more time drafting, miss blind spots, and repeat effort. 🍞 Example: A biology lab asks for a plan to test a new therapy’s safety markers; the AI outlines experiments, controls, and analysis.

  2. 🍞 Research Plan Generation 🥬 What: The AI’s skill of writing a step-by-step blueprint for a study. How: Parse the goal → list hypotheses → design experiments → choose metrics → plan analyses → check ethics and feasibility. Why: Without a plan, teams waste time, money, and may get unreliable conclusions. 🍞 Example: Planning a fairness study for an ML classifier with datasets, metrics, baselines, and ablations.

  3. 🍞 Goal-Specific Grading Rubrics 🥬 What: A custom checklist of must-haves for a particular research goal. How: Extract from a paper’s context: constraints, evaluation needs, safety checks, and core steps; keep 10 high-quality items. Why: Without them, grading is vague and the AI can’t learn what truly matters for this goal. 🍞 Example: “Plan must include robustness tests against data drift” for a medical imaging study.

  4. 🍞 General Guidelines 🥬 What: Seven broad quality rules (e.g., detailed, no overlooked flaws, cost-efficient). How: For each rubric item, the grader checks which guidelines are violated. Why: Without them, the plan might technically tick boxes but still be sloppy, risky, or wasteful. 🍞 Example: A plan that says “use more data” fails the “detailed, specific solution” rule without explaining how.

  5. 🍞 Self-Grading 🥬 What: A frozen copy of the initial model grades the new plans using rubrics. How: Grade each rubric item, list violated guidelines, tally satisfied items into a score. Why: Without it, we’d need lots of humans or slow real-world runs to provide feedback. 🍞 Example: The grader marks “no overlooked flaws” violated if confounders weren’t addressed.

  6. 🍞 Generator–Verifier Gap (Privileged Information) 🥬 What: The grader sees rubrics; the generator doesn’t. How: This makes verifying easier than generating, so even a modest grader helps the generator improve. Why: Without the gap, the generator could just parrot the rubric or overfit to it. 🍞 Example: The writer learns to cover typical must-haves because missing them lowers scores.

  7. 🍞 Reinforcement Learning (RL) with Rubric Rewards 🥬 What: A learning loop where higher rubric satisfaction earns higher reward. How: Generate multiple plans → grade each → nudge the model toward better-scored plans. Why: Without rewards tied to real requirements, the model won’t aim for what experts value. 🍞 Example: Plans that add rigorous baselines and ablations receive higher rewards and become more common.

  8. 🍞 Group Relative Policy Optimization (GRPO) 🥬 What: An RL method that compares a group of outputs for the same input and learns from their relative rewards (no separate value model). How: Sample several plans per goal, normalize rewards within the group, and update toward higher-reward ones. Why: Without stable updates, training can wobble or require extra networks. 🍞 Example: From eight candidate plans, the two best shape the next step more than the rest.

  9. 🍞 Length Control Strategy 🥬 What: Unlimited private thinking, but the final written plan must fit a strict word limit. How: Penalize outputs exceeding the limit; encourage concise completeness. Why: Without it, models bloat text to hit more rubric items and trick lenient graders. 🍞 Example: A 700-word cap forces the plan to be crisp: setup, measures, baselines, safety checks, and budget notes.

  10. 🍞 Reference Solution & Filtering 🥬 What: Create a high-quality reference plan (with paper context) for the grader’s calibration and filter candidate rubrics/goals for quality. How: Another model drafts candidates; a selector filters to the best goal, best 10 rubric items, and a solid reference. Why: Without strong data curation, the training signal gets noisy. 🍞 Example: From three candidate checklists, keep the one that best captures must-haves and avoids duplicates.
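
To make these building blocks concrete, here is a minimal sketch of what one curated training sample might look like after extraction and filtering. It is an illustrative assumption, not the paper's actual data format: the field names, the example goal, the rubric wording, and the default word limit are placeholders.

```python
from dataclasses import dataclass

# Illustrative schema only -- field names, example goal, and word limit are
# assumptions for this sketch, not the paper's actual data format.

GENERAL_GUIDELINES = [
    "detailed, specific solution",
    "no overlooked flaws",
    "cost-efficient",
    # ... the paper uses seven such guidelines in total
]

@dataclass
class TrainingSample:
    goal: str                   # open-ended research goal shown to the plan-writer
    rubric_items: list[str]     # ~10 goal-specific must-haves, hidden from the writer
    reference_solution: str     # strong plan written with full paper context (grader calibration)
    word_limit: int = 750       # cap on the final written plan (example value)

sample = TrainingSample(
    goal="Keep tool documentation accurate and LLM-friendly as tools evolve.",
    rubric_items=[
        "Doc updates are fully automated from interaction logs",
        "Updates are iterative, driven by observed failures",
        "Plan includes stopping criteria to avoid overfitting to one tool version",
        # ... up to 10 items kept after the filtering step
    ],
    reference_solution="(reference plan drafted with access to the full paper)",
)
```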

03Methodology

At a high level: Paper → [Extract insights] → [Make research goals + rubrics + reference] → [Filter best sample] → Input goal to plan-writer → [Generate multiple plans] → Grader with privileged rubrics scores each → RL update with GRPO → Output a concise, high-quality plan.
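
Read as code, this flow is a short loop: build samples offline from papers, then repeatedly generate, grade, and update. The sketch below is a schematic paraphrase under assumptions, not the authors' implementation; the callables it takes (generate_plans, grade_plan, grpo_update) are hypothetical stand-ins for LLM calls and an RL-library update step.

```python
import random

# Schematic training loop; the callables passed in are hypothetical stand-ins,
# not APIs from the paper or any specific library.

def train_co_scientist(dataset, policy, generate_plans, grade_plan, grpo_update,
                       steps=1000, group_size=8):
    """dataset: curated (goal, rubrics, reference) samples built offline from papers."""
    for _ in range(steps):
        sample = random.choice(dataset)

        # The plan-writer sees only the goal -- never the privileged rubrics.
        plans = generate_plans(policy, sample.goal, n=group_size)

        # A frozen copy of the initial model grades each plan against the
        # goal-specific rubrics plus the general guidelines.
        rewards = [grade_plan(plan, sample) for plan in plans]

        # GRPO-style update: compare plans within the group and push the policy
        # toward the higher-reward ones.
        policy = grpo_update(policy, sample.goal, plans, rewards)

    return policy
```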

Step-by-step (with Sandwich explanations for each key concept; a code sketch of the grading-and-update loop follows these steps):

  1. Extracting ideas from papers (Insights → Goals) 🍞 Hook: Imagine mining a cookbook for the core cooking tricks used in each recipe. 🥬 What happens: A “sample creator” model reads a paper and pulls out up to three key insights—the clever ideas that made the study work. For each insight, it writes a research goal that feels like what the original authors faced before they ran experiments, including constraints and uncertainties. Why this step exists: Without realistic, well-posed goals, the AI would practice on toy problems and learn bad habits. 🍞 Example: From a tool-using LLM paper, the goal might be “Design a system that keeps tool documentation accurate and LLM-friendly as tools evolve.”

  2. Building goal-specific rubrics (the must-have checklist) 🍞 Hook: Think of a teacher designing a rubric for a very specific science fair topic. 🥬 What happens: For each goal, the creator proposes ~15 rubric items that check essential features of a good plan (no trivialities, no unpredictable outcomes). Later, a selector trims these to the best 10. Why this step exists: Without tailor-made criteria, grading becomes too fuzzy to guide learning. 🍞 Example: “Plan must prevent overfitting to a single tool version,” or “Include exploration diversity across tool usage scenarios.”

  3. Writing a reference solution (for calibration) 🍞 Hook: You know how having an example essay helps a grader understand what “good” looks like? 🥬 What happens: With access to the full paper and the preliminary rubric, the creator writes a reference plan. This creates a creator–solver gap: the creator sees more context than the plan-writer will see during training. Why this step exists: The grader calibrates against a strong example, making its judgments stricter and more consistent. 🍞 Example: A reference plan that clearly shows iterative updates to tool docs using logged failures, with safeguards and metrics.

  4. Filtering for quality (keeping the best sample per paper) 🍞 Hook: Like choosing the best of three drafts before turning in your essay. 🥬 What happens: A “sample selector” model scores goals, rubrics, and the reference solution against guidelines. It keeps the most faithful, diverse, and useful set and picks the top 10 rubric items. Why this step exists: Without filtering, noisy goals or redundant rubrics weaken training. 🍞 Example: If two rubric items overlap, keep the clearer, more general one.

  5. Grading with privileged information (rubrics + general guidelines) 🍞 Hook: Picture a judge with a secret checklist and a set of universal safety rules. 🥬 What happens: During training, a frozen copy of the initial model acts as the grader. The grader sees both the goal-specific rubrics and seven general guidelines (like “be specific,” “no overlooked flaws,” “cost-efficient”). For each rubric item, it identifies which guidelines (if any) were violated in the relevant part of the plan. An item is counted as satisfied only if no guideline is violated. The plan’s score is the fraction of items satisfied. Why this step exists: Without structured, grounded grading, scores would drift or reward fluff. 🍞 Example: If a plan claims “evaluate robustness” but gives no concrete tests or metrics, it violates “detailed, specific solution,” so that rubric item is not satisfied.

  6. Reinforcement learning with GRPO (learning-by-reward) 🍞 Hook: Think of trying several approaches to a puzzle and keeping the ones that scored higher. 🥬 What happens: For each goal, the model generates a group of candidate plans. The grader scores each. GRPO updates the model to make higher-scoring plans more likely, normalizing scores within the group so we don’t need a separate value model. Why this step exists: Without a stable, scalable update rule, the model wouldn’t steadily improve. 🍞 Example: Among eight versions, the two most thorough plans (clear baselines, strong metrics, safety checks) guide the next learning step.

  7. Length control strategy (concise final outputs) 🍞 Hook: You can think as long as you want, but your presentation must fit on one slide. 🥬 What happens: The model may reason privately with many tokens, but must put the final plan inside <solution> tags and under a strict word cap. Exceeding the cap earns a penalty. Why this step exists: Without it, the model would get verbose to hit more rubric items and game the grader. 🍞 Example: The plan must cover setup, data, baselines, ablations, risks, and metrics in ~750 words.

  8. Secret sauce: The generator–verifier gap 🍞 Hook: A quiz is fairer if the teacher has the key but the student does not. 🥬 What happens: The grader knows the rubrics; the plan-writer doesn’t. Verifying is easier than generating, so even a “weaker” grader still produces directional signals that teach the writer to include critical elements. Why this step exists: Without the gap, the writer could memorize the rubric instead of learning to plan well. 🍞 Example: Over time, the writer starts adding the kinds of checks experts expect (e.g., confounders, baselines) to avoid losing points.

  9. Guarding against over-optimization (watching a stronger judge) 🍞 Hook: If you practice only with a lenient referee, you might pick up bad habits. 🥬 What happens: The team monitors progress with stronger external judges (a jury of frontier models) to spot when the self-grader is being “fooled” by verbosity or unnecessary complexity. Why this step exists: Without this, the model could learn to please its own grader instead of learning what humans prefer. 🍞 Example: If the local grader likes long plans, but stronger judges (and humans) penalize bloat, training is stopped at the best checkpoint before bloat harms quality.

  10. Putting it all together (input → output)

  • Input: An open-ended research goal, like “Devise a method to keep tool documentation LLM-friendly as tools evolve.”
  • Steps: Generate multiple plans → grade each against rubrics + guidelines → reward = fraction of items satisfied minus format penalty → GRPO update → enforce length limit.
  • Output: A crisp, feasible plan that addresses explicit constraints and the important implicit requirements.
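
The sketch promised above shows how steps 5–7 combine into a single scalar reward and a group-relative learning signal, reusing the illustrative TrainingSample fields from the earlier schema. It is a reconstruction under assumptions: judge_item stands in for a call to the frozen grader, and the penalty size and normalization details are placeholders rather than the paper's exact choices.

```python
import statistics

def grade_plan(plan_text, sample, judge_item, general_guidelines):
    """Reward = fraction of rubric items satisfied, minus a penalty if the plan
    exceeds the word cap. `judge_item` is a hypothetical wrapper around the frozen
    grader: given one rubric item, it returns the list of guidelines it violates."""
    satisfied = 0
    for item in sample.rubric_items:
        violations = judge_item(plan_text, item, general_guidelines)
        if not violations:                          # an item counts only with zero violations
            satisfied += 1
    reward = satisfied / len(sample.rubric_items)

    # Length control: unlimited private reasoning, but the final plan must fit the cap.
    if len(plan_text.split()) > sample.word_limit:
        reward -= 0.5                               # illustrative penalty size, not the paper's
    return reward

def group_relative_advantages(rewards):
    """GRPO's core trick: normalize rewards within the group of plans for one goal,
    so no separate value model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0         # guard against identical rewards
    return [(r - mean) / std for r in rewards]
```

In the training-loop sketch earlier, grade_plan would be passed in with judge_item and the guidelines already bound (for example via functools.partial), so the loop only ever calls grade_plan(plan, sample).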

Concrete mini-example (data point from the paper):

  • Goal: Improve LLM tool-use by dynamically refining tool docs from interaction logs.
  • Rubric items include: full automation, iterative updates from feedback, handling inaccuracies in docs, scalability, exploration diversity, stopping criteria to avoid overfitting, handling complex parameter ranges, and cross-model generalization.
  • A strong plan would: propose automated log mining, pattern extraction of failures, doc updates with versioning, safety checks, A/B tests against baselines, cost controls, and generalization tests across LLMs.
  • The grader: marks which items passed with no guideline violations; that fraction becomes the reward used to improve the next round.
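
For illustration only, the bookkeeping for one candidate plan on this goal might look like the toy record below; the items paraphrase the rubric above, and the pass/fail verdicts are invented to show the mechanics, not results from the paper.

```python
# Toy grading record for one candidate plan on the tool-documentation goal.
# The verdicts are invented for illustration; only the structure matters.
grading = [
    {"item": "Fully automated doc updates",             "violated": [],                              "satisfied": True},
    {"item": "Iterative updates from interaction logs", "violated": [],                              "satisfied": True},
    {"item": "Stopping criteria to avoid overfitting",  "violated": ["detailed, specific solution"], "satisfied": False},
    {"item": "Cross-model generalization tests",        "violated": ["no overlooked flaws"],         "satisfied": False},
]
reward = sum(g["satisfied"] for g in grading) / len(grading)   # 0.5 in this toy record
```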

04Experiments & Results

🍞 Hook: Imagine a school does a bake-off: students try recipes, judges use a checklist, and we see which training method makes better bakers.

🥬 The Tests (what and why):

  1. Human expert study (ML domain): 25 experts compared pairs of plans (before vs after training) for 100 recent ML research goals. Why: humans are the ultimate audience of these plans.
  2. Automated rubric grading (jury of stronger models): For ML, medical, and arXiv goals, three frontier models graded plans using the same structure. Why: fast, broad signals across domains.
  3. Ablations: Tested pieces like supervised fine-tuning (SFT) vs RL, specific vs generic rubrics, KL penalty on/off, and grader strength. Why: discover what truly drives gains.

🍞 Anchor: It’s like judges (humans and top chefs) scoring cakes by a rubric to see which baking class works best.

The Competition (who vs who):

  • Baseline: Qwen3-30B-A3B-Instruct (initial model).
  • Our models: Same backbone, trained with rubric-reward RL per domain (ML, medical, arXiv).
  • Others: Open and closed frontier models (e.g., Grok-4-Thinking, GPT-5-Thinking) for context.

Scoreboard with context:

  • Human preferences (ML): Experts preferred the trained model’s plans 70% of the time (p < 0.0001). Average score rose from about 7.31/10 to 7.89/10. That’s like going from a solid B to a strong B+/A- on challenging, open-ended assignments.
  • Rubric item satisfaction (humans grading items): Trained model satisfied ~79.8% vs ~73.8% for baseline—humans were somewhat lenient but still saw clear gains.
  • Automated grading (jury): Across ML, medical, and arXiv goals, the trained models showed 12–22% relative improvements. This is like improving your win rate from mid-table to competing with strong performers.
  • Cross-domain generalization: A model trained on medical goals also improved on ML and arXiv goals, sometimes beating a domain-matched finetune—evidence that it learned general planning skills, not just domain trivia.
  • Frontier comparison: Our 30B trained model became competitive with some larger models (e.g., Grok-4-Thinking) but still trailed GPT-5-Thinking, which topped the charts.

Surprising or notable findings:

  • SFT hurts: Simply copying reference solutions degraded plan quality for this task, even though validation loss looked good. Style isn’t substance.
  • Verbosity trap: Without length control, the model gets long to score higher. The word cap keeps quality from being confused with quantity.
  • Over-optimization warning: The self-grader sometimes rewarded details that humans didn’t value (e.g., extra complexity). Tracking a stronger judge helped pick the best checkpoint before that drift.
  • Rubrics matter: Both specific (goal-tied) and generic (7 guidelines) checks were necessary. Dropping either reduced gains.
  • Grader strength helps: Using a stronger frozen grader (30B vs 4B) produced better training signals.
  • Rubric quality: Humans rated 84% of rubric items as necessary for a good plan—validating the automatic extraction pipeline.

Plain-language meaning:

  • The trained AI writes plans that are more likely to meet the real must-haves experts care about.
  • It gets better not by running experiments, but by practicing against smart, customized checklists pulled from real papers.
  • The skills transfer: learning how to plan well in one area helps in others, too.

🍞 Anchor: After training, our “student baker” makes cakes that judges and guests prefer more often—and not just for chocolate cakes (ML) but also fruit tarts (medical) and new recipes (arXiv).

05Discussion & Limitations

🍞 Hook: Even great plans can leave room for improvement, like a map that still needs local road updates.

🥬 Limitations (specific):

  • Human evaluation was done deeply only in ML; other domains used automated judges, which, while helpful, don’t perfectly match human consensus.
  • The self-grader sometimes rewards complexity and verbosity more than humans like; we mitigated this with a word cap and external judges, but it’s not perfect.
  • SFT failing here reminds us that imitating style is not enough for open-ended planning—so we rely on RL, which needs careful monitoring.
  • Resource needs: Training involved a 30B MoE model and multi-GPU nodes; smaller teams may find this heavy.

Required resources:

  • A capable base model (e.g., 4B–30B), GPUs (A100s for 30B runs), and access to strong judges (for development-time validation) or human experts for final checks.

When not to use:

  • If you have a faithful, fast simulator that gives exact execution feedback (e.g., some robotics or games), direct environment rewards may be better.
  • If the task must be ultra cost-minimal or extremely short outputs, rubric-RL overhead may be excessive.
  • If goals are trivial or fully objective (e.g., single-number answers), simpler training suffices.

Open questions:

  • Better graders: Can we align self-graders more tightly with human preferences, especially on cost/effort efficiency?
  • Richer feedback: Can we train using the structured guideline violations directly (not just a final score) for more sample-efficient learning?
  • Tool use: How does rubric-RL combine with agents that search, code, and run small tests during planning?
  • Human-in-the-loop: What’s the lightest, highest-impact way for humans to steer rubrics or reward shaping without high costs?
  • Evaluation: Can we design cheaper, quicker tests that predict human preference reliably across domains?

🍞 Anchor: Think of this as a strong first edition of a planning coach. It already helps a lot, and the next edition will learn to value simplicity and efficiency even more—just like a seasoned mentor would.

06Conclusion & Future Work

Three-sentence summary: This paper shows how to train an AI co-scientist to write better research plans by grading itself with goal-specific rubrics extracted from real papers. A frozen copy of the model acts as a rubric-aware grader, creating a generator–verifier gap that gives clear rewards for reinforcement learning, while a word cap keeps plans concise. Human experts and strong automated judges confirm meaningful gains across ML, medical, and arXiv goals, with notable cross-domain generalization.

Main achievement: A scalable, automated training recipe—rubric extraction + self-grading + RL (GRPO)—that reliably improves open-ended research plan quality without expensive, slow experiment execution.

Future directions: Build stronger graders aligned with humans, use structured feedback (which guidelines were violated) during learning, integrate tool use (search/code) into planning, and design faster, cheaper evaluations. Explore how rubric-RL can teach broader planning abilities in domains where ground-truth execution is slow or unsafe.

Why remember this: It turns the world’s scientific literature into a training ground for planning, not just for facts. By teaching AI to satisfy real checklists that experts care about, we move closer to trustworthy, general-purpose AI collaborators that help humans plan smarter, safer, and faster.

Practical Applications

  • Use rubric-trained AI to draft initial research plans that PhD students refine and execute.
  • Create tailored rubrics from new preprints to quickly assess competing project ideas in a lab meeting.
  • Run AI plan checks for ethics, feasibility, and cost-efficiency before applying for grants.
  • Automate baseline and ablation listings for ML experiments so nothing critical is forgotten.
  • In medical studies, preflight plans for safety signals and confounders where live trials are costly.
  • For industry R&D, stress-test plans against domain-specific constraints (compliance, latency, budget).
  • Teach research methods classes with AI that critiques student project plans using goal-specific rubrics.
  • Triage arXiv ideas: generate and grade plans to identify the most promising, well-structured directions.
  • Build lightweight internal rubrics for company projects and have the AI enforce them during planning.
  • Combine with tool-using agents so planning includes small executable probes (search/code) where safe.
#AI co-scientist · #research plan generation · #rubric rewards · #self-grading · #generator–verifier gap · #reinforcement learning · #GRPO · #privileged information · #automatic rubric extraction · #frontier model jury · #reward model overoptimization · #length control · #instance-specific rubrics · #cross-domain generalization · #supervised fine-tuning