RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
Key Summary
- RubricHub is a huge (about 110,000) collection of detailed grading guides (rubrics) for many kinds of questions like health, science, writing, and chat.
- The key idea is an automated coarse-to-fine process that starts broad and then sharpens the rules so they can tell great answers from merely good ones.
- It builds rubrics by looking at real example answers (response-grounded), following quality rules (principle-guided), and combining ideas from several strong models (multi-model aggregation).
- Then it makes the rubrics tougher in smart ways (difficulty evolution) so top models still get useful feedback instead of all getting near-perfect scores.
- These rubrics are used two ways: to pick the best training examples (RuFT) and to give step-by-step rewards during reinforcement learning (RuRL).
- With RubricHub, a mid-sized model (Qwen3-14B) beat larger or proprietary models on HealthBench (69.3 vs. 67.2 for GPT-5), showing the rubrics really help.
- The scoring stays stable and fair across models and domains, and even the best models do not max out, which means the rubrics remain challenging.
- Positive-only criteria worked better than mixing in negative penalties, and larger grader models agreed with humans more, improving reliability.
- Rubric-driven training costs extra compute and still needs better small graders, but it makes open-ended evaluation far more precise and scalable.
Why This Research Matters
Better rubrics mean clearer, fairer feedback for AI, which turns into safer and more helpful answers for people. In health, they push models to give plain-language guidance, list red flags, and encourage proper follow-up. In schools and workplaces, they help AI give focused writing feedback and follow tricky instructions exactly. Because the rubrics stay challenging, models don't plateau early; they keep learning real-world skills. This approach also cuts bias by combining several viewpoints and anchoring rules to real examples. Overall, it makes everyday AI more trustworthy across many tasks.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're judging a school talent show. If you only say "Good" or "Bad," it's hard to help performers improve. But if you have a checklist (voice, timing, creativity), you can give fair, helpful feedback.
The Concept (Rubrics): A rubric is a clear list of things to check when judging an answer. How it works: 1) List the important parts to look for, 2) Check each part one by one, 3) Add up the results to make a fair score. Why it matters: Without rubrics, judgments get messy and random; people (and AIs) don't know what to fix.
Anchor: A poem rubric might check imagery, rhythm, and emotion. If rhythm is missing, you can point it out and improve.
The World Before: Big AI models could do math and coding well when there was a single, checkable right answer (like "Does the code pass all tests?"). That's called verifiable rewards. But most real questions are open-ended, like "Explain this symptom kindly" or "Write a strong introduction." There isn't one perfect answer, so the AI can't easily be told exactly how well it did.
Hook: You know how in a game, you get points for specific actions (coins, time, secrets found)? It's easier to get better when your score reflects what you did.
The Concept (Reinforcement Learning with Verifiable Rewards, RLVR): This is a training style where the AI gets clear, checkable feedback. How it works: 1) Do a task, 2) Check if the outcome is provably correct, 3) Reward or adjust. Why it matters: It works great for math and code, but fails when there's no single truth to check.
Anchor: In coding, unit tests give pass/fail signals. In poetry, there's no unit test, so RLVR struggles.
The Problem: For open-ended tasks, people tried two main things. First, they asked a judge model to give a single overall score (like 1-10). That was unstable: scores bounced around and were biased. Second, they wrote rubrics by hand. But that took lots of expert time, covered few topics, and the rules were too broad. Many decent answers all scored similarly high. This "ceiling effect" meant the AI couldn't tell how to get from good to great.
Hook: Think of a spelling bee with only one rule: "Spell something." Almost everyone passes, and no one improves.
The Concept (Fine-grained Evaluation): This means breaking big goals into small, checkable parts. How it works: 1) Split quality into tiny pieces, 2) Check each piece, 3) Combine the checks for a precise score. Why it matters: Without fine-grained checks, many answers tie at the top, and training stalls.
Anchor: An essay graded on thesis clarity, evidence, structure, and tone gives pinpointed guidance.
Failed Attempts: Manually building lots of great rubrics is slow and expensive. Asking a single model to generate rubrics led to generic or biased criteria. And making criteria harsher without a plan often made them unfair or noisy.
The Gap: We needed rubrics that were 1) easy to create at scale, 2) matched to the actual question and example answers, 3) able to capture subtle differences between good and excellent, and 4) usable across many domains.
Hook: You know how a class project gets better when you combine ideas from several classmates and then raise the challenge as you learn?
The Concept (Coarse-to-Fine Rubric Generation): Start with simple, broad checks, then sharpen them step by step. How it works: 1) Generate initial criteria linked to real answers and quality principles, 2) Merge the best ideas from multiple models, 3) Evolve the criteria to be more discriminative so top answers can still be separated. Why it matters: Without this staged sharpening, rules stay generic or biased and quickly hit a ceiling.
Anchor: First you check "Is it about autumn?" Then you add "Does it show imagery instead of telling?" Then "Does it use a consistent rhythm?" Now you can tell excellent from outstanding.
Real Stakes: In health advice, a coarse rule like "Be accurate" isn't enough; we also need "Warn about red flags," "Use plain language," and "Say when to see a doctor." In writing, "Be creative" is too vague; we need "Show, don't tell," "Keep consistent tone," and "Follow requested format." Better rubrics help AIs learn safely, explain clearly, and serve people more reliably every day.
02 Core Idea
Hook: Imagine a teacher who first gives you a simple checklist, then asks three other teachers what they'd add, and finally looks at top student essays to craft tougher, smarter checks that separate great from excellent.
The Concept (Aha! in one sentence): RubricHub automates a coarse-to-fine pipeline that builds, merges, and then hardens rubrics so they're both comprehensive and highly discriminative for open-ended AI training.
How it works (big picture):
- Response-grounded and principle-guided generation: Create criteria by looking at actual example answers and quality principles so rules match the real task.
- Multi-model aggregation: Combine criteria from multiple strong models, keep genuine differences, remove duplicates, and reduce bias.
- Difficulty evolution: Compare top responses and add stricter criteria that tease apart excellent vs. exceptional.
Why it matters: Without this, rules are too generic, models all get similar high scores, and improvement stalls (the supervision ceiling). With it, feedback stays sharp, stable, and useful, even for top models.
Anchor: For a nature essay, you start with "talk about nature," then add "use vivid details," then sharpen to "include three concrete cause-effect examples and a clear thesis." Now the best essays stand out.
Multiple Analogies:
- Chef analogy: Start with a base recipe (coarse checks), taste dishes from different chefs (multi-model), then create a master recipe with advanced techniques (difficulty evolution) that rewards true culinary skill.
- Sports coaching: Begin with fundamentals (dribbling), learn from several coaches (aggregation), then add elite drills (evolution) to separate varsity from all-stars.
- Game levels: Level 1 ensures basics, Level 2 merges best strategies, Level 3 introduces expert-only challenges that demand precision.
Before vs. After:
- Before: Handwritten or single-model rubrics were narrow, biased, and often too soft. Many answers tied at the top.
- After: Automated, multi-model, and evolved rubrics stay relevant, fair, and tough, so models keep learning instead of plateauing.
Hook: You know how a fair judge must both know the rules and see what really happened on the field?
The Concept (Response-Grounded): Criteria are built by looking at real example answers, not just the question. How it works: 1) Read a sample answer, 2) Spot what quality looks like there, 3) Turn those into clear checks. Why it matters: Without this, rules drift into vague or irrelevant territory.
Anchor: If a sample medical reply carefully lists red flags, the rubric adds a check for explicit red-flag warnings.
Hook: Imagine classroom rules guided by school principles like clarity, fairness, and safety.
The Concept (Principle-Guided): Use meta-principles (consistency, coverage, clarity, evaluability) to shape criteria. How it works: 1) Check the rubric against these principles, 2) Fix overlaps and vagueness, 3) Ensure each check is binary and testable. Why it matters: Without principles, rubrics get messy or contradictory.
Anchor: "Avoid vague words" keeps "be better" out and invites "state three causes with evidence."
Hook: A team of judges is fairer than just one.
The Concept (Multi-Model Aggregation): Combine candidate rubrics from different strong models to capture diverse viewpoints. How it works: 1) Gather all criteria, 2) Merge only truly identical ones, 3) Keep differences to cover breadth. Why it matters: Without many perspectives, rubrics inherit one model's blind spots.
Anchor: "Check grammar" and "Check spelling" stay separate; "Power on" and "Device powered up" merge.
Hook: When you master basics, your teacher raises the bar.
The Concept (Difficulty Evolution): Make stricter, sharper rules by studying top answers and finding subtle edges. How it works: 1) Compare two excellent answers, 2) Spot the nuance that makes one stronger, 3) Add a new criterion to capture that nuance. Why it matters: Without evolution, top answers all score the same, and learning stalls.
Anchor: "Include at least three cause-effect examples" separates vivid essays from merely clear ones.
Building Blocks:
- Coarse-to-Fine Rubric Generation (the pipeline)
- Response-Grounded & Principle-Guided Synthesis (relevance + cleanliness)
- Multi-Model Aggregation (coverage + reduced bias)
- Difficulty Evolution (discriminability + continued learning)
- RuFT (Rubric-based data curation)
- RuRL (Rubric-based reinforcement rewards)
Why it works (intuition): Each stage solves a different failure mode (drift, bias, saturation), so together the pipeline delivers rules that are relevant, broad, and challenging. That steady, precise feedback is exactly what models need to keep getting better.
03 Methodology
At a high level: Input (many open-ended questions) → Stage 1: Generate candidate rubrics from real answers and principles → Stage 2: Aggregate across models to form a base rubric → Stage 3: Evolve difficulty using top answers → Output: Final rubric per question (RubricHub). The rubrics are then used for training via RuFT (pick the best examples) and RuRL (give reward signals).
Hook: You know how you learn a new sport: watch examples, collect tips from multiple coaches, then practice tougher drills.
The Concept (Coarse-to-Fine Pipeline): A three-step recipe to build better rubrics. How it works: 1) Generate with grounding and principles, 2) Merge across models, 3) Sharpen using top answers. Why it matters: Skipping a step causes drift, bias, or easy rubrics that max out too soon.
Anchor: For a "nature essay," we first link to real essays, then merge coach tips, then add pro-level criteria like "central thesis" and "three cause-effect examples."
Stage 1: Response-Grounded & Principle-Guided Generation
- What happens: For each question, the system looks at a real example answer and a set of quality principles (consistency, coverage, clarity, evaluability). It then writes a set of small, checkable criteria with weights.
- Why it exists: Generating from the question alone can produce vague, off-target rules. Grounding in a response keeps criteria practical; principles keep them clean and non-overlapping.
- Example: Question: "Write a 500-word essay about nature." Grounded criteria include "Word count 450-550," "Clear thesis about nature," "At least three concrete observations," and "Logical structure."
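To make this concrete, here is a minimal Python sketch of how Stage 1 might be implemented. The `generate` callable, the prompt wording, and the JSON rubric format (a list of criteria with text and weight) are illustrative assumptions, not the paper's exact interface.

```python
import json
from typing import Callable, Dict, List

# Assumed rubric format: a list of small, checkable criteria with weights,
# e.g. {"text": "Clear thesis about nature", "weight": 2.0}.
Rubric = List[Dict[str, object]]

PRINCIPLES = [
    "Consistency: criteria must not contradict or overlap with each other",
    "Coverage: together the criteria should span every aspect of quality",
    "Clarity: no vague wording such as 'be better'",
    "Evaluability: each criterion must be checkable as a pass/fail decision",
]


def generate_candidate_rubric(question: str,
                              example_answer: str,
                              generate: Callable[[str], str]) -> Rubric:
    """Stage 1 sketch: one strong model drafts response-grounded,
    principle-guided criteria. `generate` is any text-completion callable."""
    prompt = (
        "Write a grading rubric for the question below.\n"
        f"Question: {question}\n\n"
        f"Ground the criteria in this example answer:\n{example_answer}\n\n"
        "Follow these principles:\n- " + "\n- ".join(PRINCIPLES) + "\n\n"
        'Return JSON only: [{"text": "...", "weight": 1.0}, ...]'
    )
    return json.loads(generate(prompt))
```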
Stage 2: Multi-Model Aggregation
- What happens: Several strong models generate their own criteria. The system then merges them conservatively: only identical meaning gets merged; different details stay.
- Why it exists: One model might miss an angle (e.g., tone), another might notice it. Keeping distinct, non-duplicate criteria widens coverage and reduces single-model bias.
- Example: If Rubric A says "Use vivid images," and Rubric B says "Use sensory details (sight, sound, smell)," they might stay separate if scopes differ; "Power on" vs. "Device powered up" would merge.
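A minimal sketch of the conservative merge, assuming the same list-of-criteria format as above and that `same_meaning` is some duplicate test (for example, an LLM yes/no judgment); both are assumptions beyond the paper's description.

```python
from typing import Callable, Dict, List

Rubric = List[Dict[str, object]]  # each item: {"text": str, "weight": float}


def aggregate_rubrics(candidates: List[Rubric],
                      same_meaning: Callable[[str, str], bool]) -> Rubric:
    """Stage 2 sketch: conservative union of criteria from several models.
    Only criteria judged to say the same thing are merged; everything else
    is kept, so the aggregated rubric retains each model's distinct angles."""
    merged: Rubric = []
    for rubric in candidates:
        for cand in rubric:
            duplicate = any(same_meaning(str(cand["text"]), str(kept["text"]))
                            for kept in merged)
            if not duplicate:
                merged.append(cand)
    return merged
```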
Stage 3: Difficulty Evolution
- What happens: The system finds two top-quality answers and asks, "What subtle things make Answer 1 stronger than Answer 2?" It then adds new, stricter criteria to capture those nuances.
- Why it exists: Basic checks canât separate great from excellent. Tougher, targeted checks keep scores spread out for high performers, preventing a ceiling effect.
- Example: Upgrade from "Has examples" to "Includes at least three cause-effect examples that support the thesis."
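A sketch of the evolution step under the same assumptions (list-of-criteria rubrics, a `generate` text-completion callable, illustrative prompt wording):

```python
import json
from typing import Callable, Dict, List

Rubric = List[Dict[str, object]]


def evolve_difficulty(question: str,
                      rubric: Rubric,
                      stronger: str,
                      also_strong: str,
                      generate: Callable[[str], str]) -> Rubric:
    """Stage 3 sketch: mine the gap between two excellent answers for new,
    stricter pass/fail criteria and append them to the existing rubric."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A (stronger):\n{stronger}\n\n"
        f"Answer B (also strong):\n{also_strong}\n\n"
        "Existing criteria:\n" + "\n".join(f"- {c['text']}" for c in rubric) +
        "\n\nList the subtle qualities that make Answer A stronger and are not "
        "already covered above, as new stricter pass/fail criteria.\n"
        'Return JSON only: [{"text": "...", "weight": 1.0}, ...]'
    )
    return rubric + json.loads(generate(prompt))
```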
Building RubricHub
- Inputs: About 110k cleaned, open-ended questions across five domains: Medical, Science, Writing, Instruction Following, and Chat.
- Outputs: For each question, a final rubric with many fine-grained, weighted criteria (often 25-32 items in complex domains like Writing and Medical).
- Why it matters: With more criteria and clearer checks, score distributions spread out. Different models land at different score levels, confirming the rubric's discriminative power.
Using Rubrics for Training
Hook: When studying, you keep your best notes and also do practice that gives instant feedback.
The Concept (RuFT, Rubric-based Rejection Sampling Fine-Tuning): Curate top-quality training data by sampling multiple answers and keeping only the best-scoring ones per rubric. How it works: 1) Generate several candidate answers, 2) Score each against the rubric, 3) Keep the highest above a threshold, 4) Train on these high-quality pairs. Why it matters: Without this filter, training data includes weak examples that teach bad habits.
Anchor: If you write six essays, you keep the strongest one (measured by the rubric) for your portfolio.
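A minimal sketch of RuFT-style curation; the sampling callable, scoring callable, sample count, and threshold below are assumptions chosen for illustration, not the paper's settings.

```python
from typing import Callable, Dict, List


def ruft_curate(questions: List[str],
                sample_answers: Callable[[str, int], List[str]],
                rubric_score: Callable[[str, str], float],
                n_samples: int = 6,
                threshold: float = 0.8) -> List[Dict[str, str]]:
    """RuFT sketch: rubric-based rejection sampling for fine-tuning data.
    `sample_answers(q, n)` draws n candidate answers from the policy model;
    `rubric_score(q, a)` returns a 0-1 rubric score. Keep only the best
    answer per question, and only if it clears the quality bar."""
    kept: List[Dict[str, str]] = []
    for q in questions:
        scored = [(rubric_score(q, a), a) for a in sample_answers(q, n_samples)]
        best_score, best_answer = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append({"prompt": q, "response": best_answer})
    return kept
```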
The Concept (RuRL, Rubric-based Reinforcement Learning): Use rubric criteria as step-by-step rewards while the model learns. How it works: 1) For each criterion, a grader (rules for simple checks or a strong LLM for semantic checks) decides pass/fail, 2) Add up weighted passes to form a reward, 3) Optimize the model with RL (e.g., DAPO). Why it matters: Without dense, structured rewards, the model gets fuzzy signals and learns slowly.
Anchor: During practice, you get points for thesis clarity, structure, evidence, and tone, so you know exactly what to fix next.
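The reward itself reduces to a weighted fraction of passed criteria. A minimal sketch follows; the criterion format and the `check` grader callable are the same assumptions as above, and wiring this reward into a DAPO-style RL loop is left to standard RLHF tooling.

```python
from typing import Callable, Dict, List

Rubric = List[Dict[str, object]]


def rubric_reward(answer: str,
                  rubric: Rubric,
                  check: Callable[[str, str], bool]) -> float:
    """RuRL reward sketch: every criterion is judged pass/fail by `check`
    (rule-based or LLM-based), and the weighted fraction of passes becomes
    a dense reward between 0 and 1."""
    total = sum(float(c["weight"]) for c in rubric)
    if total == 0.0:
        return 0.0
    passed = sum(float(c["weight"]) for c in rubric
                 if check(str(c["text"]), answer))
    return passed / total
```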
Grading Details (simple, no equations)
- Two grader types: Rule-based for objective checks (like word count, presence of a heading), and LLM-based for semantic checks (like tone, reasoning depth).
- Binary scoring: Each criterion is either met or not met; weights say how important it is. Summing weighted passes gives a stable, dense reward between 0 and 1.
- Positive-only works best: Adding negative penalty criteria made learning noisier; training with only positive-weighted checks was more stable and higher-scoring in medical tests.
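A sketch of how the two grader types might sit behind one interface; the routing heuristic, the word-count pattern, and the YES/NO judge protocol are illustrative assumptions rather than the paper's grading setup.

```python
import re
from typing import Callable


def make_grader(llm_judge: Callable[[str], str]) -> Callable[[str, str], bool]:
    """Grader sketch: objective criteria go to cheap rule checks, everything
    else to an LLM judge. Returns a check(criterion, answer) -> bool callable
    compatible with the reward sketch above."""
    def check(criterion: str, answer: str) -> bool:
        # Rule-based path: e.g. "Word count 450-550" can be verified directly.
        match = re.search(r"[Ww]ord count (\d+)\s*-\s*(\d+)", criterion)
        if match:
            lo, hi = int(match.group(1)), int(match.group(2))
            return lo <= len(answer.split()) <= hi
        # LLM-based path: semantic criteria such as tone or reasoning depth.
        verdict = llm_judge(
            f"Criterion: {criterion}\nAnswer:\n{answer}\n\n"
            "Does the answer satisfy the criterion? Reply YES or NO."
        )
        return verdict.strip().upper().startswith("YES")
    return check
```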
Secret Sauce
- Response-grounding prevents drift.
- Multi-model aggregation reduces bias and widens coverage.
- Difficulty evolution preserves headroom for growth.
- Positive-only rewards stabilize RL optimization.
Concrete Mini Walkthrough
- Pick a question: "Explain why a post-surgery ankle turns red when lowered."
- Stage 1 creates criteria: explain gravity pooling, mention post-surgery circulation, list red flags, advise follow-up, use plain language.
- Stage 2 merges in extra angles: exact phrasing for medical cautions, clarity about normal vs. danger signs.
- Stage 3 adds finer checks: "Defines 'dependent rubor' explicitly," "Lists at least four warning signs," "Organizes advice into sections."
- RuFT samples multiple answers, keeps the best one that meets many criteria.
- RuRL trains with binary checks per item guiding steady improvement.
Result: A challenging, fair rubric that helps the model write safer, clearer, and more complete medical explanations.
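For concreteness, a hypothetical final rubric for this walkthrough, in the list-of-criteria format used in the sketches above; the wording and weights are invented for illustration and are not taken from RubricHub.

```python
ankle_rubric = [
    # Stage 1: response-grounded, principle-guided base criteria
    {"text": "Explains that gravity pools blood in the lowered limb", "weight": 2.0},
    {"text": "Mentions altered circulation after surgery as a factor", "weight": 2.0},
    {"text": "Lists red-flag symptoms that need urgent care", "weight": 3.0},
    {"text": "Advises appropriate follow-up with the surgical team", "weight": 2.0},
    {"text": "Uses plain, non-technical language", "weight": 1.0},
    # Stage 2: angles contributed by other models during aggregation
    {"text": "Distinguishes normal post-operative redness from danger signs", "weight": 2.0},
    {"text": "States cautions without giving a definitive diagnosis", "weight": 1.0},
    # Stage 3: evolved, stricter criteria that separate excellent answers
    {"text": "Defines 'dependent rubor' explicitly", "weight": 1.0},
    {"text": "Lists at least four specific warning signs", "weight": 2.0},
    {"text": "Organizes the advice into clearly labeled sections", "weight": 1.0},
]
```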
04 Experiments & Results
The Test: The team evaluated across five domains to see if the rubrics actually improve real-world skills.
- Medical: HealthBench and LLMEval-Med check safety, accuracy, and communication.
- Science: ResearchQA and GPQA-Diamond test tough knowledge and reasoning.
- Instruction Following: IFEval and IFBench check rule-following and structure.
- Writing: WritingBench and CreateWriting-V3 test coherence, creativity, and style.
- Chat: Arena-Hard V2 and internal surveys test overall helpfulness and multi-turn quality.
Hook: It's like testing athletes in speed, strength, agility, strategy, and teamwork, not just one event.
The Concept (Scoreboard with Context): Compare the same base model trained different ways and against other strong models. How it works: 1) Start from a base (untrained) model, 2) Add RuFT (filtered data), 3) Add RuRL (rubric rewards), 4) Use both (RuFT→RuRL). Why it matters: If scores climb with each stage, the recipe works.
Anchor: On Arena-Hard V2 chat, Qwen3-14B jumped from 5.2 (base) to 74.4 (with both stages), like going from benched to all-star.
Main Highlights
- HealthBench (Medical): Qwen3-14B with RubricHub reached 69.3, beating GPT-5 at 67.2, impressive for a smaller, open model.
- Consistent Gains: Across both 4B and 14B backbones, the order Base < RuFT < RuRL < RuFT→RuRL held across domains.
- Chat Boost: The biggest leap appeared in general chat (Arena-Hard V2): from 5.2 to 74.4 after full pipeline.
- Replacing Older Rubrics: Regenerating RaR rubrics using this pipeline improved results further, showing that better rubrics alone raise performance (e.g., HealthBench from 47.7 to 62.1 before full pipeline).
Surprising Findings
- Positive-only criteria outperformed mixes with negative penalties in medical RL: HealthBench 66.2 vs. 63.2, LLMEval-Med 75.3 vs. 74.2. Simpler, positive checks gave steadier learning.
- Grader Reliability: Agreement with human judgments rose with grader size and then leveled off (F1 up to ~0.90, Cohen's kappa around 0.74-0.80). Bigger isn't endlessly better; after a point, gains are tiny.
- Headroom Preserved: Even strong models averaged only about 0.6 on evolved rubrics, proving the criteria remain challenging and avoid saturation.
Competition Landscape
- Against proprietary models (Gemini 3 Pro Preview, GPT-4.1, GPT-5) and other rubric-trained systems (Baichuan-M2-32B, Rubicon-Preview), the Qwen3-14B with RubricHub-based training was competitive or better in multiple domains, especially Medical, Instruction Following, and Chat.
Training Dynamics
- Balanced Growth: Scores for Completeness, Accuracy, Communication, Context, and Instruction Following rose together, suggesting the model learned holistically, not just gaming a single metric.
Takeaway: The coarse-to-fine rubrics reliably turn into better training signals. With RuFT and RuRL, models get not just higher scores but more dependable, explainable improvementsâlike going from a B- average to an A across many classes, not just one.
05 Discussion & Limitations
Limitations
- Domain Scope: While strong in open-ended tasks (medical advice, writing, chat), coverage of purely verifiable domains like advanced math or competitive coding is limited. Long, multi-step agent tasks are also not deeply explored.
- Grader Reliability and Size: Small graders struggle with subtle checks, and adding "pitfall" penalties introduces noise. High-quality grading currently leans on large, costly models, slowing iterations.
- Efficiency: Rubric-based RL needs many rollouts and grader calls, creating latency and compute overhead. Even with parallelization, it's resource-hungry.
Required Resources
- A curated, multi-domain query pool (~110k).
- Several strong LLMs for candidate generation and aggregation (or access to their outputs).
- A capable grader model (e.g., ~100B parameter class) for semantic checks, plus rule-based graders for simple checks.
- GPUs for RL (e.g., multi-GPU training for DAPO-style optimization).
When NOT to Use
- Tasks with fixed ground truth and simple verifiable checks (unit tests) may not need rubric complexity.
- Extremely constrained compute budgets, where frequent LLM grading is infeasible.
- Scenarios without clear criteria or where objectives rapidly change, making rubrics obsolete quickly.
Open Questions
- Can we design compact, specialized grader models that match large-model reliability at far lower cost?
- How can we auto-tune weights and criteria to reduce human oversight while avoiding overfitting?
- Can difficulty evolution adapt online, personalizing rubrics per domain and per model skill level?
- How well do these methods transfer to long-horizon agents and strictly verifiable domains like math and code without losing stability?
06 Conclusion & Future Work
Three-Sentence Summary: RubricHub builds a massive, multi-domain set of fine-grained, discriminative rubrics using a coarse-to-fine pipeline: generate with grounding and principles, aggregate across models, and evolve difficulty using top answers. These rubrics power two training stages, RuFT for picking high-quality examples and RuRL for dense, structured rewards, leading to large, reliable gains across domains. A 14B model trained with RubricHub beats larger or proprietary models on key medical benchmarks, proving that better feedback can outweigh sheer size.
Main Achievement: Turning open-ended evaluation into a scalable, automated, and highly discriminative process that unlocks steady improvement and state-of-the-art results with smaller models.
Future Directions: Build compact high-precision graders, broaden to verifiable math/coding and long-horizon agent tasks, and improve efficiency via hybrid serial-parallel scoring and smarter sampling. Explore online rubric evolution and automatic weighting to keep criteria fresh and robust.
Why Remember This: In AI, the quality of feedback is power. RubricHub shows that clear, precise, and evolving rules can teach models to write safer medical advice, follow instructions, and communicate clearly, moving beyond "good enough" to "consistently excellent," even in messy, open-ended worlds.
Practical Applications
- Build safer medical assistants that clearly warn about urgent symptoms and explain next steps in plain language.
- Create writing tutors that grade on thesis, evidence, structure, and style with concrete, actionable feedback.
- Improve instruction-following systems that must obey detailed formats, constraints, and multi-step rules.
- Curate higher-quality training data by sampling many answers and keeping only rubric-verified best ones (RuFT).
- Train models with dense, structured rewards (RuRL) to steadily raise quality across multiple skills at once.
- Regenerate weak or outdated rubrics in existing datasets to lift benchmark performance without new human labels.
- Design domain-specific evaluators that mix rule checks (format, length) with LLM checks (tone, reasoning).
- Build robust internal QA for chatbots so subtle flaws are caught and fixed before deployment.
- Support content moderation with fine-grained, transparent criteria for clarity, safety, and context.
- Assist educators by transforming assignment requirements into clear, testable rubrics for faster, fairer grading.