Self-Improving VLM Judges Without Human Annotations
Key Summary
- The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.
- It makes "practice pairs" of good vs. worse answers by either carefully changing details (for open-ended tasks like captions) or using majority voting (for closed-ended tasks like multiple choice).
- The judge model only learns from its own judgments when it gets them right and explains why; this keeps mistakes from snowballing.
- By swapping the order of answers A/B and keeping only judgments that are correct in both orders, the method reduces positional bias (favoring the first answer just because it's first).
- After several rounds of this loop, an 11B-parameter judge improves from about 0.38 to around 0.51–0.54 on VL-RewardBench, rivaling or beating much larger models in several areas.
- The biggest gains show up in general instruction following, hallucination detection, and VQA; safety shows little improvement because the training data did not target safety.
- Surprisingly, training with majority-vote "correct" answers can beat using gold labels, likely because it yields more training samples and more consistent signals.
- The framework is cheap, scalable, and works even when no ground-truth answers exist for new image domains.
- It points toward future "self-judges" that can keep up with rapidly improving VLMs without constant human re-labeling.
Why This Research Matters
This approach makes AI judging far cheaper and faster by removing the need for human preference labels, which are expensive and quickly get outdated. It helps the AI stay visually grounded, reducing hallucinations like inventing objects that aren't in the image. Because it works without ground-truth answers, it can evaluate new kinds of images (think new camera devices, data from field robots, or unusual art styles) where labeled datasets don't exist. That means better support for real-world tools like accessibility apps (describing scenes accurately), study helpers (checking diagram answers), and search systems (ranking better visual explanations). It also offers a blueprint for future self-updating judges that keep improving as models evolve, reducing maintenance costs. Over time, this can make multimodal AI more reliable, more transparent, and more widely available.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how teachers grade homework to help students learn what "good work" looks like? But imagine if the teacher had to write a brand-new answer key for every single new kind of question, every week. Exhausting, right?
The Concept (Vision-Language Models, VLMs): A vision-language model is an AI that looks at pictures and reads words together to answer questions or describe images.
- How it works (recipe):
- See an image and read a prompt.
- Connect what's seen (colors, objects, positions) with what's asked (words).
- Produce an answer, explanation, or caption.
- Why it matters: Without VLMs, AI can't handle tasks that need both sight and language, like "What is the boy holding in this photo?" Anchor: If you show a photo of a cat on a couch and ask "What is the animal doing?", a VLM can say, "A cat is sitting on the couch."
Hook: Imagine two classmates answering the same picture question. Who decides which answer is better?
The Concept (AI Judge/Reward Model): A judge model is an AI referee that compares two answers about an image and decides which one is better (and why).
- How it works:
- Read the question and the two answers.
- Check the image for facts (colors, counts, objects, positions).
- Choose the better answer and explain the choice.
- Why it matters: Without a good judge, it's hard to train or trust other AIs, because you don't know which answers are actually better. Anchor: Given two captions about a skyline, a judge should prefer "tall glass skyscrapers downtown" over "short brick houses in a suburb" if the image shows skyscrapers.
The world before: Most judge models needed tons of human preference labels (people picking which answer is better). That's expensive and slow. Plus, as AIs get smarter, yesterday's labels can go stale: old mistakes fade, and new kinds of tasks appear.
The problem: Can we train a fair, strong VLM judge without any human preference labels at all, so it stays cheap, fast, and up-to-date?
Failed attempts: Two main routes struggled:
- Scale up human labeling: Costly and quickly outdated.
- Distill from big closed models (like GPT/Claude): Still indirectly depends on human labels, and you become dependent on a bigger "teacher."
Hook: Imagine you're practicing essays; you learn more if you see clear examples of "great" vs. "flawed" writing.
The Concept (Open- vs. Closed-Ended Tasks): Questions come in two types. Open-ended (like captions or long explanations) don't have one exact right answer. Closed-ended (like multiple choice or numbers) do have one right answer.
- How it works:
- Open-ended: Many answers could be good; we compare quality details.
- Closed-ended: One answer is right; others are wrong.
- Use different strategies for each type.
- Why it matters: Without separating types, you might train on weak signals (open-ended) or make trivial examples (closed-ended), and the judge won't learn well. Anchor: "Describe this photo" (open-ended) vs. "How many birds?" (closed-ended).
The gap: We needed a human-free way to create strong training pairs (better vs. worse answers) and a method to safely learn from them without amplifying errors.
Real stakes: Better judges mean:
- Faster updates to keep up with new image domains (like new camera types or styles).
- Cheaper training (no big labeling budget).
- Stronger detection of hallucinations (saying things not in the picture).
- Improved trust in multimodal assistants for education, accessibility, and search.
Hook: Think of a coach who makes custom drills that reveal common mistakes, and then uses those drills to teach you how to judge good form.
The Concept (Synthetic Preference Pair Generation): This is making pairs of answers where one is deliberately better than the other, without needing humans to label them.
- How it works:
- Generate answers for a question about an image.
- For open-ended: make a "flawed" version by altering key details.
- For closed-ended: pick the majority-vote answer as likely correct, pair it with a different answer as likely wrong.
- Why it matters: Without these pairs, the judge can't practice deciding which answer is better. Anchor: Start with a good skyline caption; make a flawed one that wrongly says "suburban houses." Now we know which should win.
Hook: In class, if most students agree, they're probably right.
The Concept (Majority Voting Consensus): When you ask the model many times, the most common short answer is treated as preferred.
- How it works:
- Sample many answers.
- Count how often each appears.
- Pick the majority as likely best; pair it with a different answer.
- Why it matters: Without a way to pick likely correct answers cheaply, we'd rely on human or external labels. Anchor: If 10/16 model runs say "B" for a multiple-choice question, we pick "B" over a random alternative.
Hook: Detectives write down their reasoning so others can check it.
The Concept (Reasoning Traces): These are the written steps the judge uses to choose between answers.
- How it works:
- Judge explains why A or B is better.
- We keep only cases where the judge's final choice matches the known preferred answer.
- The judge then learns to produce these good explanations.
- Why it matters: Without reasoning traces, the judge might get the right answer for the wrong reason and won't truly improve. Anchor: "A is better because the photo shows tall glass towers; B wrongly says small brick houses."
02 Core Idea
Hook: Imagine making your own practice tests, knowing which answers are better, and then only studying the times you graded them correctly.
The Concept (Key Insight): The judge can train itself by creating clear better-vs.-worse answer pairs and then learning from its own correct judgments and explanations; no human preference labels needed.
- How it works:
- Make synthetic pairs: For open-ended tasks, alter important details to create a worse version; for closed-ended tasks, use majority vote to pick the likely best short answer and pair it with a different one.
- Ask the current judge to decide which answer is better and to explain why.
- Keep only decisions that match the known preference; also require correctness when A/B order is swapped to reduce positional bias.
- Fine-tune the judge on these âtrustedâ explanations and decisions.
- Repeat (iterate) so the judge keeps improving.
- Why it matters: Without this loop, you'd need huge, expensive human-labeled datasets to teach the judge. Anchor: Like spelling practice: you make pairs (correct vs. misspelled), check which the student picks, keep only the correct picks plus a good explanation, and study those to get better next time.
Three analogies:
- Coach and drills: The coach (the framework) designs drills (synthetic pairs) that reveal common mistakes (detail swaps). The athlete (judge) practices, reviews only the well-executed reps (correct judgments with sound reasoning), and improves.
- Crowd wisdom: For short answers, many tries create a "class vote." The most frequent answer is taken as likely correct; we compare it to another answer to train the judge.
- Detective training: Present two suspect stories; the trainee detective must pick the truer story and write down the logic. Only solid, consistent logic goes into the training manual.
Before vs. After:
- Before: Judges relied on human preference labels or distillation from big, closed models.
- After: Judges can bootstrap themselves with smart synthetic data and self-selected, high-confidence reasoning traces.
Why it works (intuition):
- Constructive supervision: By making the worse answer on purpose (open-ended) or picking likely-correct answers by majority (closed-ended), we know which side should win without outside labels.
- Reasoning-centered learning: Keeping only correct decisions (and in both A/B orders) filters out noise and positional bias, so the judge learns stable, image-grounded rules.
- Iterative refinement: As the judge improves, it filters more data and produces clearer reasoning, making each round stronger than the last.
Building blocks (with mini "sandwiches"):
Hook: Sometimes there's no single perfect caption, just better or worse ones. The Concept (Detail Alteration for Open-Ended): Make a flawed answer by changing key facts (e.g., object, number, place) but keeping style/length.
- How it works: Change 1–2 crucial details; keep everything else the same.
- Why it matters: Without meaningful contrasts, the judge can't practice spotting real visual mistakes. Anchor: Turn "glass skyscrapers, 40–50 stories" into "brick houses, 10–15 stories."
Hook: Class votes often reveal the right choice. The Concept (Majority Voting for Closed-Ended): Sample the model many times and pick the most frequent short answer as preferred.
- How it works: Generate N answers, choose the majority, pair with a different answer.
- Why it matters: It creates reliable training pairs without gold labels. Anchor: If 5+ of 16 runs pick "C," use "C" vs. a non-"C" answer as the training pair.
Hook: If you always like the first cookie on a plate just because it's first, that's bias. The Concept (Positional Bias Check): Ask the judge both A-vs.-B and B-vs.-A; keep pairs only if the judge still picks the same winner.
- How it works: Swap order; require consistent correctness.
- Why it matters: Filters out lazy "pick first" habits. Anchor: If the judge still chooses "A" even when it's listed second, that's a good sign.
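Here is a tiny Python sketch of what such an order-swap check could look like; the `judge` callable and its 'A'/'B' return convention are illustrative assumptions, not the paper's actual interface.

```python
def passes_order_swap(judge, prompt, chosen, rejected):
    """Keep a pair only if the judge prefers `chosen` in BOTH presentation orders.

    `judge(prompt, answer_a, answer_b)` is a hypothetical callable that returns
    'A' or 'B' for whichever answer it thinks is better.
    """
    forward_pick = judge(prompt, chosen, rejected)   # chosen shown first (as A)
    swapped_pick = judge(prompt, rejected, chosen)   # chosen shown second (as B)
    return forward_pick == "A" and swapped_pick == "B"

# Toy demo: a lazy judge that always picks the first answer fails the check.
always_first = lambda prompt, a, b: "A"
print(passes_order_swap(always_first, "Which caption fits the image?", "good", "bad"))  # False
```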
Hook: Writing down your thinking helps you think better next time. The Concept (Reasoning Trace Learning): Train on the judge's correct explanations.
- How it works: Fine-tune on correct chains of thought and final decisions.
- Why it matters: Builds habits of careful, visually grounded evaluation. Anchor: "Prefer A: building count matches image; B invents a park not seen."
03 Methodology
At a high level: Image + Question → Make synthetic answer pairs (good vs. worse) → Judge them with explanations → Keep only correct, order-robust judgments → Fine-tune on these traces → Repeat.
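As a rough sketch of one round of this loop in code (the data format and the `judge`/`fine_tune` callables are illustrative assumptions, not the paper's implementation):

```python
def self_training_round(pairs, judge, fine_tune):
    """One illustrative round of the self-improvement loop.

    pairs:     iterable of dicts with 'prompt', 'chosen', 'rejected', where the
               winner is known by construction (detail alteration or majority vote).
    judge:     callable(prompt, answer_a, answer_b) -> (reasoning_trace, 'A' or 'B').
    fine_tune: callable(kept_examples) -> updated judge model.
    """
    kept = []
    for pair in pairs:
        trace, pick = judge(pair["prompt"], pair["chosen"], pair["rejected"])
        _, swapped_pick = judge(pair["prompt"], pair["rejected"], pair["chosen"])
        # Keep only judgments that are correct in BOTH presentation orders.
        if pick == "A" and swapped_pick == "B":
            kept.append({"pair": pair, "trace": trace, "verdict": "A"})
    return fine_tune(kept)
```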
Step-by-step (with "why" and examples):
- Collect prompts and images
- What happens: Use a diverse multimodal dataset (e.g., LLaVA-OneVision) with single-image tasks across captions, reasoning, math, VQA.
- Why it exists: Variety teaches the judge to handle many failure types (counts, attributes, spatial relations, OCR, etc.). Without diversity, the judge overfits.
- Example: Image of a city skyline; question: "Describe this scene." Or: "How many traffic lights are visible?"
- Generate base answers
- What happens: Use the base VLM to produce multiple answers per image-question.
- Why it exists: We need raw material to craft both better and worse candidates. Without multiple generations, we canât form strong contrasts.
- Example: For a skyline caption, create several stylistic variants; for a numeric question, produce 16 short answers by sampling.
- Construct synthetic preference pairs differently by task type
3a) Open-ended (captions/long answers) via Detail Alteration
- What happens: Pick an original answer and create an altered version by changing 1–2 key facts (object type, count, color, spatial relation, time of day), keeping style/length the same.
- Why it exists: Teaches the judge to prefer visually grounded truth over plausible-sounding but wrong details. Without meaningful changes, the task is too easy or not visual enough.
- Example: "bustling downtown, 40–50 stories, glass facades in afternoon sun" → altered to "quiet suburb, 10–15 stories, brick facades in moonlight."
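The paper performs this alteration by prompting a VLM; the toy sketch below mimics the idea with a hand-written lookup table of detail swaps, so the table contents and function name are purely illustrative.

```python
# Toy stand-in for VLM-driven detail alteration: swap a few visual facts for
# plausible-but-wrong alternatives while keeping style and length the same.
ALTERATIONS = {
    "glass skyscrapers": "brick houses",
    "40-50 stories": "10-15 stories",
    "afternoon sun": "moonlight",
}

def make_open_ended_pair(caption, max_changes=2):
    """Return a chosen/rejected pair where the rejected caption has 1-2 altered details."""
    flawed, changes = caption, 0
    for original, wrong in ALTERATIONS.items():
        if original in flawed and changes < max_changes:
            flawed = flawed.replace(original, wrong)
            changes += 1
    return {"chosen": caption, "rejected": flawed}

print(make_open_ended_pair("Bustling downtown with glass skyscrapers, 40-50 stories, in afternoon sun."))
```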
3b) Closed-ended (short answers) via Majority Voting
- What happens: Sample N answers (e.g., N=16). Take the most frequent answer as preferred, pair it with a different answer as less preferred. Require at least 5 identical votes to keep the pair.
- Why it exists: Avoids trivial wrong answers and anchors on model-consistent outputs without human labels. Without the threshold, noisy majorities lower quality.
- Example: For a multiple-choice question, if "B" appears 7 times, we pair "B" vs. "D."
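A minimal sketch of this majority-vote pairing is shown below; the function name and output format are illustrative, while the 16 samples and 5-vote threshold follow the setup described above.

```python
from collections import Counter

def build_closed_ended_pair(sampled_answers, min_votes=5):
    """Pair the majority answer (preferred) with a different answer (less preferred).

    Returns None when the consensus is too weak or there is no alternative answer.
    """
    ranked = Counter(a.strip() for a in sampled_answers).most_common()
    top_answer, top_votes = ranked[0]
    if top_votes < min_votes or len(ranked) < 2:
        return None  # no usable training pair
    return {"chosen": top_answer, "rejected": ranked[1][0]}

# Example: 16 sampled short answers to a multiple-choice question.
samples = ["B"] * 7 + ["D"] * 5 + ["A"] * 4
print(build_closed_ended_pair(samples))  # {'chosen': 'B', 'rejected': 'D'}
```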
Hook: Like picking drills that make you think hard but not get stuck. The Concept (Quality Filtering by Design): By construction, we know which answer should win in each pair.
- How it works: Open-ended: original should beat altered; closed-ended: majority should beat other choice.
- Why it matters: Enables label-free training signals. Anchor: The intact skyline caption should beat the "suburb" caption.
- Judge with reasoning traces
- What happens: The current judge compares A vs. B, writes a short explanation (reasoning trace), and picks a winner.
- Why it exists: We need both the decision and the logic that produced it to train better future judgments. Without traces, the judge may learn shortcuts.
- Example: "Pick A because it correctly states tall glass towers; B incorrectly claims 10–15-story brick buildings."
- Order-robust filtering (positional bias check)
- What happens: Present both (A,B) and (B,A). Keep the sample only if the judge picks the same true winner in both orders.
- Why it exists: Prevents learning to prefer the first position or other superficial cues. Without it, the judge might "get lucky" in one order.
- Example: If A is truly better, the judge must pick A even when listed second.
- Keep only correct judgments and traces
- What happens: Retain examples where the judge's pick matches the known preferred answer (by construction) and passes the order-swap test. Store the explanation and final decision.
- Why it exists: Focuses learning on reliable reasoning patterns, not on mistakes. Without this, errors snowball.
- Example: Keep "A over B" with a clear, image-grounded explanation; discard the rest.
- Supervised fine-tuning (SFT)
- What happens: Train the judge to reproduce the correct explanations and decisions for the kept pairs.
- Why it exists: This is how the judge absorbs good habits. Without SFT, the model wouldn't internalize improved reasoning.
- Example: The judge learns to notice building height and materials when comparing captions about skylines.
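For concreteness, a kept judgment might be packed into a supervised fine-tuning record roughly like this; the prompt template and field names are assumptions for illustration, not the paper's exact format.

```python
def to_sft_record(question, answer_a, answer_b, reasoning_trace, verdict):
    """Turn one kept, order-robust judgment into a (prompt, target) training record."""
    prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Compare both answers against the image, explain your reasoning, "
        "then state which answer is better."
    )
    target = f"{reasoning_trace}\nFinal verdict: {verdict}"
    return {"prompt": prompt, "target": target}
```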
- Iterate until performance plateaus
- What happens: Repeat: generate new pairs, judge them, filter, and fine-tune. Stop when benchmark gains shrink (<1% improvement across several rounds).
- Why it exists: Each round strengthens the judgeâs filters and reasoning. Without iteration, improvement stalls early.
- Example: Iterations 1–4 show rising scores; later rounds give diminishing returns.
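One simple way such a plateau check could be coded is sketched below; the three-round window is an assumption, while the 1% threshold follows the stopping rule above.

```python
def should_stop(scores, window=3, min_gain=0.01):
    """Stop iterating once the benchmark score has gained less than ~1%
    over the last few rounds (illustrative plateau check)."""
    if len(scores) <= window:
        return False
    return scores[-1] - scores[-1 - window] < min_gain

print(should_stop([0.383, 0.51, 0.532, 0.535, 0.536, 0.538]))  # True: <1% gain over the last 3 rounds
```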
Concrete mini walk-through:
- Input: Photo of a bus with "Brighton Open Top" and tall white buildings.
- Step A (open-ended): Original caption says "red double-decker… tall buildings… cloudy sky." Altered caption says "blue bus… suburban shops… overcast."
- Step B: Judge explains that the red bus and tall buildings match the image, while "blue" and "suburban" are wrong. Picks the accurate caption.
- Step C: Swap order; judge still picks the accurate one. Keep this pair and its reasoning for training.
Secret sauce (what's clever):
- Labels by construction: We know which answer "should" win without any human marking.
- Reasoning-focused filtering: Only learn from correct, order-robust explanations; this trims noise and bias.
- Task-type-aware synthesis: Detail alteration for open-ended; majority voting for closed-ended. This avoids trivial or unhelpful pairs.
- Scale without ground truth: Works even on new visual domains with no labeled answers.
04 Experiments & Results
The test: Can a small (11B) judge, trained only on self-synthesized data and its own filtered reasoning traces, compete with much larger models on standard multimodal judge benchmarks?
Benchmarks and what they measure:
- VL-RewardBench (VLRB): General instruction following, hallucination detection, reasoning, knowledge, safety, VQA; a mix of human-verified and AI-annotated preference labels across real-world scenarios.
- Multimodal RewardBench (MMRB): 5,000+ preference pairs across long-form and short answers; labels from correctness or high-quality human experts. Dimensions include general, correctness, reasoning, safety, VQA.
Competition (baselines):
- Larger open/proprietary models: Llama-3.2-90B, Claude 3.5 Sonnet, GPT-4o.
- Our model: Llama-3.2-11B-Vision-Instruct, iteratively improved.
Setup highlights:
- Training data: 100k prompts from LLaVA-OneVision (single-image); cap 10k per sub-dataset for balance.
- Training: 5 epochs/iteration, LR 1e-5, 8 GPUs (effective batch 16), FSDP; stop after small gains. About 400 GPU-hours.
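Collected as a config sketch (the values are those reported above; the dictionary structure itself is just illustrative):

```python
# Reported setup, collected into an illustrative config dictionary.
TRAINING_SETUP = {
    "base_model": "Llama-3.2-11B-Vision-Instruct",
    "prompt_source": "LLaVA-OneVision (single-image subsets)",
    "num_prompts": 100_000,
    "cap_per_subdataset": 10_000,
    "epochs_per_iteration": 5,
    "learning_rate": 1e-5,
    "effective_batch_size": 16,   # 8 GPUs, FSDP
    "total_compute": "about 400 GPU-hours",
    "stop_rule": "benchmark gain < 1% across several iterations",
}
```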
Scoreboard with context:
- VLRB overall: From about 0.383 (base) up to roughly 0.51–0.54 after several iterations (iteration 4 ≈ 0.538). That leap is like moving from a C+ to a solid B+/A- compared to peers.
- MMRB overall: From about 0.499 to ≈ 0.539 (iteration 4), a steady improvement.
- Dimension wins/gains:
  - VLRB General: Biggest relative jump (≈ 0.297 → 0.503), outperforming larger models (e.g., Llama-90B, Claude 3.5 Sonnet) on this slice.
  - VLRB Hallucination: Strong rise (≈ 0.304 → 0.514), showing better grounding in the image.
  - MMRB VQA: Notable boost (≈ 0.565 → 0.689), reflecting effective closed-ended training via majority voting.
  - VLRB Reasoning: Improves but non-monotonically; peaks around iteration 3 (≈ 0.612) before a slight dip, suggesting possible overfitting.
  - Safety: Small or flat gains (e.g., MMRB Safety ≈ 0.317 → 0.329), likely because training didn't include safety-focused synthetic pairs.
Surprising findings:
- Majority vote can beat gold labels: For closed-ended tasks like reasoning/VQA, constructing pairs via majority voting led to better final performance than using ground-truth filtering alone. Why? Two likely reasons: (1) more and cleaner training samples per iteration (stronger signal); (2) selecting for reasoning consistency across many synthetic contrasts rather than relying on single-instance correctness.
- Order-robust filtering matters: Requiring correct judgments under both A/B and B/A orders reduces positional bias and improves reasoning stability.
- Iteration helps both quantity and quality: The percentage of kept training pairs rises across iterations (e.g., ~19% → ~43%), and blind checks find later iterations have better reasoning quality more often.
Interpretation:
- Open-ended detail alteration trains the judge to spot visually-grounded errors (attributes, counts, spatial relations), sharpening hallucination detection and general instruction following.
- Closed-ended majority voting produces high-yield, consistent supervision that transfers well to VQA-style tasks.
- Some domains (e.g., NoCaps with very diverse images) remain challenging; distribution mismatch limits gains.
Bottom line:
- A compact, self-trained 11B judge can reach or beat much larger models in several benchmark dimensions, while requiring no human preference labels and no external teacher.
05 Discussion & Limitations
Limitations:
- Safety: Minimal gains because the training pipeline didn't intentionally create pairs involving biased/toxic content or subtle policy violations. Robust safety judging usually needs red-teaming and specialized data.
- Domain shift: On very diverse image distributions (e.g., NoCaps), gains can plateau or dip; synthetic data may not cover the full visual variety.
- Non-monotonic reasoning: Reasoning peaked then dipped slightly, hinting at overfitting; iteration count and data mix may need tuning per dimension.
- Residual spurious reasoning: Even with filtering, some correct final choices may rely on style over facts; further checks are needed to ensure explanations stay image-grounded.
Required resources:
- Base model: Llama-3.2-11B-Vision-Instruct.
- Data: 100k prompts from LLaVA-OneVision (single-image subsets), balanced by capping per sub-dataset.
- Compute: ~400 GPU-hours (H100s), FSDP; 5 epochs/iteration, LR 1e-5, batch 16 effective.
When NOT to use:
- High-stakes safety auditing where missing edge cases is unacceptable; this needs targeted safety data.
- Domains with extreme visual diversity far from the training images; expect weaker generalization unless you broaden synthetic image sources.
- Situations where ground-truth correctness (not just preference) must be guaranteed for every instance.
- Very weak base models that canât produce sensible majority votes or realistic alterations may generate poor supervision.
Open questions:
- How to safely synthesize safety-related negatives (bias/toxicity) without amplifying harm?
- Can we auto-diagnose spurious reasoning and require evidence-grounded references (e.g., pointing to image regions)?
- Would a mixture-of-experts of specialized judges (hallucination, OCR, spatial, counting) plus a router outperform one general judge?
- How to diversify images and tasks on-the-fly to reduce domain mismatch (e.g., curriculum over image sources)?
- Can confidence calibration and uncertainty-aware filtering further boost stability across iterations?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces a self-improving framework that trains a vision-language judge entirely without human preference labels by synthesizing better-vs.-worse answer pairs and learning only from correct, order-robust judgments and explanations. Using detail alteration for open-ended tasks and majority voting for closed-ended ones, the judge steadily improves across iterations. A compact 11B model reaches around 0.51–0.54 on VL-RewardBench and shows strong gains on the hallucination, general, and VQA dimensions, sometimes rivaling far larger systems.
Main achievement: Turning self-synthesized contrasts plus reasoning-trace filtering into a practical, scalable alternative to human-preference supervision for multimodal judges, with solid benchmark results.
Future directions: Add targeted safety synthesis and red-teaming; incorporate visual evidence grounding (e.g., region pointers); expand image diversity to tackle domains like NoCaps; explore mixtures of specialized expert judges with a router; improve uncertainty-aware filtering to prevent overfitting and spurious logic.
Why remember this: It shows that a judge can teach itself to be fair and visually grounded using only its own generated data and careful filtering. That cuts costs, keeps pace with fast model progress, and opens the door to evaluation in brand-new visual domains where labels don't exist yet.
Practical Applications
- Build automatic graders for image-based homework (e.g., science diagrams) that compare two answers and explain which is better.
- Create quality filters for photo-captioning apps so only accurate, non-hallucinated captions are shown to users.
- Develop internal evaluation pipelines for VLM updates without paying for new human preference labels each cycle.
- Power image-based help tools (e.g., accessibility readers) with a judge that flags invented details before outputting descriptions.
- Use majority-vote pairing to cheaply improve OCR/VQA answer checking in enterprise document pipelines.
- Rapidly bootstrap judges for new image domains (medical devices, industrial cameras) where ground-truth labels are scarce.
- Continuously refine moderation/judging of multimodal agents that interact with web images or user photos.
- Pre-screen training data by judging and discarding visually inconsistent captions before model fine-tuning.
- Run A/B testing on generated marketing images/captions with an internal judge ranking for factual and stylistic accuracy.
- Support robotics by judging scene descriptions and task plans for visual correctness before execution.