Puzzle Curriculum GRPO for Vision-Centric Reasoning
Key Summary
- This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
- It creates three self-checkable puzzle games (Jigsaw, Rotation, PatchFit) so the model can earn rewards without any teacher.
- A special training plan (curriculum) focuses learning on medium-difficulty puzzles, which are the most educational.
- The Jigsaw puzzle gives partial credit, so the model learns from almost-right answers instead of only perfect ones.
- They track Reasoning-Answer Consistency (RAC) to check if the model's final answer matches its own step-by-step thinking.
- RAC usually rises early but then falls with standard GRPO; the puzzle curriculum slows this drop and improves stability.
- Across many visual reasoning benchmarks, the method improves accuracy and training behavior on 3B and 7B models.
- The approach also reveals that popular benchmarks contain lots of noisy or unclear questions, which they help clean.
- The result is a practical, scalable, and interpretable recipe for reinforcement learning in vision without human annotations.
Why This Research Matters
This work shows we can grow visual reasoning skills in AI without paying for massive amounts of human labels. By turning images into self-checkable puzzles, training becomes cheaper, cleaner, and easier to scale. The curriculum makes learning efficient by focusing on puzzles that teach the most, not those that are trivial or impossible. Tracking consistency between reasoning and answers builds trust and makes systems easier to debug and improve. In practice, this means better assistants for education, accessibility, science, and everyday tasks that depend on understanding what's in a picture.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning to ride a bike. If a grown-up has to hold you the whole time and tell you every move, you learn slowly and it's expensive for them. But if you can practice safely in a playground with clear signals, like lines on the ground and gentle slopes, you can learn a lot on your own.
The Concept (Vision-Language Models, VLMs): What it is: Vision-Language Models are AIs that look at images and talk about them. How it works: they see a picture (vision) and read/write text (language) to answer questions or explain scenes. Why it matters: Without good training, they may miss details, guess, or contradict themselves. Anchor: A VLM can look at a photo of a kitchen and answer "How many cups are on the table?"
Hook: You know how coaches set up drills that reward you each time you kick the ball better? That's like learning from feedback, not from being told the answer.
The Concept (Reinforcement Learning, RL): What it is: RL is a way for AI to learn by trying things and getting rewards for doing well. How it works: 1) Try an action, 2) receive a reward or no reward, 3) adjust behavior to get more rewards next time. Why it matters: Without rewards, the AI doesn't know what choices are good. Anchor: A model that gets a gold star each time it correctly spots a cat in a photo learns to do it more often.
Hook: When you solve a picture puzzle, you're doing visual reasoning: using clues like shapes, positions, and colors.
The Concept (Visual Reasoning): What it is: Thinking carefully about what's in an image and how parts relate. How it works: read visual features (edges, colors), connect them (the round thing is a ball), and combine facts (the ball is under the table). Why it matters: Without visual reasoning, an AI can't answer grounded questions like "Which window is open?" Anchor: Looking at a street photo and deciding which car is parked closer to the tree.
Hook: When your teacher asks you to "show your work," you write down your steps so your answer makes sense.
The Concept (Chain-of-Thought, CoT): What it is: Chain-of-thought is the AI writing its step-by-step reasoning. How it works: it generates a short plan or rationale before the final answer. Why it matters: Without CoT, we can't see why the AI decided something, and it's harder to fix mistakes. Anchor: "First, I count the apples in the top row (3). Next, I count the bottom row (2). So total apples = 5."
Hook: Think of a game where you only get a point if you are exactly right, with no half credit. That's tough and not very helpful for learning.
The Concept (Reward Sparsity): What it is: Reward sparsity happens when the model gets only all-or-nothing feedback. How it works: most tries return zero, so there's little signal to learn from. Why it matters: Without smaller, guiding rewards, training stalls or becomes unstable. Anchor: A spelling quiz that gives 0 points unless a word is perfectly correct is hard to improve from.
Hook: If everyone in a race runs either super slowly or super fast, comparing small differences won't teach you much about pacing.
The Concept (GRPO, Group-Relative Policy Optimization): What it is: GRPO is a training method where multiple tries for the same input are compared to each other to decide which tokens to reinforce. How it works: generate several answers, score them, and nudge the model toward the relatively better ones. Why it matters: Without good spread (not too easy or too hard), the "relative" advantage shrinks and learning weakens. Anchor: Comparing 8 different attempts to solve the same puzzle, then copying the style of the best-performing attempt.
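To make the "relative" idea concrete, here is a minimal sketch (illustrative Python, not the paper's codebase) of how one group of attempts can be scored against itself; the function name and toy rewards are assumptions for demonstration.

```python
# Group-relative scoring: each attempt is judged against the other attempts
# for the same puzzle, not against an absolute standard.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Mean-center and scale rewards within one group of attempts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight attempts at one puzzle; only two earned the reward.
print(group_relative_advantages([0, 0, 1, 0, 0, 1, 0, 0]))
# If every attempt scores the same (all 0 or all 1), the advantages collapse to
# zero, which is exactly the "flat update" problem on too-easy or too-hard items.
```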
Hook: Imagine grading homework that you can check automatically, like a lock that opens only with the right code.
The Concept (RL with Verifiable Rewards, RLVR): What it is: RLVR means using tasks where the correctness of an answer can be programmatically checked. How it works: set up puzzles with clear right/wrong checks or partial-credit rules; the computer can grade instantly. Why it matters: Without verifiable tasks, you'd need lots of humans or external judges, which is costly and noisy. Anchor: A rotation task where the program knows the image was turned 90°, so it can check if the model said 90°.
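A minimal sketch of what "programmatically checked" can mean for the Rotation example above; this tiny verifier is illustrative, assuming the trainer recorded the angle it applied itself.

```python
def rotation_reward(predicted_angle: int, true_angle: int) -> float:
    """Binary verifiable reward: the environment knows the true angle it applied."""
    return 1.0 if predicted_angle % 360 == true_angle % 360 else 0.0

print(rotation_reward(90, 90))   # 1.0 (correct)
print(rotation_reward(180, 90))  # 0.0 (wrong)
```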
Hook: Sometimes your explanation says one thing, but your final answer says another, like writing "3+4=7" but then circling "8."
The Concept (Reasoning-Answer Inconsistency): What it is: When the model's explanation and final answer don't match. How it works: the rationale supports answer A, but the final answer is B. Why it matters: It breaks trust and makes the model less useful, even if the score looks fine. Anchor: The model writes, "I see three red cups and two blue cups," then answers "4 cups total."
The world before: VLMs got better with instruction tuning and RL. But for vision, rewards were hard to verify without costly labels or third-party judges. Binary rewards were sparse; GRPO updates went flat on very easy or very hard items. And chain-of-thought often drifted from the final answer over training.
The problem: How can we 1) avoid expensive labels/judges, 2) reduce sparse/flat rewards, and 3) keep reasoning consistent with final answers?
What people tried: Supervised fine-tuning with human labels, critic/judge models to verify answers, vanilla GRPO on limited puzzles. These helped but were expensive, added bias/noise, and didn't fix flat or sparse rewards, or the growing inconsistency.
The gap: A supervision-free, verifiable, curriculum-based RL recipe that improves visual reasoning, stabilizes training, and boosts reasoning-answer faithfulness.
Real stakes: Better household robots, safer car assistants, smarter accessibility tools, clearer tutoring systems, and less wasted money on labels. If the model's answers match its reasoning, and we can cheaply verify learning, everyone wins.
02 Core Idea
Hook: Picture a school where students learn by solving self-checking puzzles. The teacher doesn't grade every page; the puzzles give points automatically, and the coach makes sure drills are not too easy or too hard.
The Concept (PC-GRPO): What it is: Puzzle Curriculum GRPO is a supervision-free training recipe that teaches VLMs visual reasoning using self-graded puzzles, a difficulty-aware curriculum, and consistency monitoring. How it works: 1) Replace labels with three verifiable puzzles, 2) weight training toward medium-difficulty cases, 3) track reasoning-answer consistency over time and optionally add a light consistency bonus. Why it matters: Without this trio, training is costly, rewards are too sparse or flat, and the model's explanation can contradict its answer. Anchor: The model learns with Jigsaw (partial credit), Rotation (binary check), and PatchFit (binary check) while the curriculum keeps it in the "Goldilocks" zone of difficulty.
The "Aha!" Moment in one sentence: If you turn vision reasoning into self-checkable puzzles and focus training on medium difficulty while watching consistency, you can grow strong reasoning without human labels or external judges.
Multiple analogies:
- Playground analogy: The puzzles are safe playgrounds where falls are small and feedback is instant; the coach (curriculum) keeps the games engaging but not frustrating; a mirror (RAC) checks that your story matches your moves.
- Music practice analogy: Etudes (puzzles) drill specific skills with immediate scoring; a smart metronome (curriculum) keeps tempo challenging but achievable; recording yourself (RAC) ensures your explanation of technique matches what you actually played.
- Cooking class analogy: Recipes (puzzles) have built-in taste tests (verifiable checks); the chef (curriculum) serves tasks that suit your current skill; a note card (RAC) makes sure your written steps match the dish on the plate.
Before vs. After:
- Before: Expensive, noisy labels; binary rewards that rarely guide partial progress; training updates flatten on too-easy/too-hard items; explanations drift from answers.
- After: No labels needed; graded partial credit from Jigsaw rewards small wins; curriculum emphasizes medium hardness; RAC monitoring improves faithfulness and helps pick better checkpoints.
Hook: You know how quizzes that give partial credit help you learn faster because they tell you what parts you did right?
The Concept (Graded Partial Credit): What it is: A scoring rule that gives points for parts you got right. How it works: in Jigsaw, each correctly placed tile earns credit; the score is the fraction of correct tiles. Why it matters: Without partial credit, the model sees lots of zeros, can't learn steady improvements, and training becomes unstable. Anchor: If you place 4 out of 9 tiles correctly, you still get 4/9 credit, which nudges you closer next time.
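As a sketch of this scoring rule (illustrative Python; the paper's exact scorer may differ), a Jigsaw answer can be graded by counting how many tiles landed in the right slots:

```python
def jigsaw_partial_credit(predicted_order, true_order):
    """Fraction of tiles placed in their correct slots, between 0.0 and 1.0."""
    assert len(predicted_order) == len(true_order)
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)

# 3x3 puzzle where 4 of the 9 tiles are in the right place -> 4/9, about 0.44
print(jigsaw_partial_credit([0, 1, 5, 3, 2, 4, 6, 8, 7],
                            [0, 1, 2, 3, 4, 5, 6, 7, 8]))
```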
Hook: Practice that is way too easy is boring; way too hard is discouraging. The sweet spot teaches best.
The Concept (Difficulty-Aware Curriculum): What it is: A training plan that gives higher weight to medium-difficulty examples. How it works: measure how spread-out a group's results are (or how diverse their solutions are for Jigsaw) and give a bell-shaped weight that peaks in the middle. Why it matters: Without this, updates go flat because very easy or very hard items provide little relative signal. Anchor: The system automatically spends more time on "just-right" puzzles, not on trivial or impossible ones.
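A minimal sketch of such a bell-shaped weight; the Gaussian form, its center, and its width are illustrative assumptions, not the paper's exact formula. For binary puzzles the input could be the group's success rate, and for Jigsaw a diversity score mapped to the same 0-1 range.

```python
import math

def curriculum_weight(difficulty_signal: float, center: float = 0.5, width: float = 0.2) -> float:
    """Peaks for medium groups; trivial (near 1.0) or hopeless (near 0.0) groups get little weight."""
    return math.exp(-((difficulty_signal - center) ** 2) / (2 * width ** 2))

for success_rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(success_rate, round(curriculum_weight(success_rate), 3))
# 0.5 -> weight 1.0, while 0.0 and 1.0 -> weight around 0.04
```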
Hook: If you say, "I counted 3 wheels," your final answer shouldn't say "2 wheels."
The Concept (Reasoning-Answer Consistency, RAC): What it is: A metric that checks if the explanation truly supports the final answer. How it works: a fixed expert judge model reads both and votes 1 for consistent, 0 for inconsistent; the team tracks RAC across training. Why it matters: Without RAC monitoring, you may think you're improving (rewards go up) while faithfulness goes down. Anchor: RAC helps pick better checkpoints because higher RAC often matches better real-world accuracy.
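A minimal sketch of how RAC could be computed over a batch of outputs; `judge_is_consistent` is a hypothetical stand-in for the fixed VLM judge described above (replaced here by a toy string check so the example runs).

```python
from typing import Callable, List, Tuple

def rac_score(samples: List[Tuple[str, str]],
              judge_is_consistent: Callable[[str, str], bool]) -> float:
    """Fraction of (rationale, final answer) pairs the judge votes as consistent."""
    votes = [judge_is_consistent(rationale, answer) for rationale, answer in samples]
    return sum(votes) / max(len(votes), 1)

# Toy judge: call it consistent if the final answer literally appears in the rationale.
toy_judge = lambda rationale, answer: answer in rationale
samples = [("I count 3 red cups and 2 blue cups, so 5 cups.", "5"),
           ("I see three red cups and two blue cups.", "4")]
print(rac_score(samples, toy_judge))  # 0.5
```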
Why it works (intuition, no equations):
- Verifiable puzzles remove label cost and reduce noise because the environment itself checks correctness.
- Partial credit turns big learning jumps into many small steps, which is friendlier to GRPO.
- Curriculum emphasizes the informative middle, preventing vanishing advantages when groups are too uniform.
- RAC monitoring keeps explanations aligned with answers, making the model more trustworthy.
Building blocks:
- Self-supervised puzzle trio: Jigsaw (partial credit), Rotation (binary), PatchFit (binary).
- Group-based training with GRPO but reweighted by difficulty.
- Consistency monitoring (and an optional lightweight consistency bonus) to align thoughts with answers.
03 Methodology
High-level overview: Input image → turn it into a puzzle (Jigsaw/Rotation/PatchFit) → model generates several think-and-answer attempts → environment scores each attempt (partial or binary) → compute difficulty and a curriculum weight for the group → apply a GRPO-style update that emphasizes better attempts and medium-difficulty groups → periodically measure RAC across training to watch faithfulness.
Hook: Imagine three game stations in a gym class, each of which auto-scores your try. The coach then decides which station to spend more time on based on how challenging it felt for the class.
The Concept (Self-Supervised Puzzle Environments): What it is: Automatically graded visual tasks created from ordinary images. How it works: the trainer cuts, rotates, or masks patches; the model proposes a solution; the environment checks it. Why it matters: Without these stations, we would need people or external judges to grade every answer. Anchor: From a COCO photo, we make a 3×3 Jigsaw; the model proposes a tile order, and the computer counts how many tiles are in the right spots.
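A minimal sketch of how such puzzles could be generated, assuming the Pillow imaging library is available; the grid size, angle set, function names, and file path are illustrative, not the paper's actual pipeline.

```python
import random
from PIL import Image

def make_rotation_puzzle(img: Image.Image):
    """Rotate by a random multiple of 90 degrees; the chosen angle is the free label."""
    angle = random.choice([0, 90, 180, 270])
    return img.rotate(angle, expand=True), angle

def make_jigsaw_puzzle(img: Image.Image, rows: int = 3, cols: int = 3):
    """Cut into a grid and shuffle the tiles; the shuffle permutation is the free label."""
    w, h = img.size
    tiles = [img.crop((c * w // cols, r * h // rows,
                       (c + 1) * w // cols, (r + 1) * h // rows))
             for r in range(rows) for c in range(cols)]
    order = list(range(rows * cols))
    random.shuffle(order)
    return [tiles[i] for i in order], order  # shuffled tiles + ground-truth permutation

img = Image.open("example_coco_photo.jpg")   # hypothetical local file
rotated_image, true_angle = make_rotation_puzzle(img)
shuffled_tiles, true_order = make_jigsaw_puzzle(img)
```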
Step-by-step recipe:
- Make puzzles from images. What happens:
  - Jigsaw: cut the image into a small grid (e.g., up to 3×3). The model must place each tile back in its correct position. Scoring: graded, the fraction of correct tiles.
  - Rotation: rotate the image by one of a few angles (0, 90, 180, 270 degrees). The model predicts which. Scoring: binary, 1 if correct, 0 otherwise.
  - PatchFit: mask a patch; show several similar-looking candidate patches; the model picks the true one. Scoring: binary, 1 if correct, 0 otherwise.
  Why this step exists: it transforms any dataset into a self-checking gym for visual reasoning. Example: A beach photo becomes a 2×3 Jigsaw; the environment knows the answer so it can grade instantly.
- Generate multiple attempts (rollouts) per puzzle. What happens: for each puzzle prompt, the model produces G different think-answer solutions (e.g., G=8), each with a reward from the environment. Why this step exists: GRPO compares attempts within the same group; without multiple attempts, there's no relative advantage signal. Example: For a Rotation puzzle, the 8 answers might split between 90° and 180°; only the correct angle gets reward 1.
- Compute difficulty and curriculum weight. What happens: the system estimates how hard the group was, then uses a bell-shaped weight that peaks at medium difficulty.
  - For binary tasks (Rotation, PatchFit): difficulty comes from the group's average success rate; too high (easy) or too low (hard) gets low weight, while the middle gets high weight.
  - For Jigsaw (graded): many different permutations can earn the same score, so difficulty looks at solution diversity across attempts; more diversity indicates a more informative (medium-hard) group.
  Why this step exists: without difficulty-aware weights, GRPO updates go flat on trivial or impossible items. Example: If every attempt says 0° rotation (all wrong), that's too hard and the weight is low; if attempts are split and varied, the weight is high.
- Apply GRPO-style update with the curriculum weight. What happens: the training algorithm boosts tokens from better-scoring attempts, but scales the whole group by the curriculum weight so that medium-difficulty groups influence learning the most (a small sketch of this scaling follows the recipe below). Why this step exists: it combines relative advantages (which attempt did better) with task informativeness (is this a useful training moment?). Example: Two Jigsaw groups get the same average score, but one has diverse permutations (more informative). The diverse group gets a higher weight and shapes the model more.
- Monitor Reasoning-Answer Consistency (RAC) over time. What happens: at checkpoints, a fixed large VLM judge reads the model's rationale and final answer and votes on whether they agree. Plot RAC across training. Why this step exists: rewards can rise even while faithfulness drops; RAC warns you when that happens and helps you pick better checkpoints. Example: RAC climbs early, then begins to fall late with vanilla GRPO; adding the curriculum and a small consistency bonus keeps RAC higher for longer.
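Tying the middle steps together, here is a minimal sketch of how group-relative advantages and the curriculum weight could be combined into the per-attempt scale a GRPO-style update would apply to each attempt's tokens. The Gaussian weight and the use of the group's mean reward as the difficulty signal are illustrative assumptions (the paper uses solution diversity for Jigsaw), and the actual policy update over token log-probabilities is omitted because it needs the full model.

```python
import math
from statistics import mean, pstdev

def curriculum_scaled_advantages(rewards, center=0.5, width=0.2, eps=1e-6):
    """Return one scaling factor per attempt: group-relative advantage x curriculum weight."""
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]      # which attempts did better
    weight = math.exp(-((mu - center) ** 2) / (2 * width ** 2))   # is this group informative
    return [weight * a for a in advantages]

# Medium-difficulty group (half of 8 attempts solve the puzzle): strong, useful signal.
print(curriculum_scaled_advantages([1, 0, 1, 0, 1, 0, 1, 0]))
# Too-hard group (nobody solves it): advantages are all zero and the weight is tiny,
# so this group barely moves the model.
print(curriculum_scaled_advantages([0, 0, 0, 0, 0, 0, 0, 0]))
```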
Concrete mini-examples with data:
- Jigsaw partial credit: Random tile orders average around 26% correct. After training, the modelâs average per-sample Jigsaw score rises steadily (like moving from 0.26 to well above that), indicating learning from partial wins.
- Rotation accuracy: From a 25% random baseline (four angles), the trained model exceeds chance and transfers this skill to certain spatial benchmarks.
- PatchFit choices: With several look-alike patches, chance is low; correct picks jump under the curriculum+consistency recipe, though transfer to other tasks is weaker.
The secret sauce:
- Partial credit in Jigsaw turns sparse, all-or-nothing learning into a smooth ladder of progress.
- Difficulty-aware curriculum maximizes signal where it matters most, at medium difficulty, reducing flat or vanishing advantages.
- RAC monitoring (and an optional lightweight consistency bonus) keeps the model honest: the story it tells should match the answer it gives.
What breaks without each step:
- Without verifiable puzzles: you need costly humans or noisy external judges.
- Without partial credit: learning stalls on near-misses.
- Without curriculum: updates flatten on too-easy/too-hard items; instability grows.
- Without RAC: you might select a late checkpoint with high puzzle reward but worse real-world reasoning faithfulness.
04 Experiments & Results
Hook: Think of a school report card that not only shows grades but also whether students' explanations actually match their final answers.
The Concept (The Test): What it is: The team trains on COCO images turned into puzzles, then tests on many public vision-reasoning benchmarks. How it works: compare PC-GRPO to strong baselines on accuracy and track RAC during training. Why it matters: We need to know if puzzle practice transfers to real tasks and if consistent reasoning predicts better results. Anchor: After puzzle practice, the model answers typical benchmark questions like "Which object is closest to the door?"
What they measured and why:
- Puzzle performance: to see if the model really learned each puzzle and whether skills transfer across puzzles (Jigsaw → Rotation, etc.).
- Benchmark accuracy: to check real-world visual reasoning improvements over Qwen baselines and recent RL methods.
- RAC over time: to see if explanations stay aligned with answers and whether higher RAC correlates with accuracy.
Competition (baselines):
- Qwen-VL-2.5 base, plus recent supervision-free and RL variants (e.g., Visual Jigsaw, VisualSphinx, Vision-Zero, ViCrit) and GRPO-CARE.
Scoreboard with context:
- Across diverse benchmarks, PC-GRPO variants improve over the base model and are competitive with or better than other annotation-free methods. Think of it like moving from a class average of B- to B+/A- across many subjects.
- Rotation-trained models often shine on spatial and perceptual tasks; Jigsaw-trained models benefit from partial credit and help stabilize reasoning.
- PatchFit alone is tough and transfers less, but mixing all puzzles gives the most balanced gains.
- RAC patterns: Vanilla GRPO shows RAC rising early then falling. Adding the curriculum slows this fall; adding a light consistency bonus (CARE) raises RAC further. Higher RAC often matches better downstream accuracy, and late checkpoints aren't always best.
Surprising findings:
- Inter-puzzle transfer is limited: practicing only Jigsaw doesn't automatically raise Rotation skill, and vice versa. Mixing puzzles helps.
- Benchmark noise is real: around 10-20% of items in some popular sets are mislabeled or underspecified. After cleaning (using a careful VLM committee for auditing), many models, including PC-GRPO, score higher, confirming that noise hid real progress.
Concrete numbers (made friendly):
- Chance levels: Rotation random ≈ 25%, Jigsaw random partial credit ≈ 26%. PC-GRPO rises clearly above these baselines after training.
- On large 7B backbones, curriculum+consistency recipes often outperform other label-free methods across multiple benchmarks, and are competitive with methods trained using human annotations.
Takeaways:
- Self-graded puzzles plus a medium-difficulty focus produce steadier learning and better reasoning.
- Keeping an eye on RAC helps avoid picking misleading final checkpoints.
- Mixed-puzzle training is the safest bet for broad transfer.
05 Discussion & Limitations
Hook: Training an athlete only on push-ups won't guarantee a faster sprint; you need the right mix of drills, the right difficulty, and honest feedback.
The Concept (Limitations): What it is: Realistic boundaries on what PC-GRPO can and cannot do today. How it works: identify where transfer is weak, where signals are proxy-based, and what resources are needed. Why it matters: Knowing limits helps improve the next version and avoid misuse. Anchor: If you only practice Jigsaws, don't expect instant mastery of Rotation; mix drills for balanced growth.
Specific limitations:
- Limited inter-puzzle transfer: Skills learned in one puzzle don't always help another; mixing puzzles reduces but doesn't erase this gap.
- Over-optimization risk: Rewards can keep rising while RAC falls; a very late checkpoint may be worse than a mid-training one.
- Judge dependence: RAC uses a fixed large VLM judge; if the judge is biased, RAC can be noisy (the paper uses a strong open-source judge to reduce this).
- Task coverage: Puzzles target perception and spatial reasoning but not every real-world skill (e.g., long video understanding or complex dialogues).
- Compute/time: While label-free, training still uses significant compute for multi-try rollouts and large backbones.
Required resources:
- A capable VLM backbone (e.g., Qwen-VL-2.5 3B/7B), GPU compute for GRPO rollouts, and the puzzle-generation toolkit.
- Optional: the consistency-aware add-on (CARE) and a fixed judge for RAC.
When not to use:
- If you already have high-quality, task-specific human labels that directly measure your target skill better than puzzles.
- If your application demands skills far from these puzzles (e.g., long-form video narration) and you can't adapt the puzzles accordingly.
- If you need guaranteed monotonic gains late in training without checkpoint selection.
Open questions:
- Can we design new puzzles that transfer better to more tasks (e.g., temporal puzzles for video, geometry puzzles for layout reasoning)?
- Can we formalize checkpoint selection using RAC (or other signals) in a principled, automated way?
- How do we reduce reliance on a large judge for RAC while keeping faithfulness high?
- Can we predict which puzzle mix best fits a target benchmark without brute-force trials?
06 Conclusion & Future Work
Three-sentence summary: This paper shows how to train vision-language models to reason about images using self-graded puzzles, a difficulty-aware curriculum, and consistency monitoring, all without human labels or external verifiers. Partial credit from Jigsaw plus medium-difficulty weighting makes GRPO updates more informative and stable, while RAC monitoring keeps the model's explanations aligned with its answers. The result is better accuracy, more faithful reasoning, and a practical, scalable path to vision-centric RL post-training.
Main achievement: A unified, supervision-free RLVR recipe, PC-GRPO, that replaces expensive labels with verifiable puzzles, fixes flat/sparse reward dynamics with a smart curriculum, and improves faithfulness via RAC.
Future directions: Invent richer puzzles that transfer to more tasks (including video), automate checkpoint selection with smarter signals, lighten RAC judging costs, and expand benchmark cleaning to reduce evaluation noise.
Why remember this: It's a blueprint for growing visual reasoning in a way that's scalable, verifiable, and interpretable: teaching models to think about what they see, explain their steps, and have their answers match their reasoning, all without a human constantly looking over their shoulder.
Practical Applications
- Build vision tutors that explain step-by-step how they counted objects or compared positions in a scene.
- Create safer driver-assist systems that reason about road scenes with explanations that match their decisions.
- Improve accessibility tools that describe images consistently for users with low vision.
- Accelerate robotics training by using self-checkable visual puzzles instead of expensive human annotations.
- Pre-train medical or industrial inspection models with puzzle-like pretext tasks to sharpen visual reasoning.
- Clean noisy benchmarks automatically by auditing labels with a committee of strong VLMs.
- Select better model checkpoints by monitoring reasoning-answer consistency instead of reward alone.
- Design domain-specific puzzles (e.g., floorplan jigsaws) to transfer skills to architecture or indoor navigation.
- Enhance educational apps that teach kids visual logic through graded puzzles with immediate feedback.