PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Key Summary
- This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
- The authors build PROGRESS-BENCH, a fair test that gives a short demo of a robot task and one picture, then asks the model to estimate percent complete or say N/A if it cannot know.
- They find most models do poorly, especially when the camera view changes or when the demo is only text, and many fail to say N/A when the question is unanswerable.
- They introduce a human-like, two-stage reasoning method: first retrieve a similar moment from the demo (episodic retrieval), then imagine the small changes from that moment to now (mental simulation).
- Training-free prompting that forces this structure helps a little on big models but not on small ones.
- Training with a new 45K-sample dataset (25K supervised + 20K RL) creates PROGRESSLM-3B, which is much better calibrated and more robust than its base model.
- The trained model keeps good ordering of progress along a task and avoids making up answers on unanswerable cases.
- Vision demos beat text demos because text requires 'implicit state accumulation': remembering what has already happened between steps.
- Viewpoint changes are hard for most models, but PROGRESSLM trained with the two-stage method handles them more gracefully.
- This work turns progress estimation into an explicit, interpretable reasoning skill instead of a black-box guess.
Why This Research Matters
Robots and AI helpers must know not only what is present but how far a job has progressed to act safely and efficiently. A dishwasher-loading robot needs to stop at the right time; a warehouse robot must hand off items only when placement is complete. This paper shows a practical way to teach that skill: first find a similar moment from a demo, then make a small, explainable adjustment. It also adds an honesty check (saying N/A when information is inconsistent), which is crucial for safety. By building a fair test and a training method that works even for small models, this work pushes AI from static snapshots toward process understanding. That shift unlocks better human-robot teaming, smarter monitoring, and smoother automation in real settings.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you watch someone bake a cake, you can tell if they're mixing, pouring, or almost done, even from a single snapshot? Your brain tracks 'how far along' the job is, not just what the picture shows.
🥬 Filling (The Actual Concept): Progress reasoning is judging how far a task has gone based on what you see right now. It's different from naming things; it's about tracking changes over time.
- What it is: Estimating a percent-complete score for an ongoing task from one observation (like a single photo) plus a short demonstration of how the task should unfold.
- How it works: You compare the current moment to the demo, figure out roughly where you are, and then fine-tune that guess by reasoning about small differences (a minimal input/output sketch follows below).
- Why it matters: Without it, an AI can say 'That's a plate' but not 'The plate is 70% of the way into the rack.' It can't plan, monitor, or stop at the right time.
🍞 Bottom Bread (Anchor): Imagine a robot putting a book on a shelf. A single snapshot might show the book halfway in. Progress reasoning says, 'About 50% done,' not just 'book and shelf present.'
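To make the input/output contract concrete, here is a minimal, hypothetical query in Python. The field names and values are illustrative assumptions, not the benchmark's actual schema; the point is simply that one snapshot plus a labeled demo goes in, and a single percent (or N/A) comes out.

```python
# A hypothetical progress-estimation query (field names are assumptions):
query = {
    "task": "place the plate in the drying rack",
    "demonstration": [                       # short demo: steps with progress labels
        {"step": "approach the plate",       "progress": 0},
        {"step": "grasp and lift the plate", "progress": 50},
        {"step": "insert plate into rack",   "progress": 75},
        {"step": "release and retract",      "progress": 100},
    ],
    "observation": "single_photo.png",       # one snapshot of the ongoing task
}

# Expected output: one percent-complete estimate, or "N/A" if the observation
# cannot be matched to the demonstrated task at all.
expected_answer = "about 60%"
```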
🍞 Top Bread (Hook): Imagine two kinds of instructions: a picture recipe with photos vs. a text recipe with steps. Which is easier to follow at a glance? Photos!
🥬 Filling (The Actual Concept): Demonstration modalities shape how much help a model gets.
- What it is: Vision-based demos show key images with progress marks (0%, 25%, ..., 100%), while text-based demos list step-by-step actions with progress labels.
- How it works: For vision-based, you match the current photo to a similar demo frame. For text-based, you read an action list and mentally track how the world should have changed (see the data-structure sketch after this example).
- Why it matters: Text needs 'implicit state accumulation', remembering what changed between steps, so it is harder, and similar-sounding steps are easy to confuse.
🍞 Bottom Bread (Anchor): 'Lift plate' (Step 3) and 'cover pot' (Step 4) sound similar if you only see hands and a lid, but if the pumpkin is already on the plate, you must be at Step 4, not Step 1.
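A rough sketch of how the two demo modalities could be represented, assuming simple dataclasses (the class and field names here are my own, not the paper's):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VisionDemoStep:
    keyframe_path: str   # image of the task at this point
    progress: float      # labeled progress, e.g. 0.0, 0.25, ..., 1.0

@dataclass
class TextDemoStep:
    action: str          # e.g. "lift the plate above the rack"
    progress: float      # same labels, but the visual state stays implicit

# A vision demo lets the model match the observation against a keyframe directly;
# a text demo forces it to accumulate the implied state change step by step.
vision_demo: List[VisionDemoStep] = [
    VisionDemoStep("frame_000.png", 0.0),
    VisionDemoStep("frame_080.png", 0.5),
    VisionDemoStep("frame_160.png", 1.0),
]
text_demo: List[TextDemoStep] = [
    TextDemoStep("approach the plate", 0.0),
    TextDemoStep("grasp and lift the plate", 0.5),
    TextDemoStep("release and retract the arm", 1.0),
]
```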
🍞 Top Bread (Hook): Think of watching a soccer game from different seats: front row vs. corner seat. It's the same play, but it looks different.
🥬 Filling (The Actual Concept): Viewpoint correspondence checks if models cope with camera changes.
- What it is: Same-view means the demo and the observation share the camera angle; cross-view means they don't.
- How it works: In same-view, matching is easy. In cross-view, you must rely on object relations, not pixel look-alikes.
- Why it matters: If a model can't handle cross-view, it overfits to appearances and fails in real life where cameras move.
🍞 Bottom Bread (Anchor): A block on a cube from the left view still means 'almost stacked' even if a top-down view looks different.
🍞 Top Bread (Hook): Sometimes a riddle has missing clues, and the honest answer is 'I don't know.'
🥬 Filling (The Actual Concept): Answerability means knowing when to abstain.
- What it is: Some observation-demo pairs don't match (different objects, edited scenes), so progress can't be defined.
- How it works: If the observation conflicts with the demo, the right output is N/A instead of any number.
- Why it matters: Without this, models make up confident but wrong numbersāunsafe for robots.
🍞 Bottom Bread (Anchor): If the demo is 'stack the blue block on the pink block' but the photo shows a white cylinder, the model should answer N/A.
The world before: Vision-language models (VLMs) were great at 'spot the noun' ('That's a cup') and even at simple reasoning about visible scenes. But they didn't handle 'how far along is this task?', which needs long-horizon thinking. Past attempts either trained task-specific regressors or sidestepped progress by ranking shuffled frames or doing pairwise comparisons. Those didn't teach the model to estimate a single percent from just one observation.
The problem: Can VLMs learn to estimate progress from a single observation and a short demo, robustly across vision vs. text demos, changed viewpoints, and unanswerable cases?
Failed attempts: Direct prediction (just output a number) tended to collapse (e.g., always 0%, 50%, or 100%), got confused by camera changes, and missed unanswerable cases.
The gap: A benchmark and a teaching method for true progress reasoning that is explicit, interpretable, and robust.
Real stakes: In homes, factories, and hospitals, knowing when a step is 20% vs. 90% done means better timing, safer handoffs, and less wasted motion. It's how robots coordinate with people and stop at just the right moment.
02 Core Idea
🍞 Top Bread (Hook): You know how you find a matching page in a comic and then imagine the few panels between what you see and what should happen next? First match, then imagine.
🥬 Filling (The Actual Concept): The key insight is a human-like, two-stage recipe: episodic retrieval (find the closest demo moment) followed by mental simulation (imagine the small transition to now) to estimate progress.
- What it is: A coarse-to-fine process: first anchor to a similar step, then fine-tune the percent.
- How it works: (1) Search demo steps for the one most like the current state. (2) Compare subtle differences (object pose, gripper contact, distance to goal) to nudge the score up or down. (3) Output a calibrated percent or N/A if nothing matches. A minimal code sketch of this loop follows after the example below.
- Why it matters: Without an anchor, models guess coarsely or collapse to defaults. Anchoring keeps estimates smooth, ordered, and explainable.
🍞 Bottom Bread (Anchor): If the demo's Step 4 shows the plate just above the rack and your photo shows it slightly higher, you say 'a bit more than Step 4's 60%, maybe 68%.'
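Here is a minimal sketch of the coarse-to-fine loop in Python. The `similarity` and `adjust` callables stand in for what the VLM does internally (matching the observation to a demo step, then reasoning about small differences), and the abstention threshold is an assumption; the paper's model does all of this in language rather than with explicit functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class DemoStep:
    content: object      # a keyframe image or an action description
    progress: float      # labeled progress in [0, 1]

def estimate_progress(
    demo: Sequence[DemoStep],
    observation: object,
    similarity: Callable[[object, object], float],   # stand-in for the VLM's matching
    adjust: Callable[[DemoStep, object], float],     # stand-in for mental simulation
    min_similarity: float = 0.3,                     # abstention threshold (assumed)
) -> Optional[float]:
    """Two-stage sketch: anchor to the closest demo step, then nudge its score."""
    # Stage 1: episodic retrieval -- find the demo step most like the observation.
    best_sim, anchor = max(
        ((similarity(step.content, observation), step) for step in demo),
        key=lambda pair: pair[0],
    )
    if best_sim < min_similarity:
        return None  # N/A: the observation does not correspond to this demo

    # Stage 2: mental simulation -- a small, local correction around the anchor.
    delta = adjust(anchor, observation)              # e.g. +0.04 if slightly further along
    return min(1.0, max(0.0, anchor.progress + delta))
```

The design point is that the final number is always tied to a retrieved anchor, so an error in the fine adjustment stays small and local instead of derailing the whole estimate.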
Three analogies:
- Trail map: First find your nearest mile marker (episodic retrieval), then pace out the extra yards (mental simulation).
- Song progress: Match the current chorus to the demo (retrieval), then count a few more beats to place the timestamp (simulation).
- Jigsaw puzzle: Place a piece near the right area (retrieval), then rotate slightly to fit perfectly (simulation).
Before vs. after:
- Before: Direct prediction often clustered at extremes, broke under camera shifts, and ignored N/A.
- After: With the two-stage method plus training, models produce smooth scores, keep correct step orderings, and abstain when mismatched.
Why it works (intuition):
- Anchors reduce the search space: comparing to one nearby step is easier than scanning the whole task timeline.
- Local reasoning is simpler: small visual differences map to small score changes, which stabilizes calibration.
- Coupling strengthens context: a good anchor guides better simulation; better simulation validates the anchorāforming a helpful feedback loop.
Building blocks (each with a mini sandwich):
- 🍞 Hook: Think of flipping through a picture book to find the most similar scene. 🥬 Episodic Retrieval: Find the demo step that best matches the current image. It narrows the timeline so you don't guess wildly. 🍞 Anchor: Pick demo Step 5 (80%) because the plate is already aligned with the rack like in your photo.
- 🍞 Hook: Imagine a few extra frames between scenes in your head. 🥬 Mental Simulation: From the chosen step, imagine how the state changes to match the photo, then adjust the percent slightly. Without it, you'd be stuck at the coarse step. 🍞 Anchor: If Step 5 is 80% and your photo shows the plate just a tad further in, predict 84%.
- 🍞 Hook: Sometimes the right move is to say 'I can't tell.' 🥬 Answerability: If no step matches (e.g., wrong object), output N/A. Without this, you risk confident nonsense. 🍞 Anchor: Bowl demo but cup photo → N/A.
- 🍞 Hook: Following a recipe card improves your cooking. 🥬 Training-Free Prompting: A structured output format (<ref_think>, <ref>, <score_think>, <score>) nudges models to reason in two stages. Helps big models a bit, but small ones may imitate the format without real gains. 🍞 Anchor: Model writes which step it matched, explains differences, then gives a score.
- 🍞 Hook: Practice plus coaching beats guessing. 🥬 Training-Based Learning: Use 25K supervised CoT samples to teach the pattern, then 20K RL samples to calibrate scores and structure. This locks in robust, generalizable behavior. 🍞 Anchor: After training, the model's scores spread smoothly across 0-100% rather than spiking at a few default values.
03 Methodology
High-level pipeline: Input (Demo + One Observation) → Stage 1: Episodic Retrieval → Stage 2: Mental Simulation → Output (Percent or N/A).
Part A: Building the benchmark (PROGRESS-BENCH) 🍞 Hook: A fair test is like a well-marked obstacle course: you learn what really trips you up. 🥬 PROGRESS-BENCH: A dataset to test single-observation progress estimation.
- What it is: About 3,325 questions from ~240 robot task trajectories, each with a short demo (vision or text) and one image to score.
- How it works:
- Demonstrations: either key frames with progress labels (vision) or action steps with labels (text).
- Observations: sample intermediate frames between demo steps; assign ground-truth progress by interpolation (sketched in code after this block).
- Viewpoints: same-view or cross-view to test robustness.
- Answerability: create mismatches (edited images or swapped text objects) so the right answer is N/A.
- Why it matters: Separates perception from timeline reasoning and uncertainty awareness. 🍞 Anchor: Demo shows 0→25→50→75→100% frames; the test image might sit at 62%. Can you tell?
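One plausible way to assign ground truth to an intermediate frame is linear interpolation between the two surrounding labeled keyframes; the sketch below assumes that construction, which may differ in detail from the benchmark's exact recipe.

```python
def ground_truth_progress(frame_index: int, keyframes: list[tuple[int, float]]) -> float:
    """Assign ground-truth progress to an intermediate frame by linear interpolation
    between the two surrounding labeled demo keyframes (one plausible construction;
    the benchmark's exact recipe may differ)."""
    for (i0, p0), (i1, p1) in zip(keyframes, keyframes[1:]):
        if i0 <= frame_index <= i1:
            t = (frame_index - i0) / (i1 - i0)
            return p0 + t * (p1 - p0)
    raise ValueError("frame lies outside the demonstrated trajectory")

# Demo keyframes labeled 0/25/50/75/100% progress, sampled at these frame indices:
keyframes = [(0, 0.0), (40, 0.25), (80, 0.50), (120, 0.75), (160, 1.0)]
print(round(ground_truth_progress(99, keyframes), 2))   # -> 0.62
```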
Part B: Metrics that capture accuracy, order, and honesty 🍞 Hook: Grading a race time needs precision, ordering, and knowing when a clock broke. 🥬 Metrics (a code sketch of each follows this list):
- Normalized Score Error (NSE): What it is: single-point percent error scaled to the progress range. Why: tests calibration at each sample. Anchor: predicting 60% against a truth of 50% yields a relative error.
- Progress Rank Correlation (PRC): What it is: does the order of your predicted scores along a trajectory match the true order? Why: progress should rise smoothly. Anchor: if your scores zig-zag wildly, PRC drops.
- Answerable False Rejection Rate (AFRR): What it is: how often you wrongly say N/A on answerable cases. Why: being too timid misses valid answers. Anchor: if many normal cases get N/A, AFRR is high.
- Unanswerable Detection Accuracy (UDA): What it is: how often you correctly say N/A on unanswerable cases. Why: safety and honesty. Anchor: mismatched object? Say N/A.
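A sketch of how these four metrics could be computed, assuming Spearman correlation for PRC and a simple range normalization for NSE (the paper's exact definitions may differ); `None` stands in for an N/A answer.

```python
from typing import Optional, Sequence
from scipy.stats import spearmanr

def nse(pred: float, truth: float, span: float = 100.0) -> float:
    """Normalized Score Error: absolute error scaled by the progress range
    (the exact normalization is an assumption)."""
    return abs(pred - truth) / span

def prc(preds: Sequence[float], truths: Sequence[float]) -> float:
    """Progress Rank Correlation: does the predicted ordering along one trajectory
    match the true ordering? Spearman's rho is one natural choice (assumed)."""
    return spearmanr(preds, truths).correlation

def afrr(preds_on_answerable: Sequence[Optional[float]]) -> float:
    """Answerable False Rejection Rate: share of answerable cases where the
    model abstained (None stands in for an N/A answer)."""
    return sum(p is None for p in preds_on_answerable) / len(preds_on_answerable)

def uda(preds_on_unanswerable: Sequence[Optional[float]]) -> float:
    """Unanswerable Detection Accuracy: share of unanswerable cases where the
    model correctly abstained."""
    return sum(p is None for p in preds_on_unanswerable) / len(preds_on_unanswerable)
```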
Part C: Training-free two-stage prompting 🍞 Hook: A worksheet that says 'show your work' can improve focus. 🥬 What it is: Force the model to output four fields: <ref_think> (why this step), <ref> (which step), <score_think> (fine-grained comparison), <score> (final percent or N/A).
- How it works: The structure nudges coarse-to-fine reasoning without changing model weights (a prompt-and-parse sketch follows below).
- Why it matters: Helps some big models preserve ordering and slightly reduce error; small ones may just mimic the format. 🍞 Anchor: 'Step 3 is closest because the plate is lifted; compared to Step 3 it's a bit higher → 58%.'
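A sketch of what the structured prompt suffix and a scoring-side parser might look like. The four tags come from the paper; the prompt wording and the regex-based parser are assumptions for illustration.

```python
import re

PROMPT_SUFFIX = """Answer in exactly this format:
<ref_think> why this demonstration step is the closest match </ref_think>
<ref> index of that step </ref>
<score_think> fine-grained comparison between that step and the observation </score_think>
<score> final progress percentage, or N/A if the observation does not match the demo </score>"""

def parse_response(text: str):
    """Extract the anchor step and the final score from the structured output.
    Returns (ref_index or None, score or None); None for the score covers both
    an explicit N/A and a malformed answer (a simplification for this sketch)."""
    ref = re.search(r"<ref>\s*(\d+)\s*</ref>", text)
    score = re.search(r"<score>\s*([\d.]+\s*%?|N/A)\s*</score>", text, re.IGNORECASE)
    ref_index = int(ref.group(1)) if ref else None
    if score is None or score.group(1).strip().upper() == "N/A":
        return ref_index, None
    return ref_index, float(score.group(1).strip().rstrip("%").strip())
```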
Part D: Training-based PROGRESSLM-45K and PROGRESSLM-3B 🍞 Hook: Practice drills plus a coach's feedback make athletes consistent under pressure. 🥬 Supervised Fine-Tuning (SFT) with Chain-of-Thought (25K):
- What it is: Teach the model to produce the two-stage reasoning, given ground-truth <ref> and <score>, and to generate the missing thinking fields coherently.
- How it works: Autoregressive training to imitate correct, structured reasoning traces across tasks disjoint from the benchmark (an illustrative training sample is sketched below).
- Why it matters: Internalizes the discipline of anchoring then adjusting, avoiding collapse to a few default scores. 🍞 Anchor: The model learns to say 'closest to Step 4 (75%), slightly behind → 70%' with a clear explanation.
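An illustrative SFT sample, mirroring the plate-in-rack walkthrough later in this section; the dictionary schema is an assumption, not the released PROGRESSLM-45K format.

```python
# One hypothetical SFT training sample (the released schema may differ):
sft_example = {
    "demo": "six labeled text steps for 'place a plate in a rack'",
    "observation": "photo of the plate held slightly above the rack",
    "target": (
        "<ref_think>The plate is lifted above the rack, which matches Step 3.</ref_think>"
        "<ref>3</ref>"
        "<score_think>The plate sits slightly higher than in Step 3 (50%), "
        "so progress is a bit beyond that label.</score_think>"
        "<score>58%</score>"
    ),
}
# Supervised fine-tuning then trains the model autoregressively to emit the full
# target trace, conditioned on the demo and the observation.
```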
🍞 Hook: A coach then tunes you for game-day decisions, not just drills. 🥬 Reinforcement Learning (20K) with GRPO-style rewards:
- What it is: Optimize for three rewards: structured formatting, accurate reference retrieval, and precise scoring.
- How it works: Generate multiple answers, score them with the reward mix (α:β:γ = 1:6:3), and update the policy to favor better anchors and calibrated percents (a reward sketch follows below).
- Why it matters: Tightens calibration, improves cross-view robustness, and boosts N/A honesty. 🍞 Anchor: The model gets more credit when it picks the right step anchor and outputs a plausible percent; over time, this sharpens both.
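A sketch of the combined scalar reward, assuming the 1:6:3 weights map to formatting, retrieval, and scoring in the order listed above; the individual term shapes (0/1 checks and a linear score penalty) are my assumptions.

```python
def progress_reward(formatted_ok: bool, ref_correct: bool,
                    pred_score: float, true_score: float,
                    alpha: float = 1.0, beta: float = 6.0, gamma: float = 3.0) -> float:
    """Scalar reward for GRPO-style updates, mixing the three signals named in the
    text (format, reference retrieval, score accuracy). Only the 1:6:3 ratio comes
    from the paper; the term definitions below are illustrative assumptions."""
    r_format = 1.0 if formatted_ok else 0.0                         # all four tags present
    r_ref = 1.0 if ref_correct else 0.0                             # anchored to the right demo step
    r_score = max(0.0, 1.0 - abs(pred_score - true_score) / 100.0)  # closer percent -> more reward
    return alpha * r_format + beta * r_ref + gamma * r_score
```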
Part E: Why this method is clever (the secret sauce)
- It turns a hard global guess into an easy local adjustment.
- It couples two useful checks, 'find a step' and 'nudge the percent', so each supports the other.
- It makes reasoning interpretable: you can read why a given percent was chosen.
- It separates 'can answer' from 'cannot answer' explicitly, improving safety.
Concrete walkthrough:
- Input: Text demo (6 steps) for 'place a plate in a rack' + one photo.
- Step A (Retrieval): Choose Step 3 ('lift the plate') because the photo shows the plate above the rack.
- Step B (Simulation): In the photo the plate is slightly higher than in Step 3's mental image, so raise progress from 50% to 58%.
- Output: <ref>=3, <score>=58% (or N/A if nothing matches); a tiny numeric version of this walkthrough follows.
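The same walkthrough reduced to numbers (purely illustrative):

```python
# Toy numbers matching the walkthrough above:
anchor_progress = 50        # Step 3 ("lift the plate") is labeled 50%
simulated_offset = 8        # plate sits slightly higher than in Step 3's mental image
prediction = anchor_progress + simulated_offset
print(f"<ref>3</ref> <score>{prediction}%</score>")   # -> <ref>3</ref> <score>58%</score>
```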
04 Experiments & Results
🍞 Hook: Imagine a science fair where everyone solves the same puzzles. Some teams rush and guess; others show neat, step-by-step work.
🥬 The Test: Evaluate 14 vision-language models on PROGRESS-BENCH to see who can estimate task progress from a single observation.
- What they measured:
- NSE (single-sample accuracy),
- PRC (ordering along a task),
- AFRR (don't wrongly abstain),
- UDA (do abstain when needed).
- Why: Progress needs precise numbers, correct ordering over time, and honesty about uncertainty.
The competition:
- Strong general models (e.g., GPT-5, large Qwen2.5-VL, Intern3.5-VL variants).
- Base 3B model vs. PROGRESSLM-3B (SFT then RL) trained on disjoint tasks (to avoid memorization).
Scoreboard with context:
- Direct prediction struggled: Many models produced spiky score distributions (like guessing only 0%, 50%, 100%), resulting in poor NSE and even negative or undefined PRC in some settings, like getting the order of steps wrong.
- Training-free prompting (two-stage format) helped large models preserve ordering (higher PRC) and sometimes reduce NSE, but small models often saw little or negative gain, like copying the worksheet format without understanding.
- Training-based PROGRESSLM-3B shone: Despite being small (3B), SFT improved structure and RL further tightened calibration. PROGRESSLM-3B-RL reached top or near-top macro averages for NSE/PRC and showed strong UDA without inflating AFRR, like consistently getting A-range scores with neat work shown.
- Vision vs. text: Vision demos were easier; text requires 'implicit state accumulation', which trips many models. PROGRESSLM reduced this gap but didn't erase it (a natural difficulty).
- Same-view vs. cross-view: Most models dropped in cross-view (camera change). PROGRESSLM-3B-RL had smaller drops, showing viewpoint-robust anchoring.
- Unanswerable cases: Many models still guessed a number. PROGRESSLM recognized N/A far more reliably, and unlike some baselines, didn't over-reject valid cases.
Surprising findings:
- Bigger isn't everything: Small but well-trained PROGRESSLM-3B could outperform or rival much larger models on progress reasoning.
- Distribution matters: Smooth, continuous predicted scores correlated with better PRC and lower NSE, while collapsed distributions explained many failures.
- Strong UDA can be misleading if AFRR is huge: One model had high N/A accuracy by saying N/A too often; honesty means knowing when to answer and when not to.
🍞 Anchor: On a plate-in-rack task, direct prediction might ping-pong between 0% and 100%. With two-stage training, PROGRESSLM retrieves the near-complete frame, adjusts slightly, and outputs ~76%, matching ground truth closely.
05 Discussion & Limitations
🍞 Hook: Even great hikers stumble on foggy trails. What are this method's fog and rocks?
🥬 Limitations:
- Text-only demos remain tough: distinguishing similar actions requires tracking hidden state across steps; errors in retrieval ripple into scoring.
- Cross-view gaps persist: while reduced for PROGRESSLM, large viewpoint shifts still challenge anchoring.
- Single-image assumption: Some real tasks might need short clips to capture motion cues explicitly.
- Domain shift: Human-in-the-wild settings degrade most models; PROGRESSLM helps but isn't magic.
- Data & compute: Building CoT traces and doing RL requires careful curation and GPU time.
Required resources:
- A base multimodal model (e.g., 3B class), 25K SFT CoT pairs, 20K RL prompts, and RL infrastructure (e.g., GRPO-style training with distributed serving).
When NOT to use:
- Tasks with no clear progress notion (open-ended art) or where the observation omits critical state (e.g., occluded key objects) and you cannot verify answerability.
- Scenarios demanding frame-by-frame control signals rather than a single percent estimate.
Open questions:
- Can we close the text-vs-vision gap with better language-to-state memory or hybrid demos (images + sparse text)?
- How to generalize over bigger camera changes (e.g., novel 3D viewpoints) without 3D models?
- Can we learn a unified uncertainty head that balances UDA and AFRR optimally?
- What minimal video context (2-3 frames?) gives big gains over single images?
- How to transfer to real robots with noisy sensors and changing lighting reliably?
🍞 Anchor: If your instructions say 'stack the bowls' but the photo shows cups on the floor, PROGRESSLM says N/A reliably; but if the camera is diagonally rotated and lighting changes, we still want even steadier anchoring.
06 Conclusion & Future Work
Three-sentence summary:
- This paper reframes 'how far along is the task?' as explicit, two-stage progress reasoning: first anchor to a similar demo step (episodic retrieval), then make a small, explainable adjustment (mental simulation).
- A new benchmark, PROGRESS-BENCH, shows that most current VLMs struggle across modalities, viewpoints, and unanswerable cases, while training-free prompting helps only conditionally.
- With a 45K-sample training recipe (25K SFT CoT + 20K RL), PROGRESSLM-3B achieves strong, robust, and honest progress estimates even at small scale.
Main achievement:
- Turning progress estimation into an interpretable, coupled reasoning skill that generalizes across new tasks and cameras, rather than a brittle, black-box number guess.
Future directions:
- Blend visual and textual demos smartly, add lightweight video context, and learn viewpoint-invariant anchors; further refine uncertainty calibration to balance UDA and AFRR.
Why remember this:
- It's a blueprint for teaching AI to reason about processes, not just snapshots, bringing robots closer to reliable, human-like understanding of 'how far along' a job really is.
Practical Applications
- Robot monitoring in factories: estimate step completion to trigger the next operation at the right time.
- Home assistance: know when a task like 'place dishes in the rack' is nearly done to avoid breakage or double work.
- Surgical or medical tool handling (simulation): estimate progress of instrument placement for training feedback.
- Warehouse picking and placing: time handoffs between robots or between robot and human based on progress percent.
- Quality control: flag stalled or regressed steps when the progress order breaks (low PRC) on an assembly line.
- Human-robot collaboration: communicate simple status like 'about 70% done' for better coordination.
- Autonomous labs: monitor experiment steps (e.g., pipette to rack) and decide when to move to the next stage.
- Education and training: teach trainees with visual/text demos and provide calibrated progress feedback.
- Video analytics: summarize how far along an activity is (e.g., cooking, maintenance) from a single frame plus a demo.
- Simulation and planning: use progress estimates to checkpoint and resume long-horizon tasks reliably.