QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Key Summary
- QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.
- It asks models to estimate size, speed, or acceleration in real units (like meters or m/s) after giving just one true fact called a prior.
- The benchmark covers four kinds of tasks: 2D vs. 3D motion and static (size) vs. dynamic (speed/acceleration) priors.
- QuantiPhy scores answers with Mean Relative Accuracy (MRA), which rewards being close to the right number, not just exactly right.
- Across 21 models, even the best ones often sound reasonable but miss the exact numbers; humans averaged 55.6% MRA, while the top model got 53.1%.
- Models often rely on memorized world knowledge instead of really reading the video or the given prior, especially under counterfactual tests.
- Giving step-by-step prompts (chain-of-thought) rarely helped; small mistakes made early tend to grow bigger later.
- Background details and extra objects sometimes helped models by giving more reference clues for scale and motion.
- QuantiPhy is open and standardized, so different models can be compared fairly with the same prompts and scoring.
- This benchmark aims to push AI from "sounds right" to "measures right," which matters for robots, AR/VR, and safe autonomy.
Why This Research Matters
Real-world systems need numbers, not just nice descriptions. A car needs to know how fast the bike ahead is moving to brake safely, and a robot needs to measure how far to reach for a cup. QuantiPhy checks whether AI models can turn videos into accurate sizes, speeds, and accelerations when given one true anchor. This matters for safety in autonomous driving, realism in AR/VR, and reliability in home and warehouse robots. It also helps catch models that sound convincing but ignore the actual video or the user's input numbers. By making testing fair and numerical, QuantiPhy pushes AI toward trustworthy, physics-aware behavior.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're timing how fast your friend runs across the playground using your phone video. You don't just guess; you count footsteps, measure distance, and divide by time. That's turning a video into numbers.
The Concept (Vision-Language Models, VLMs): VLMs are AIs that look at images or videos and talk about them. How it works: 1) See the picture/video, 2) Read the question, 3) Use learned patterns to answer in words. Why it matters: If VLMs only talk in general terms (like "the car is fast") but can't measure ("the car is 6.2 m/s"), they can't safely help in real-world tasks.
Anchor: A VLM might describe "a person crossing the street," but QuantiPhy asks, "How fast are they moving in m/s at 2.0 s?"
Hook: You know how in multiple-choice quizzes, two wrong answers can be "a little wrong" or "way wrong"? Treating them the same isn't fair.
The Concept (VQA vs. Quantitative Testing): VQA is about picking or saying a descriptive answer. Quantitative testing checks numeric accuracy. How it works: 1) Ask for a number with units, 2) Compare to ground truth, 3) Score how close you are. Why it matters: Saying "31 m" instead of "3 m" for a car length should be penalized a lot more than "3.1 m."
Anchor: In class, 3.1 m would get you partial credit, 31 m would not. QuantiPhy gives that kind of fair scoring to AI.
Hook: Imagine watching a soccer ball on video and wanting to know its speed. You see pixels move, but you need meters.
The Concept (Pixel Space vs. World Space): Pixel space is what you measure on the screen; world space is real units (m, m/s). How it works: 1) Track motion in pixels, 2) Use one true real-world prior (like the ball's size) to convert pixels to meters, 3) Scale all other measurements. What breaks without it: Without a known scale, pixels can't become meters; everything stays vague.
Anchor: If a car is 135 px long and you know it's 5.67 m in real life, then 1 px ≈ 0.042 m; now pixel speeds become real speeds.
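A minimal sketch of this pixel-to-world conversion in Python; the function name and the pedestrian measurement are illustrative, not taken from the benchmark's code:

```python
# Pixel-to-world scaling from one known prior (illustrative sketch).

def scale_factor(prior_world_m: float, prior_pixels: float) -> float:
    """Meters per pixel, derived from a single trusted real-world measurement."""
    return prior_world_m / prior_pixels

# The car from the example above: 5.67 m long, 135 px on screen.
meters_per_px = scale_factor(5.67, 135.0)               # ~0.042 m/px

# Any other pixel measurement in the same frame can now be converted.
pedestrian_height_px = 40.0                             # hypothetical measurement
print(f"{pedestrian_height_px * meters_per_px:.2f} m")  # ~1.68 m
```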
Hook: Think of a treasure map that gives you one clue that unlocks all the others.
The Concept (Physical Prior): A prior is one trustworthy fact (size, speed at a moment, or acceleration at a moment). How it works: 1) Provide one prior in world units, 2) Match it to the same thing in pixels, 3) Compute a scale factor, 4) Apply it to get the asked quantity. Why it matters: One anchor turns guesses into measurements.
Anchor: Know the coin's diameter = 2.4 cm. Measure the coin in the video = 60 px. Now 1 px = 0.04 cm; the robot can size other objects.
Hook: If you film a skateboarder coming straight across your view, it's 2D-ish; if they move toward you, it's 3D and trickier.
The Concept (2D vs. 3D Motion): 2D assumes flat, same depth; 3D includes changing depth. How it works: 1) 2D: constant depth, simple scaling; 2) 3D: add depth info to scale properly. Why it matters: Without handling depth, speeds/lengths look wrong when objects move toward/away from the camera.
Anchor: A ball rolling left to right (2D) is simpler than a drone flying toward the camera (3D).
Hook: Report cards don't just mark "right/wrong"; they summarize how close you are to mastery.
The Concept (Mean Relative Accuracy, MRA): MRA measures how close a model's number is to the truth across several tolerance levels. How it works: 1) Compute relative error, 2) Check if it's within various tight-to-loose thresholds, 3) Average the passes. Why it matters: It rewards being "close enough" and punishes huge mistakes more.
Anchor: Guessing 3.1 m for a 3 m object scores much higher than 31 m.
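To make the anchor concrete, here is the relative-error arithmetic behind it (plain Python, no benchmark-specific details assumed):

```python
# Relative error for the two guesses in the anchor above (ground truth = 3.0 m).
truth = 3.0
for guess in (3.1, 31.0):
    rel_err = abs(guess - truth) / abs(truth)
    print(f"guess {guess:>4} m -> relative error {rel_err:.1%}")
# guess  3.1 m -> relative error 3.3%   (passes even tight tolerances)
# guess 31.0 m -> relative error 933.3% (fails every tolerance level)
```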
Hook: Before QuantiPhy, many tests rewarded "sounding right"; the world needs "measuring right."
The Concept (Quantitative Benchmarking): It's a standardized way to check numeric performance. How it works: 1) Same prompts and priors for all models, 2) Parse numeric outputs with units, 3) Score with MRA, 4) Compare on a leaderboard. Why it matters: Fair, apples-to-apples comparisons push progress.
Anchor: Like running the same 100 m race on the same track and timing everyone with the same stopwatch.
The world before: VLMs impressed us with descriptions, stories, and multiple-choice answers, but nobody knew if they could compute real numbers from real motion. The problem: Natural scenes rarely include camera parameters or rulers; without a scale, pixels can't become meters. Failed attempts: Qualitative VQA couldn't tell "a bit off" from "way off" and didn't stress input-faithful measuring. The gap: A rigorous, numeric, video-focused test that turns one known prior into all needed quantities. Real stakes: Robots, AR glasses, and self-driving cars must measure, not merely describe; safety and usefulness depend on numbers.
02 Core Idea
Hook: You know how one LEGO piece with the right shape lets you lock a whole build together? One true number in a video can lock all the other numbers into place.
The Concept (The Aha! Moment): Give the model one real-world anchor (a prior) and ask it to scale pixel measurements into meters, m/s, or m/s^2, then grade how close it gets. How it works: 1) Track pixels, 2) Match the prior in pixels to its real value, 3) Compute a scale factor, 4) Convert the target quantity to real units, 5) Score with MRA. Why it matters: This turns "looks right" into "measures right," revealing whether models truly read videos and use the given prior.
Anchor: Know a car's length = 5.67 m; at 2.0 s, its center moved 120 px in one second. If 1 px = 0.042 m, then speed ≈ 5.0 m/s.
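The five steps above fit in a few lines of Python. This is only a sketch of the arithmetic using the anchor's numbers; the function name and signature are illustrative, not the benchmark's code:

```python
# One prior -> scale factor -> world-space answer (steps 1-5 above).

def estimate_speed_mps(prior_length_m: float, prior_length_px: float,
                       displacement_px: float, elapsed_s: float) -> float:
    meters_per_px = prior_length_m / prior_length_px  # steps 2-3: calibrate the scale
    pixel_speed = displacement_px / elapsed_s         # step 1: motion measured in pixels
    return pixel_speed * meters_per_px                # step 4: convert to world units

speed = estimate_speed_mps(prior_length_m=5.67, prior_length_px=135.0,
                           displacement_px=120.0, elapsed_s=1.0)
print(f"{speed:.1f} m/s")  # ~5.0 m/s; step 5 is grading this number with MRA
```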
Three analogies:
- Map legend: If the legend says 1 cm = 1 km, any line you measure becomes a real distance. The prior is that legend.
- Recipe scaling: Double the flour, double the cookies. Change the prior, and all results should scale with it.
- Shoe size ruler: Stand on the foot scale once, then all toe marks become centimeters. One calibration unlocks everything else.
Before vs. After:
- Before: VLMs earned points for plausible talk and multiple-choice picks; big numeric mistakes could hide behind fancy words.
- After: Models must compute numbers from videos and a single anchor, and their closeness is scored. The mask comes off: either you measure or you don't.
Hook: Imagine trying to count steps in a dance while someone keeps changing the music's speed; you must follow the beat you're given.
The Concept (Counterfactual Priors): A counterfactual prior purposely changes the anchor (like making a car 1,000× longer). How it works: 1) Provide an altered prior, 2) The correct answer must scale accordingly, 3) Check if the model follows the new beat. Why it matters: If a model ignores the given number and sticks to "what cars usually are," it's guessing, not measuring.
Anchor: If a ball's diameter is said to be 0.5 m (not 0.25 m), all distances and speeds in meters should double. If they don't, the model didn't follow the input.
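A tiny consistency check captures the idea: because size and speed answers are linear in the scale factor, multiplying the stated prior by k should multiply a faithful answer by k. A hypothetical sketch:

```python
# Counterfactual-prior check: a faithful answer scales with the stated prior.

def expected_answer(baseline_answer: float, prior_multiplier: float) -> float:
    # Lengths and speeds are linear in the pixel-to-world scale factor,
    # so scaling the prior by k scales the correct answer by k.
    return baseline_answer * prior_multiplier

baseline_speed = 5.0  # m/s, computed with the original prior
for k in (0.1, 5.0, 100.0):
    print(f"prior x{k:g} -> expected speed {expected_answer(baseline_speed, k):g} m/s")
# A model that keeps answering ~5 m/s for all three priors is ignoring the input.
```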
Why it works (intuition without equations): Videos give you shape and motion in pixels. One trustworthy real-world fact creates a bridge from pixels to meters. Once the bridge is built, every other quantity can cross it. If the bridge (the prior) moves, all the traffic (the answers) must move with it.
Building blocks:
- Physical prior (size or time-stamped speed/acceleration) to set the scale.
- Pixel-to-world scaling to convert everything else.
- 2D vs. 3D handling (constant depth vs. depth changes with extra depth cues).
- Standardized prompts that demand "number + unit" only.
- MRA scoring that rewards being close and punishes huge misses.
- Diagnostic stress-tests: prior-only (remove video), counterfactual prior (change the anchor), and chain-of-thought (force step-by-step) to see if models truly measure.
Anchor: QuantiPhy turns a playground video plus one real fact (the track lane is 1.22 m wide) into a math lab where the AI must compute a runner's speed accurately.
03 Methodology
At a high level: Input (video + one prior) → Pixel measurement → Compute scale factor → Convert target quantity to world units → Output a single number with unit → Score with MRA.
Hook: Think of it like baking: one good measuring cup (the prior) makes all your ingredient amounts correct.
The Concept (The Recipe):
- 1) Collect videos with moving objects (2–3 s long) from simulations, lab captures, and internet clips.
- 2) Provide exactly one physical prior in text: size (static) or a time-stamped speed/acceleration (dynamic).
- 3) Ask a question about size, speed, or acceleration (possibly for a different object in the clip).
- 4) Require the model's final answer to be "number + unit" only.
- 5) Parse the number, compare with ground truth, and compute MRA.
- What breaks without it: Without the prior, pixels never become meters; without strict parsing, you can't grade; without MRA, tiny errors and huge errors get mixed unfairly.
Anchor: "Given the coin's diameter is 2.4 cm, what is the toy car's speed at 1.5 s in cm/s?"
Step-by-step with concrete data:
- Example: A yellow car moves sideways in a 30 fps clip. Prior: "Car length = 5.67 m." Measured pixel length on the frame near 2.0 s: 135 px. Scale factor γ ≈ 5.67 / 135 ≈ 0.042 m/px. Next, track the car center: frame 59 → x = 400 px; frame 60 → x = 430 px. Pixel velocity ≈ (430 - 400) / (1/30) = 30 px × 30 fps = 900 px/s (illustrative). World velocity ≈ 900 × 0.042 ≈ 37.8 m/s (if that's inconsistent with the actual ground-truth trajectory, the MRA will reflect it). Why this step exists: It shows the exact bridge from pixels to meters.
Anchor: If a basketball diameter is 0.24 m and its pixel diameter is 60 px, then 1 px = 0.004 m. If it moves 45 px between frames at 30 fps: speed = (45 px × 0.004 m/px) × 30 ≈ 5.4 m/s.
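The same arithmetic, written as a small Python helper that reproduces both worked examples (the helper name is ours, not the benchmark's):

```python
# Speed from per-frame pixel motion, reproducing the two worked examples above.

def speed_from_frames(prior_m: float, prior_px: float,
                      px_per_frame: float, fps: float) -> float:
    meters_per_px = prior_m / prior_px
    return px_per_frame * fps * meters_per_px  # (px/frame) * (frames/s) * (m/px)

# Yellow car: 5.67 m spans 135 px; its center moves 30 px between 30 fps frames.
print(f"car:        {speed_from_frames(5.67, 135, 30, 30):.1f} m/s")  # ~37.8 m/s
# Basketball: 0.24 m spans 60 px; it moves 45 px between frames at 30 fps.
print(f"basketball: {speed_from_frames(0.24, 60, 45, 30):.1f} m/s")   # ~5.4 m/s
```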
Hook: Flat skatepark vs. ramps with hills: 2D is flat, 3D has depth.
The Concept (2D vs. 3D Categories): 2D assumes constant depth (x-y motion only), 3D allows depth change (z motion). How it works: 1) 2D: one scale works frame-to-frame. 2) 3D: add depth info (from lab capture or annotations) so near/far changes rescale correctly. What breaks without it: A drone flying toward the camera will look "too big/fast" if depth isn't handled.
Anchor: A ball rolling left-to-right on a table (2D) vs. a toy car driving toward the camera (3D with changing scale).
Hook: Knowing what to measure is half the battle.
The Concept (Static vs. Dynamic Priors): Static = a fixed size (e.g., bottle height). Dynamic = speed or acceleration at a given time (e.g., the speed at 1.5 s). How it works: 1) Static: set γ from size, then convert velocities/accelerations. 2) Dynamic: match a known speed or acceleration at a timestamp to the pixel version to get γ. What breaks without it: You can't lock the pixel-to-world bridge at the needed moment.
Anchor: If at 1.5 s the skateboard's speed is 2.0 m/s and you measure 50 px/s, then 1 px = 0.04 m; now you can compute size or acceleration in meters.
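A sketch of the dynamic-prior case in Python, using the skateboard numbers; the deck length and pixel-space acceleration are made-up values for illustration:

```python
# Dynamic prior: calibrate the scale from a known speed at a timestamp,
# then convert any other pixel-space quantity. Illustrative numbers only.
known_speed_mps = 2.0       # prior: skateboard speed at t = 1.5 s
measured_speed_px_s = 50.0  # the same speed measured in pixels per second
meters_per_px = known_speed_mps / measured_speed_px_s  # 0.04 m/px

deck_length_px = 20.0       # hypothetical pixel measurement of the deck
accel_px_s2 = 75.0          # hypothetical pixel-space acceleration
print(f"deck length  ~ {deck_length_px * meters_per_px:.2f} m")   # 0.80 m
print(f"acceleration ~ {accel_px_s2 * meters_per_px:.1f} m/s^2")  # 3.0 m/s^2
```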
Prompting and parsing:
- Standardized prompt: include the prior text, ask for a specific quantity/time, and instruct "output only the number and unit."
- Parsing: extract the first valid number + unit; if none appears, count the attempt as a failure. Up to five tries per item are allowed to reduce flaky outputs.
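A rough approximation of that parsing step; the regex and examples here are our own simplification, not QuantiPhy's exact rules:

```python
import re

# Extract the first "number + unit" pair from a model's reply (simplified sketch).
ANSWER_RE = re.compile(r"(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z][A-Za-z0-9/^]*)")

def parse_answer(text: str):
    """Return (value, unit) for the first number+unit found, or None on failure."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)

print(parse_answer("5.4 m/s"))                       # (5.4, 'm/s')
print(parse_answer("The speed is about 37.8 m/s."))  # (37.8, 'm/s')
print(parse_answer("pretty fast, I think"))          # None -> counts as a failed try
```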
Scoring with MRA:
- Compute the relative error |pred - truth| / |truth|. Check whether it falls within a series of tolerance thresholds, from tight (e.g., 5%) to loose. Average these pass/fail checks to get MRA. This balances tolerance for measurement noise with punishment for big misses.
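A sketch of the metric as described above. It assumes the commonly used tolerance ladder of 5% steps up to 50% (the standard Mean Relative Accuracy formulation); the benchmark's exact thresholds may differ:

```python
# Mean Relative Accuracy for one prediction (sketch; threshold set is an assumption).

def mean_relative_accuracy(pred: float, truth: float) -> float:
    rel_err = abs(pred - truth) / abs(truth)
    tolerances = [0.05 * i for i in range(1, 11)]  # 5%, 10%, ..., 50%
    passes = [rel_err <= tol for tol in tolerances]
    return sum(passes) / len(passes)

print(mean_relative_accuracy(3.1, 3.0))   # 1.0 -> within every tolerance level
print(mean_relative_accuracy(3.5, 3.0))   # 0.7 -> misses the tightest levels
print(mean_relative_accuracy(31.0, 3.0))  # 0.0 -> off by an order of magnitude
```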
Secret sauce (diagnostic probes):
- Prior-only: Remove the video, keep the prior and question. If scores stay high, the model is likely guessing from world knowledge.
- Counterfactual prior: Multiply the prior by factors like 0.1×, 5×, 100×. If the model is truly scaling, its answers should change by the same factor; if not, it's ignoring the input.
- Chain-of-thought (CoT): Force a four-step path, (I) measure the source in pixels, (II) compute the scale, (III) measure the target in pixels, (IV) report the world value, to see if stepwise guidance helps.
Anchor: It's like testing a student three ways: with the full lab kit (video + prior), with only the formula (prior-only), and by making them show every calculation step (CoT).
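One way such probe variants could be generated from a single benchmark item; the prompt wording and counterfactual factor below are hypothetical, not QuantiPhy's actual templates:

```python
# Hypothetical construction of the three probes for one item (not the real templates).

def make_probes(video_path: str, prior_text: str, question: str, prior_value: float):
    base = f"{prior_text} {question} Output only the number and unit."
    return {
        # Full task: video + prior + question.
        "video_plus_prior": {"video": video_path, "prompt": base},
        # Prior-only: same text, no video; high scores here suggest guessing.
        "prior_only": {"video": None, "prompt": base},
        # Counterfactual: rescale the stated prior; a faithful answer must rescale too.
        "counterfactual_x100": {
            "video": video_path,
            "prompt": base.replace(f"{prior_value:g}", f"{prior_value * 100:g}"),
        },
    }

probes = make_probes("car_clip.mp4", "The car is 5.67 m long.",
                     "What is the car's speed at 2.0 s in m/s?", 5.67)
for name, probe in probes.items():
    print(f"{name}: {probe['prompt']}")
```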
04 Experiments & Results
The test: Measure how close 21 VLMs get to the true number on 3,355 video-text pairs across four categories: 2D-Static, 2D-Dynamic, 3D-Static, 3D-Dynamic. The metric is Mean Relative Accuracy (MRA), which rewards being "close enough" across multiple thresholds.
The competition: 6 proprietary models (e.g., ChatGPT-5.1, Gemini-2.5 Pro/Flash, Grok-4.1, Claude-4.5 Sonnet) and 15 open-weight models (e.g., the Qwen3-VL family, the InternVL-3.5 family, Molmo-7B, Phi-4 Multimodal, etc.). A human baseline study gives a reference for typical human performance without pixel tools.
The scoreboard with context:
- Humans average 55.6% MRA overall, which is good given they can't read pixels exactly.
- Best model: ChatGPT-5.1 at 53.1%, close to humans but not surpassing them on average. Gemini-2.5 Pro at 49.6%, Flash at 48.6%, Grok-4.1 at 45.0%.
- Best open-weight: Qwen3-VL-Instruct-32B at 46.0%; InternVL-3.5-30B at 40.7%.
- Scaling helps: bigger models in the same family usually score higher, especially on dynamic tasks (where time and motion matter), but still don't fully close the gap to humans.
Surprising findings:
- Prior-only vs. Video+Prior: Removing the video often doesn't hurt much; models can guess okay using world knowledge (like "cars are ~2 m wide"). That's a red flag for real measurement.
- Counterfactual prior: When the prior is multiplied by 0.1×, 5×, or 100×, most models fail to scale their answers accordingly; MRA drops sharply (often ~70–80%). This shows they're not faithfully using the provided number.
- Chain-of-thought prompting: For most models, forcing step-by-step reasoning did not help and sometimes hurt; small early errors in pixel reading or scaling get amplified. A few models improved, but it wasn't a general fix.
Scene context effects:
- Background complexity: Mild effects overall; interestingly, realistic complex scenes sometimes help by offering extra visual rulers (tiles, windows, lane markings).
- Number of moving objects: Multiple objects generally improve performance (more references to compare sizes/speeds).
Big picture: Even top models hover around human-like MRA instead of surpassing it, despite having exact pixel access in principle. This means current VLMs lean heavily on prior knowledge and verbal plausibility instead of building reliable pixel-to-world bridges when asked for numbers.
Analogy for the results: It's like students giving nice-sounding science explanations but failing the measurement lab; they talk well but don't always use the ruler you handed them.
05 Discussion & Limitations
Limitations:
- Translational motion only: No rotations yet (spins, turns), and no deformable/soft bodies.
- Fixed cameras: Real life often has moving cameras, zoom, or jitter; handling that is future work.
- Rigid-object bias: People, flags, and jellyfish-like motion can break the "rigid" assumption.
- Simplified interactions: Mostly isolated motions, not crowded physics puzzles with many contacts.
Required resources:
- To run the benchmark: access to model APIs or GPUs for open-weight models, video decoding, and a few GB of storage/time for 3.3K instances.
- For 3D understanding: some models may benefit from auxiliary depth or optical flow tools; these are not required by QuantiPhy but are useful in research.
When NOT to use:
- If your scenario depends on rotations, camera ego-motion, or fluid/soft-body behavior, QuantiPhy scores won't fully reflect your needs.
- If your model only outputs text descriptions without numeric precision, it will underperform by design.
Open questions:
- How to enforce input faithfulness so models must use the given prior and the actual video, not just memories of "typical" sizes and speeds?
- Can physics-aware training (synthetic scenes with precise labels, optical flow, or differentiable physics) boost quantitative reasoning?
- What prompting or tool-use patterns make CoT actually help rather than amplify early mistakes?
- How to robustly handle realistic camera motion and rotations while keeping evaluation fair and scalable?
- Can we build better intermediate checks (like requiring the model to output pixel measurements first) that correlate with final accuracy and faithfulness?
Honest take: QuantiPhy shows that today's VLMs often act like "smart guessers" rather than "careful measurers." Bridging that gap is essential for trustworthy embodied AI.
06 Conclusion & Future Work
Three-sentence summary: QuantiPhy is a standardized, quantitative benchmark that tests whether VLMs can turn videos plus one real-world prior into accurate numbers for size, speed, and acceleration. Across 21 models, even the best systems tend to rely on memorized world knowledge rather than faithfully using the video and the given prior, especially under counterfactual tests. The MRA-based leaderboard and diagnostic probes reveal a clear path forward: build models that measure, not just describe.
Main achievement: Turning kinematic reasoning into a rigorous pixel-to-world scaling problem, with one anchor prior, strict numeric outputs, and graded MRA scoring, so we can finally assess whether models compute or just conjecture.
Future directions:
- Expand motion types (rotations, deformables), camera motion, and multi-object interactions.
- Encourage input-faithful reasoning with physics-informed training, tool-augmented pipelines (e.g., optical flow + trackers), and better prompt/format constraints.
- Design stepwise evaluations that reward correct intermediate measurements to reduce error cascades.
Why remember this: QuantiPhy moves AI from "sounds right" to "measures right," a crucial step for robots, AR/VR, and autonomy, where numbers, not narratives, keep people safe and systems reliable.
Practical Applications
- Benchmark a new VLM's physical measurement skill before deploying it on robots.
- Stress-test input faithfulness by swapping priors (counterfactuals) and checking if outputs scale correctly.
- Augment model training with pixel-to-world scaling tasks derived from QuantiPhy items.
- Use QuantiPhy to validate AR apps that overlay distances/speeds, ensuring numbers match reality.
- Evaluate video generation realism by measuring if synthetic motions obey consistent kinematics.
- Compare open-weight vs. proprietary models fairly with the same prompts and MRA scoring.
- Prototype tool-augmented pipelines (tracking, optical flow) and measure gains on QuantiPhy.
- Design prompts that force numeric outputs and units, and track parsing success rates.
- Run prior-only ablations to detect over-reliance on world knowledge rather than video evidence.
- Use scene variants (simple vs. complex, single vs. multiple objects) to diagnose failure modes.