QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Key Summary
- QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.
- It asks models to estimate size, speed, or acceleration in real units (like meters or m/s) after giving just one true fact called a prior.
- The benchmark covers four kinds of tasks: 2D vs. 3D motion and static (size) vs. dynamic (speed/acceleration) priors.
- QuantiPhy scores answers with Mean Relative Accuracy (MRA), which rewards being close to the right number, not just exactly right.
- Across 21 models, even the best ones often sound reasonable but miss the exact numbers; humans averaged 55.6% MRA, while the top model got 53.1%.
- Models often rely on memorized world knowledge instead of really reading the video or the given prior, especially under counterfactual tests.
- Giving step-by-step prompts (chain-of-thought) rarely helped; small mistakes made early tend to grow bigger later.
- Background details and extra objects sometimes helped models by giving more reference clues for scale and motion.
- QuantiPhy is open and standardized, so different models can be compared fairly with the same prompts and scoring.
- This benchmark aims to push AI from "sounds right" to "measures right," which matters for robots, AR/VR, and safe autonomy.
Why This Research Matters
Real-world systems need numbers, not just nice descriptions. A car needs to know how fast the bike ahead is moving to brake safely, and a robot needs to measure how far to reach for a cup. QuantiPhy checks whether AI models can turn videos into accurate sizes, speeds, and accelerations when given one true anchor. This matters for safety in autonomous driving, realism in AR/VR, and reliability in home and warehouse robots. It also helps catch models that sound convincing but ignore the actual video or the user's input numbers. By making testing fair and numerical, QuantiPhy pushes AI toward trustworthy, physics-aware behavior.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're timing how fast your friend runs across the playground using your phone video. You don't just guess; you count footsteps, measure distance, and divide by time. That's turning a video into numbers.
The Concept (Vision-Language Models, VLMs): VLMs are AIs that look at images or videos and talk about them. How it works: 1) See the picture/video, 2) Read the question, 3) Use learned patterns to answer in words. Why it matters: If VLMs only talk in general terms (like "the car is fast") but can't measure ("the car is 6.2 m/s"), they can't safely help in real-world tasks.
Anchor: A VLM might describe "a person crossing the street," but QuantiPhy asks, "How fast are they moving in m/s at 2.0 s?"
Hook: You know how in multiple-choice quizzes, two wrong answers can be "a little wrong" or "way wrong"? Treating them the same isn't fair.
The Concept (VQA vs. Quantitative Testing): VQA is about picking or saying a descriptive answer. Quantitative testing checks numeric accuracy. How it works: 1) Ask for a number with units, 2) Compare to ground truth, 3) Score how close you are. Why it matters: Saying "31 m" instead of "3 m" for a car length should be penalized a lot more than "3.1 m."
Anchor: In class, 3.1 m would get you partial credit, 31 m would not. QuantiPhy gives that kind of fair scoring to AI.
Hook: Imagine watching a soccer ball on video and wanting to know its speed. You see pixels move, but you need meters.
The Concept (Pixel Space vs. World Space): Pixel space is what you measure on the screen; world space is real units (m, m/s). How it works: 1) Track motion in pixels, 2) Use one true real-world prior (like the ball's size) to convert pixels to meters, 3) Scale all other measurements. What breaks without it: Without a known scale, pixels can't become meters; everything stays vague.
Anchor: If a car is 135 px long and you know it's 5.67 m in real life, then 1 px ≈ 0.042 m; now pixel speeds become real speeds.
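A minimal sketch of this pixel-to-world conversion in Python; the function name and the pedestrian measurement are illustrative, not taken from the benchmark's code:

```python
# Pixel-to-world scaling from one known prior (illustrative sketch).

def scale_factor(prior_world_m: float, prior_pixels: float) -> float:
    """Meters per pixel, derived from a single trusted real-world measurement."""
    return prior_world_m / prior_pixels

# The car from the example above: 5.67 m long, 135 px on screen.
meters_per_px = scale_factor(5.67, 135.0)               # ~0.042 m/px

# Any other pixel measurement in the same frame can now be converted.
pedestrian_height_px = 40.0                             # hypothetical measurement
print(f"{pedestrian_height_px * meters_per_px:.2f} m")  # ~1.68 m
```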
Hook: Think of a treasure map that gives you one clue that unlocks all the others.
The Concept (Physical Prior): A prior is one trustworthy fact (size, speed at a moment, or acceleration at a moment). How it works: 1) Provide one prior in world units, 2) Match it to the same thing in pixels, 3) Compute a scale factor, 4) Apply it to get the asked quantity. Why it matters: One anchor turns guesses into measurements.
Anchor: Know the coin's diameter = 2.4 cm. Measure the coin in the video = 60 px. Now 1 px = 0.04 cm; the robot can size other objects.
Hook: If you film a skateboarder coming straight across your view, it's 2D-ish; if they move toward you, it's 3D and trickier.
The Concept (2D vs. 3D Motion): 2D assumes flat, same depth; 3D includes changing depth. How it works: 1) 2D: constant depth, simple scaling; 2) 3D: add depth info to scale properly. Why it matters: Without handling depth, speeds/lengths look wrong when objects move toward/away from the camera.
Anchor: A ball rolling left to right (2D) is simpler than a drone flying toward the camera (3D).
Hook: Report cards don't just mark "right/wrong"; they summarize how close you are to mastery.
The Concept (Mean Relative Accuracy, MRA): MRA measures how close a model's number is to the truth across several tolerance levels. How it works: 1) Compute relative error, 2) Check if it's within various tight-to-loose thresholds, 3) Average the passes. Why it matters: It rewards being "close enough" and punishes huge mistakes more.
Anchor: Guessing 3.1 m for a 3 m object scores much higher than 31 m.
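To make the anchor concrete, here is the relative-error arithmetic behind it (plain Python, no benchmark-specific details assumed):

```python
# Relative error for the two guesses in the anchor above (ground truth = 3.0 m).
truth = 3.0
for guess in (3.1, 31.0):
    rel_err = abs(guess - truth) / abs(truth)
    print(f"guess {guess:>4} m -> relative error {rel_err:.1%}")
# guess  3.1 m -> relative error 3.3%   (passes even tight tolerances)
# guess 31.0 m -> relative error 933.3% (fails every tolerance level)
```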
Hook: Before QuantiPhy, many tests rewarded "sounding right"; the world needs "measuring right."
The Concept (Quantitative Benchmarking): It's a standardized way to check numeric performance. How it works: 1) Same prompts and priors for all models, 2) Parse numeric outputs with units, 3) Score with MRA, 4) Compare on a leaderboard. Why it matters: Fair, apples-to-apples comparisons push progress.
Anchor: Like running the same 100 m race on the same track and timing everyone with the same stopwatch.
The world before: VLMs impressed us with descriptions, stories, and multiple-choice answers, but nobody knew if they could compute real numbers from real motion. The problem: Natural scenes rarely include camera parameters or rulers; without a scale, pixels can't become meters. Failed attempts: Qualitative VQA couldn't tell "a bit off" from "way off" and didn't stress input-faithful measuring. The gap: A rigorous, numeric, video-focused test that turns one known prior into all needed quantities. Real stakes: Robots, AR glasses, and self-driving cars must measure, not merely describe; safety and usefulness depend on numbers.
02 Core Idea
Hook: You know how one LEGO piece with the right shape lets you lock a whole build together? One true number in a video can lock all the other numbers into place.
The Concept (The Aha! Moment): Give the model one real-world anchor (a prior) and ask it to scale pixel measurements into meters, m/s, or m/s^2, then grade how close it gets. How it works: 1) Track pixels, 2) Match the prior in pixels to its real value, 3) Compute a scale factor, 4) Convert the target quantity to real units, 5) Score with MRA. Why it matters: This turns "looks right" into "measures right," revealing whether models truly read videos and use the given prior.
Anchor: Know a car's length = 5.67 m; at 2.0 s, its center moved 120 px in one second. If 1 px = 0.042 m, then speed ≈ 5.0 m/s.
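The five steps above fit in a few lines of Python. This is only a sketch of the arithmetic using the anchor's numbers; the function name and signature are illustrative, not the benchmark's code:

```python
# One prior -> scale factor -> world-space answer (steps 1-5 above).

def estimate_speed_mps(prior_length_m: float, prior_length_px: float,
                       displacement_px: float, elapsed_s: float) -> float:
    meters_per_px = prior_length_m / prior_length_px  # steps 2-3: calibrate the scale
    pixel_speed = displacement_px / elapsed_s         # step 1: motion measured in pixels
    return pixel_speed * meters_per_px                # step 4: convert to world units

speed = estimate_speed_mps(prior_length_m=5.67, prior_length_px=135.0,
                           displacement_px=120.0, elapsed_s=1.0)
print(f"{speed:.1f} m/s")  # ~5.0 m/s; step 5 is grading this number with MRA
```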
Three analogies:
- Map legend: If the legend says 1 cm = 1 km, any line you measure becomes a real distance. The prior is that legend.
- Recipe scaling: Double the flour, double the cookies. Change the prior, and all results should scale with it.
- Shoe size ruler: Stand on the foot scale once, then all toe marks become centimeters. One calibration unlocks everything else.
Before vs. After:
- Before: VLMs earned points for plausible talk and multiple-choice picks; big numeric mistakes could hide behind fancy words.
- After: Models must compute numbers from videos and a single anchor, and their closeness is scored. The mask comes off: either you measure or you don't.
Hook: Imagine trying to count steps in a dance while someone keeps changing the music's speed; you must follow the beat you're given.
The Concept (Counterfactual Priors): A counterfactual prior purposely changes the anchor (like making a car 1,000× longer). How it works: 1) Provide an altered prior, 2) The correct answer must scale accordingly, 3) Check if the model follows the new beat. Why it matters: If a model ignores the given number and sticks to "what cars usually are," it's guessing, not measuring.
Anchor: If a ball's diameter is said to be 0.5 m (not 0.25 m), all distances and speeds in meters should double. If they don't, the model didn't follow the input.
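A tiny consistency check captures the idea: because size and speed answers are linear in the scale factor, multiplying the stated prior by k should multiply a faithful answer by k. A hypothetical sketch:

```python
# Counterfactual-prior check: a faithful answer scales with the stated prior.

def expected_answer(baseline_answer: float, prior_multiplier: float) -> float:
    # Lengths and speeds are linear in the pixel-to-world scale factor,
    # so scaling the prior by k scales the correct answer by k.
    return baseline_answer * prior_multiplier

baseline_speed = 5.0  # m/s, computed with the original prior
for k in (0.1, 5.0, 100.0):
    print(f"prior x{k:g} -> expected speed {expected_answer(baseline_speed, k):g} m/s")
# A model that keeps answering ~5 m/s for all three priors is ignoring the input.
```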
Why it works (intuition without equations): Videos give you shape and motion in pixels. One trustworthy real-world fact creates a bridge from pixels to meters. Once the bridge is built, every other quantity can cross it. If the bridge (the prior) moves, all the traffic (the answers) must move with it.
Building blocks:
- Physical prior (size or time-stamped speed/acceleration) to set the scale.
- Pixel-to-world scaling to convert everything else.
- 2D vs. 3D handling (constant depth vs. depth changes with extra depth cues).
- Standardized prompts that demand "number + unit" only.
- MRA scoring that rewards being close and punishes huge misses.
- Diagnostic stress-tests: prior-only (remove video), counterfactual prior (change the anchor), and chain-of-thought (force step-by-step) to see if models truly measure.
Anchor: QuantiPhy turns a playground video plus one real fact (the track lane is 1.22 m wide) into a math lab where the AI must compute a runner's speed accurately.
03 Methodology
At a high level: Input (video + one prior) → Pixel measurement → Compute scale factor → Convert target quantity to world units → Output a single number with unit → Score with MRA.
Hook: Think of it like baking: one good measuring cup (the prior) makes all your ingredient amounts correct.
The Concept (The Recipe):
- 1) Collect videos with moving objects (2–3 s long) from simulations, lab captures, and internet clips.
- 2) Provide exactly one physical prior in text: size (static) or a time-stamped speed/acceleration (dynamic).
- 3) Ask a question about size, speed, or acceleration (possibly for a different object in the clip).
- 4) Require the model's final answer to be "number + unit" only.
- 5) Parse the number, compare with ground truth, and compute MRA.
- What breaks without it: Without the prior, pixels never become meters; without strict parsing, you can't grade; without MRA, tiny errors and huge errors get mixed unfairly.
Anchor: "Given the coin's diameter is 2.4 cm, what is the toy car's speed at 1.5 s in cm/s?"
Step-by-step with concrete data:
- Example: A yellow car moves sideways in a 30 fps clip. Prior: "Car length = 5.67 m." Measured pixel length on the frame near 2.0 s: 135 px. Scale factor γ ≈ 5.67 / 135 ≈ 0.042 m/px. Next, track the car center: frame 59 → x = 400 px; frame 60 → x = 430 px. Pixel velocity ≈ (430 - 400) / (1/30) = 30 px × 30 fps = 900 px/s (illustrative). World velocity ≈ 900 × 0.042 ≈ 37.8 m/s (if that's inconsistent with the actual ground-truth trajectory, the MRA will reflect it). Why this step exists: It shows the exact bridge from pixels to meters.
Anchor: If a basketball diameter is 0.24 m and its pixel diameter is 60 px, then 1 px = 0.004 m. If it moves 45 px between frames at 30 fps: speed = (45 px × 0.004 m/px) × 30 ≈ 5.4 m/s.
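The same arithmetic, written as a small Python helper that reproduces both worked examples (the helper name is ours, not the benchmark's):

```python
# Speed from per-frame pixel motion, reproducing the two worked examples above.

def speed_from_frames(prior_m: float, prior_px: float,
                      px_per_frame: float, fps: float) -> float:
    meters_per_px = prior_m / prior_px
    return px_per_frame * fps * meters_per_px  # (px/frame) * (frames/s) * (m/px)

# Yellow car: 5.67 m spans 135 px; its center moves 30 px between 30 fps frames.
print(f"car:        {speed_from_frames(5.67, 135, 30, 30):.1f} m/s")  # ~37.8 m/s
# Basketball: 0.24 m spans 60 px; it moves 45 px between frames at 30 fps.
print(f"basketball: {speed_from_frames(0.24, 60, 45, 30):.1f} m/s")   # ~5.4 m/s
```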
Hook: Flat skatepark vs. ramps with hills: 2D is flat, 3D has depth.
The Concept (2D vs. 3D Categories): 2D assumes constant depth (x-y motion only), 3D allows depth change (z motion). How it works: 1) 2D: one scale works frame-to-frame. 2) 3D: add depth info (from lab capture or annotations) so near/far changes rescale correctly. What breaks without it: A drone flying toward the camera will look "too big/fast" if depth isn't handled.
Anchor: A ball rolling left-to-right on a table (2D) vs. a toy car driving toward the camera (3D with changing scale).
Hook: Knowing what to measure is half the battle.
The Concept (Static vs. Dynamic Priors): Static = a fixed size (e.g., bottle height). Dynamic = speed or acceleration at a given time (e.g., the speed at 1.5 s). How it works: 1) Static: set γ from size, then convert velocities/accelerations. 2) Dynamic: match a known speed or acceleration at a timestamp to the pixel version to get γ. What breaks without it: You can't lock the pixel-to-world bridge at the needed moment.
Anchor: If at 1.5 s the skateboard's speed is 2.0 m/s and you measure 50 px/s, then 1 px = 0.04 m; now you can compute size or acceleration in meters.
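A sketch of the dynamic-prior case in Python, using the skateboard numbers; the deck length and pixel-space acceleration are made-up values for illustration:

```python
# Dynamic prior: calibrate the scale from a known speed at a timestamp,
# then convert any other pixel-space quantity. Illustrative numbers only.
known_speed_mps = 2.0       # prior: skateboard speed at t = 1.5 s
measured_speed_px_s = 50.0  # the same speed measured in pixels per second
meters_per_px = known_speed_mps / measured_speed_px_s  # 0.04 m/px

deck_length_px = 20.0       # hypothetical pixel measurement of the deck
accel_px_s2 = 75.0          # hypothetical pixel-space acceleration
print(f"deck length  ~ {deck_length_px * meters_per_px:.2f} m")   # 0.80 m
print(f"acceleration ~ {accel_px_s2 * meters_per_px:.1f} m/s^2")  # 3.0 m/s^2
```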
Prompting and parsing:
- Standardized prompt: include the prior text, ask for a specific quantity/time, and instruct "output only the number and unit."
- Parsing: extract the first valid number + unit; if none appears, count the attempt as a failure. Up to five tries per item are allowed to reduce flaky outputs.
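A rough approximation of that parsing step; the regex and examples here are our own simplification, not QuantiPhy's exact rules:

```python
import re

# Extract the first "number + unit" pair from a model's reply (simplified sketch).
ANSWER_RE = re.compile(r"(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z][A-Za-z0-9/^]*)")

def parse_answer(text: str):
    """Return (value, unit) for the first number+unit found, or None on failure."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)

print(parse_answer("5.4 m/s"))                       # (5.4, 'm/s')
print(parse_answer("The speed is about 37.8 m/s."))  # (37.8, 'm/s')
print(parse_answer("pretty fast, I think"))          # None -> counts as a failed try
```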
Scoring with MRA:
- Compute the relative error |pred - truth| / |truth|. Check whether it falls within a series of tolerance thresholds, from tight (e.g., 5%) to loose. Average these pass/fail checks to get MRA. This balances tolerance for measurement noise with punishment for big misses.
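A sketch of the metric as described above. It assumes the commonly used tolerance ladder of 5% steps up to 50% (the standard Mean Relative Accuracy formulation); the benchmark's exact thresholds may differ:

```python
# Mean Relative Accuracy for one prediction (sketch; threshold set is an assumption).

def mean_relative_accuracy(pred: float, truth: float) -> float:
    rel_err = abs(pred - truth) / abs(truth)
    tolerances = [0.05 * i for i in range(1, 11)]  # 5%, 10%, ..., 50%
    passes = [rel_err <= tol for tol in tolerances]
    return sum(passes) / len(passes)

print(mean_relative_accuracy(3.1, 3.0))   # 1.0 -> within every tolerance level
print(mean_relative_accuracy(3.5, 3.0))   # 0.7 -> misses the tightest levels
print(mean_relative_accuracy(31.0, 3.0))  # 0.0 -> off by an order of magnitude
```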
Secret sauce (diagnostic probes):
- Prior-only: Remove the video, keep the prior and question. If scores stay high, the model is likely guessing from world knowledge.
- Counterfactual prior: Multiply the prior by factors like 0.1×, 5×, 100×. If the model is truly scaling, its answers should change by the same factor; if not, it's ignoring the input.
- Chain-of-thought (CoT): Force a four-step path, (I) measure the source in pixels, (II) compute the scale, (III) measure the target in pixels, (IV) report the world value, to see if stepwise guidance helps.
Anchor: It's like testing a student three ways: with the full lab kit (video + prior), with only the formula (prior-only), and by making them show every calculation step (CoT).
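One way such probe variants could be generated from a single benchmark item; the prompt wording and counterfactual factor below are hypothetical, not QuantiPhy's actual templates:

```python
# Hypothetical construction of the three probes for one item (not the real templates).

def make_probes(video_path: str, prior_text: str, question: str, prior_value: float):
    base = f"{prior_text} {question} Output only the number and unit."
    return {
        # Full task: video + prior + question.
        "video_plus_prior": {"video": video_path, "prompt": base},
        # Prior-only: same text, no video; high scores here suggest guessing.
        "prior_only": {"video": None, "prompt": base},
        # Counterfactual: rescale the stated prior; a faithful answer must rescale too.
        "counterfactual_x100": {
            "video": video_path,
            "prompt": base.replace(f"{prior_value:g}", f"{prior_value * 100:g}"),
        },
    }

probes = make_probes("car_clip.mp4", "The car is 5.67 m long.",
                     "What is the car's speed at 2.0 s in m/s?", 5.67)
for name, probe in probes.items():
    print(f"{name}: {probe['prompt']}")
```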
04 Experiments & Results
The test: Measure how close 21 VLMs get to the true number on 3,355 video-text pairs across four categories: 2D-Static, 2D-Dynamic, 3D-Static, 3D-Dynamic. The metric is Mean Relative Accuracy (MRA), which rewards being "close enough" across multiple thresholds.
The competition: 6 proprietary models (e.g., ChatGPT-5.1, Gemini-2.5 Pro/Flash, Grok-4.1, Claude-4.5 Sonnet) and 15 open-weight models (e.g., the Qwen3-VL family, the InternVL-3.5 family, Molmo-7B, Phi-4 Multimodal, etc.). A human baseline study gives a reference for typical human performance without pixel tools.
The scoreboard with context:
- Humans average 55.6% MRA overall, which is good given they can't read pixels exactly.
- Best model: ChatGPT-5.1 at 53.1%, close to humans but not surpassing them on average. Gemini-2.5 Pro at 49.6%, Flash at 48.6%, Grok-4.1 at 45.0%.
- Best open-weight: Qwen3-VL-Instruct-32B at 46.0%; InternVL-3.5-30B at 40.7%.
- Scaling helps: bigger models in the same family usually score higher, especially on dynamic tasks (where time and motion matter), but still don't fully close the gap to humans.
Surprising findings:
- Prior-only vs. Video+Prior: Removing the video often doesn't hurt much; models can guess okay using world knowledge (like "cars are ~2 m wide"). That's a red flag for real measurement.
- Counterfactual prior: When the prior is multiplied by 0.1×, 5×, or 100×, most models fail to scale their answers accordingly; MRA drops sharply (often ~70–80%). This shows they're not faithfully using the provided number.
- Chain-of-thought prompting: For most models, forcing step-by-step reasoning did not help and sometimes hurt; small early errors in pixel reading or scaling get amplified. A few models improved, but it wasn't a general fix.
Scene context effects:
- Background complexity: Mild effects overall; interestingly, realistic complex scenes sometimes help by offering extra visual rulers (tiles, windows, lane markings).
- Number of moving objects: Multiple objects generally improve performance (more references to compare sizes/speeds).
Big picture: Even top models hover around human-like MRA instead of surpassing it, despite having exact pixel access in principle. This means current VLMs lean heavily on prior knowledge and verbal plausibility instead of building reliable pixel-to-world bridges when asked for numbers.
Analogy for the results: It's like students giving nice-sounding science explanations but failing the measurement lab; they talk well but don't always use the ruler you handed them.
05 Discussion & Limitations
Limitations:
- Translational motion only: No rotations yet (spins, turns), and no deformable/soft bodies.
- Fixed cameras: Real life often has moving cameras, zoom, or jitter; handling that is future work.
- Rigid-object bias: People, flags, and jellyfish-like motion can break the "rigid" assumption.
- Simplified interactions: Mostly isolated motions, not crowded physics puzzles with many contacts.
Required resources:
- To run the benchmark: access to model APIs or GPUs for open-weight models, video decoding, and a few GB of storage/time for 3.3K instances.
- For 3D understanding: some models may benefit from auxiliary depth or optical flow tools; these are not required by QuantiPhy but are useful in research.
When NOT to use:
- If your scenario depends on rotations, camera ego-motion, or fluid/soft-body behavior, QuantiPhy scores won't fully reflect your needs.
- If your model only outputs text descriptions without numeric precision, it will underperform by design.
Open questions:
- How to enforce input faithfulness so models must use the given prior and the actual video, not just memories of "typical" sizes and speeds?
- Can physics-aware training (synthetic scenes with precise labels, optical flow, or differentiable physics) boost quantitative reasoning?
- What prompting or tool-use patterns make CoT actually help rather than amplify early mistakes?
- How to robustly handle realistic camera motion and rotations while keeping evaluation fair and scalable?
- Can we build better intermediate checks (like requiring the model to output pixel measurements first) that correlate with final accuracy and faithfulness?
Honest take: QuantiPhy shows that today's VLMs often act like "smart guessers" rather than "careful measurers." Bridging that gap is essential for trustworthy embodied AI.
06 Conclusion & Future Work
Three-sentence summary: QuantiPhy is a standardized, quantitative benchmark that tests whether VLMs can turn videos plus one real-world prior into accurate numbers for size, speed, and acceleration. Across 21 models, even the best systems tend to rely on memorized world knowledge rather than faithfully using the video and the given prior, especially under counterfactual tests. The MRA-based leaderboard and diagnostic probes reveal a clear path forward: build models that measure, not just describe.
Main achievement: Turning kinematic reasoning into a rigorous pixel-to-world scaling problem, with one anchor prior, strict numeric outputs, and graded MRA scoring, so we can finally assess whether models compute or just conjecture.
Future directions:
- Expand motion types (rotations, deformables), camera motion, and multi-object interactions.
- Encourage input-faithful reasoning with physics-informed training, tool-augmented pipelines (e.g., optical flow + trackers), and better prompt/format constraints.
- Design stepwise evaluations that reward correct intermediate measurements to reduce error cascades.
Why remember this: QuantiPhy moves AI from "sounds right" to "measures right," a crucial step for robots, AR/VR, and autonomy, where numbers, not narratives, keep people safe and systems reliable.
Practical Applications
- Benchmark a new VLM's physical measurement skill before deploying it on robots.
- Stress-test input faithfulness by swapping priors (counterfactuals) and checking if outputs scale correctly.
- Augment model training with pixel-to-world scaling tasks derived from QuantiPhy items.
- Use QuantiPhy to validate AR apps that overlay distances/speeds, ensuring numbers match reality.
- Evaluate video generation realism by measuring if synthetic motions obey consistent kinematics.
- Compare open-weight vs. proprietary models fairly with the same prompts and MRA scoring.
- Prototype tool-augmented pipelines (tracking, optical flow) and measure gains on QuantiPhy.
- Design prompts that force numeric outputs and units, and track parsing success rates.
- Run prior-only ablations to detect over-reliance on world knowledge rather than video evidence.
- Use scene variants (simple vs. complex, single vs. multiple objects) to diagnose failure modes.