
A2Eval: Agentic and Automated Evaluation for Embodied Brain

Intermediate
Shuai Zhang, Jiayu Hu, Zijie Chen et al. Ā· 2/2/2026
arXiv Ā· PDF

Key Summary

  • A2Eval is a two-agent system that automatically builds and runs fair tests for robot-style vision-language models, cutting wasted work while keeping results trustworthy.
  • It discovers the right skill categories (like spatial reasoning and planning) on its own and picks a small but varied set of test questions so models are judged evenly.
  • Across 10 famous embodied benchmarks and 13 models, A2Eval shrinks the test suite by 85% (24,519 to 3,781 samples) without losing coverage.
  • It reduces overall compute cost by 77% and speeds up evaluation by up to 4.6Ɨ, saving days of GPU time for big models.
  • Despite being smaller and cheaper, its rankings match human preferences better (Spearman’s ρ = 0.85) than the original expert-built suites.
  • It keeps rankings faithful to the originals (ρ = 0.94, Ļ„ = 0.81) while fixing bias caused by over-represented easy tasks.
  • Its evaluation pipelines (inference + scoring) are auto-written and auto-checked, achieving 96.9% fidelity to reference implementations.
  • A human–agent agreement study shows strong consistency on skill labels (Cohen’s Īŗ ā‰ˆ 0.78; human IAA ā‰ˆ 0.80).
  • A2Eval is the first end-to-end agentic framework that both curates balanced tests and executes them automatically for embodied AI.

Why This Research Matters

Fair, fast evaluation is the compass that guides embodied AI toward useful, real-world skills. By removing duplicates and balancing skills like physics and planning, A2Eval prevents ā€œgaming the testā€ and makes leaderboards reflect genuine capability. Its 77% compute savings and up to 4.6Ɨ speedups mean researchers and companies can iterate more often with less cost and energy. Better human alignment reduces the risk of shipping models that look good on paper but fail in homes, hospitals, or warehouses. Because A2Eval auto-writes and verifies pipelines, results are more reproducible across labs. This raises the standard for trustworthy, environmentally friendlier evaluation in embodied AI.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): Imagine grading a class where most test questions are about spelling, very few are about writing stories, and many questions are just copies of each other. Some students who memorize spelling lists look like stars, even if they can’t write a paragraph. That’s not fair.

🄬 Filling (The Actual Concept): The world before A2Eval relied on static, expert-built benchmarks for embodied vision-language models (VLMs)—AI that sees, understands, and plans actions in the world.

  • What it is: These benchmarks were big collections of tasks about perception, spatial reasoning, physics, planning, and more—all hand-designed and hand-labeled.
  • How it works (before): Experts collected tasks across many datasets, then researchers ran models across all of them to get scores and rankings.
  • Why it matters: Rankings guide what gets built next. If the test is skewed or bloated, we waste compute, spend lots of money, and chase the wrong improvements.

šŸž Bottom Bread (Anchor): Think of 10 different spelling tests that all ask the same 90% of questions. You’ll finish exhausted, without knowing who can truly write well.

šŸž Top Bread (Hook): You know how some games feel fun because they try lots of different challenges—puzzles, races, and strategy—so you can’t win by being good at just one thing?

🄬 Filling (The Actual Concept): The problem the researchers faced was a broken evaluation ecosystem.

  • What it is: Today’s embodied benchmarks have heavy redundancy (up to 92% similarity), skewed coverage (easy tasks like simple spatial cues dominate), and huge cost (over 3,200 GPU hours to evaluate one model across fragmented suites).
  • How it hurts:
    1. Coverage imbalance and redundancy: many near-duplicates and over-represented tasks.
    2. Ranking distortion: models that overfit to common, easy tasks look great even if they’re weak at physics or planning.
    3. Prohibitive evaluation cost: running everything (and writing per-benchmark inference and scoring code) is slow and expensive.
  • Why it matters: Bad rankings mislead research and product decisions; slow iterations stall progress.

šŸž Bottom Bread (Anchor): It’s like practicing only layups in basketball and then being crowned the ā€œbest all-around player.ā€ The tests fool everyone.

šŸž Top Bread (Hook): Imagine reorganizing a messy bookshelf. Instead of guessing what books you have, you scan them all, sort by topic, and keep one or two great examples of each—fast and neat.

🄬 Filling (The Actual Concept): Previous attempts tried to generate new benchmarks (like Code2Bench or OK-Bench) or add new system frameworks, but they didn’t address consolidation, redundancy, or cross-benchmark bias.

  • What they did: Focused on making fresh datasets with rules for cleanliness and reproducibility.
  • Why it fell short: They didn’t merge existing diverse sources into one compact, balanced, redundancy-aware suite, nor did they automate the full evaluation pipeline.
  • What breaks without consolidation: You still pay the full cost, keep the skew, and get distorted rankings.

šŸž Bottom Bread (Anchor): If your closet is overflowing, buying more hangers isn’t enough—you need to sort, keep the essentials, and toss duplicates.

šŸž Top Bread (Hook): Think of building a fair sports tryout: you pick clear skill categories (speed, agility, strength), choose varied drills for each, and score everyone the same way, automatically.

🄬 Filling (The Actual Concept): The gap A2Eval fills is an agentic, end-to-end system that both curates a balanced test and runs the evaluation automatically.

  • What it is: A2Eval uses two collaborating agents—Data Agent and Eval Agent—to discover capability dimensions, sample a compact but diverse set of examples, and auto-synthesize reliable inference and scoring pipelines.
  • How it works:
    1. Data Agent induces capability dimensions (like Spatial & Geometric Reasoning or Physical & Causal Reasoning), assigns examples, and removes redundancy through clustering.
    2. Eval Agent writes and validates runnable code that feeds models the inputs and scores their answers consistently.
  • Why it matters: You get fairer rankings, 85% smaller tests, 77% less compute, and 4.6Ɨ speedups—while matching human judgment better.

šŸž Bottom Bread (Anchor): Like turning a messy exam into a tidy, fair quiz that still checks all the important skills but takes a quarter of the time—and the grades make more sense to teachers.

02 Core Idea

šŸž Top Bread (Hook): You know how a chef’s tasting menu gives you just a few bites that still cover all the flavors? It’s short but complete.

🄬 Filling (The Actual Concept): A2Eval’s key insight is to treat building and running benchmarks as an optimization that two agents can do automatically: cover all important skills, cut duplicates, and execute fair scoring without human hand-holding.

  • What it is (one sentence): Automate both the test-building (which skills, which examples) and the test-running (how to query models and score them) so evaluation becomes balanced, compact, and trustworthy.
  • How it works:
    1. Discover the right skill buckets (capability dimensions) from many datasets automatically.
    2. Assign each example to a skill and sample a small, diverse set per skill using clustering.
    3. Auto-write, auto-run, and auto-correct inference and scoring code in a sandbox until it works reliably.
  • Why it matters: Without this, evaluations stay bloated, biased, and expensive, and rankings can’t be fully trusted.

šŸž Bottom Bread (Anchor): It’s like making a playlist with just one great song from each genre, then having a DJ who can play and judge them perfectly every time.

— Multiple Analogies —

  1. School Fairness: Instead of a 300-question test mostly about vocabulary, A2Eval makes an 8-section quiz where each section is balanced (perception, spatial reasoning, numbers, affordances, physics, planning, dynamics, scene understanding).
  2. Library Curator: From thousands of similar books, keep a few that represent each topic; you still learn the whole field without reading repeats.
  3. Sports Tryout: Pick drills that evenly test speed, agility, strength, strategy, and stamina; don’t let 80% of the tryout be sprinting.

— Before vs After —

  • Before: Expert-defined categories, many redundant samples, per-benchmark coding, skewed rankings, very high GPU cost.
  • After: Agent-induced categories, compact diversity-aware sampling, auto-generated inference and scoring, corrected ranking biases, 4.6Ɨ faster.

— Why It Works (Intuition) —

  • Balanced dimensions prevent any one easy skill (like simple spatial cues) from overpowering the final score.
  • Clustering picks a spread of examples that cover the full ā€œmeaning space,ā€ so removing duplicates doesn’t remove coverage.
  • Sandbox-validated code ensures consistent, reproducible scoring across models and benchmarks.
  • Together, this preserves ranking fidelity (Spearman’s ρ ā‰ˆ 0.94 with original) while improving human alignment (ρ ā‰ˆ 0.85).

— Building Blocks (each with a Sandwich) —

  1. šŸž Hook: You know how we sort school subjects into math, science, and reading? 🄬 Capability Dimensions:

    • What: The main skill buckets for embodied reasoning (PercepObj, SceneAct, SpatGeo, QuantNum, AffdFunc, PhysCaus, DecPlan, DynScene).
    • How: The system proposes, critiques, and finalizes the set until coverage is complete and non-overlapping.
    • Why: Without clear buckets, you can’t balance or interpret results. šŸž Anchor: Like grading students separately in math and reading so a reading whiz doesn’t hide a math gap.
  2. šŸž Hook: Imagine a committee that designs the best, shortest test. 🄬 Data Agent:

    • What: An agent that discovers dimensions, assigns examples via multi-voter majority, and picks diverse samples via clustering.
    • How: Proposer suggests dimensions; Reviewer critiques balance/overlap; Assigner labels and samples with diversity.
    • Why: Without it, you keep redundant, skewed tests. šŸž Anchor: A small, well-chosen set of questions across all subjects beats a giant pile of repeats.
  3. šŸž Hook: Think of a coach who writes drills and the referee who tracks points. 🄬 Eval Agent:

    • What: An agent that auto-writes inference and scoring code, tests it in a sandbox, and iterates until correct.
    • How: Generate code → run → read errors → fix → validate; then run on the full compact benchmark.
    • Why: Without it, you hand-code per dataset, risking bugs and inconsistencies. šŸž Anchor: A universal, reliable ā€œgraderā€ makes scores fair across players.

03 Methodology

At a high level: Input (many benchmarks) → Data Agent (Dimension Induction → Assignment → Diversity-Aware Sampling) → Eval Agent (Inference Logic → Scoring Logic) → Output (per-skill and overall scores).
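
To make this flow concrete, here is a minimal Python sketch of the two-agent pipeline. The dataclass, the callables, and the `run_a2eval` orchestrator are illustrative assumptions rather than the authors' actual code; in the real system each step is driven by LLM agents and agent-generated, sandbox-validated code. Steps 1 to 6 below describe how A2Eval fills in each of these pieces.

```python
# Minimal, hypothetical sketch of the A2Eval flow above. Names and signatures
# are illustrative stand-ins; in the real system each callable is implemented
# by an LLM agent or by agent-generated, sandbox-validated code.
from dataclasses import dataclass
from typing import Callable, Dict, List

# The eight capability dimensions the paper reports.
DIMENSIONS = ["PercepObj", "SceneAct", "SpatGeo", "QuantNum",
              "AffdFunc", "PhysCaus", "DecPlan", "DynScene"]

@dataclass
class Example:
    source_benchmark: str   # e.g. "VSI-Bench"
    question: str
    answer: str
    dimension: str = ""     # filled in by the Data Agent

def run_a2eval(pool: List[Example],
               assign: Callable[[Example], str],                       # Data Agent: label one dimension
               sample: Callable[[List[Example], int], List[Example]],  # Data Agent: diversity sampling
               infer: Callable[[Example], str],                        # Eval Agent: model inference
               score: Callable[[str, str], float],                     # Eval Agent: scoring
               k_per_dim: int = 500) -> Dict[str, float]:
    # 1) Assign every example to its single most relevant dimension.
    for ex in pool:
        ex.dimension = assign(ex)
    # 2) Keep a compact, diverse subset per dimension (retain all if fewer than k).
    compact: List[Example] = []
    for dim in DIMENSIONS:
        members = [ex for ex in pool if ex.dimension == dim]
        if members:
            compact += sample(members, min(k_per_dim, len(members)))
    # 3) Run the model on the compact suite and score each prediction.
    per_dim: Dict[str, List[float]] = {}
    for ex in compact:
        per_dim.setdefault(ex.dimension, []).append(score(infer(ex), ex.answer))
    # 4) Report per-dimension mean scores.
    return {dim: sum(vals) / len(vals) for dim, vals in per_dim.items()}
```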

  1. šŸž Hook: You know how a team brainstorms ideas, gets feedback, then finalizes a plan? 🄬 Dimension Induction (Proposer–Reviewer Loop):

    • What: Automatically discover a clean list of capability dimensions for embodied VLMs.
    • How (step by step):
      1. Proposer reads all benchmark info and suggests a dimension set D.
      2. Reviewer checks for overlap, missing skills, and balance problems.
      3. Memory stores proposals and critiques to avoid repeating mistakes.
      4. Iterate until D stabilizes, then validate via real assignments and sampling feedback.
    • Why: Without a good, stable skill map, you can’t fairly balance the test. šŸž Anchor: Like refining school subjects until you cover everything important without duplicates (no ā€œmath-1ā€ and ā€œmath-2ā€).
  2. šŸž Hook: Imagine three judges labeling each question’s main topic, then taking a vote. 🄬 Dimension Assignment (Assigner with Voters):

    • What: Assign each example to the single most relevant dimension via majority vote.
    • How:
      1. N voter agents each predict a dimension for an example.
      2. The final label is decided by majority voting.
      3. Pool examples per dimension.
    • Why: One judge can be noisy; voting stabilizes labels and reduces misclassification. šŸž Anchor: If 5 teachers vote an essay is ā€œscience,ā€ that’s more reliable than 1 teacher’s opinion.
  3. šŸž Hook: Picture spreading pushpins across a map so they aren’t clumped in one corner. 🄬 Diversity-Aware Sampling (Clustering):

    • What: Keep K = 500 varied examples per dimension by clustering embeddings and picking one near each centroid.
    • How:
      1. Encode text + visuals (CLIP for images/videos; sentence embeddings for text).
      2. Cluster each dimension’s pool into K groups.
      3. Select the example closest to each cluster center.
    • Why: Without diversity sampling, you keep many near-duplicates and miss rare but important cases (a code sketch of steps 2 and 3 appears after this list). šŸž Anchor: From 10,565 spatial samples (SpatGeo), keep 500 that cover all patterns, not 500 copies of the same layout.
  4. šŸž Hook: Think of a careful lab assistant who writes the exact steps to run every experiment so anyone can repeat it. 🄬 Model Inference Logic (Evaluator role):

    • What: Auto-generate runnable code that loads the model once, feeds inputs (32 frames for videos), and outputs predictions.
    • How:
      1. Write code → run in sandbox.
      2. If errors occur, read diagnostics and fix.
      3. Finalize when stable across samples.
    • Why: Hand-written scripts vary and break; auto-validated code is consistent and reproducible. šŸž Anchor: A single reliable function that calls a VLM on all test items with the same settings every time.
  5. šŸž Hook: Now imagine a fair referee who knows the exact scoring rules for each game. 🄬 Scoring Logic (Scorer role):

    • What: Auto-generate code that scores predictions correctly (exact match, multiple-choice, numbers, etc.).
    • How:
      1. Write scoring function → run in sandbox on predictions.
      2. Fix until outputs are valid and consistent.
      3. Produce per-dimension and overall metrics.
    • Why: Without reliable scoring, two models could be graded differently on the same answer. šŸž Anchor: The referee’s scorecard is the same for everyone, every time.
  6. šŸž Hook: You know how a smaller backpack is easier to carry if it still has everything you need? 🄬 Benchmark Compression:

    • What: Shrink the suite by 85% (24,519 → 3,781) while keeping balanced coverage across 8 dimensions.
    • How: Uniform K = 500 per dimension when possible; retain all when fewer exist (e.g., PhysCaus kept 366).
    • Why: Smaller, balanced tests cut cost by 77% and speed up by up to 4.6Ɨ without losing evaluation quality. šŸž Anchor: Qwen3-VL-235B-A22B-Thinking goes from 412.9 to 89.4 hours—same insights, less waiting.
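
For steps 2 and 3 (majority-vote assignment and diversity-aware sampling), here is a minimal sketch assuming pre-computed embeddings (e.g., CLIP features for images/videos, sentence embeddings for text) and scikit-learn's KMeans. The voter interface, cluster settings, and function names are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch of step 2 (majority-vote assignment) and step 3
# (diversity-aware sampling): cluster a dimension's pool and keep the
# example nearest each cluster centroid.
from collections import Counter
from typing import Callable, List, Sequence

import numpy as np
from sklearn.cluster import KMeans

def majority_vote_dimension(example: dict,
                            voters: Sequence[Callable[[dict], str]]) -> str:
    """Each voter (an LLM call in A2Eval) names one dimension; the mode wins."""
    votes = [voter(example) for voter in voters]
    return Counter(votes).most_common(1)[0][0]

def diversity_aware_sample(embeddings: np.ndarray, k: int = 500) -> List[int]:
    """Return indices of k examples spread across the embedding space:
    one representative per k-means cluster (the point closest to its centroid)."""
    k = min(k, len(embeddings))     # retain everything if the pool is small
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    picked: List[int] = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))
    return picked
```

Calling `diversity_aware_sample` on the SpatGeo pool's embeddings with k = 500 would return 500 indices spread across the embedding space rather than 500 near-duplicate layouts.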

Secret Sauce (What makes it clever):

  • Multi-agent induction and voting find and balance the right skills.
  • Embedding-based clustering keeps wide semantic coverage with few samples.
  • Sandbox-validated code eliminates per-dataset hand-coding and bugs.
  • Result: Strong human alignment (ρ = 0.85) and high fidelity (96.9%) in a fraction of the time.
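
To ground the sandbox-validated pipeline idea from steps 4 and 5, here is a hypothetical sketch of the generate, run, and fix loop, plus the uniform 32-frame sampling used for video inputs. `llm_write_code`, `llm_fix_code`, and `smoke_test` are placeholders for the Eval Agent's LLM calls and validation checks; a production system would execute generated code in an isolated sandbox rather than a bare `exec`.

```python
# Hypothetical sketch of the Eval Agent's generate -> run -> fix loop, plus
# the uniform 32-frame sampling used for video inputs. A real system would
# execute the generated code in an isolated sandbox, not a bare exec().
import traceback
from typing import Callable, Dict, List

import numpy as np

def uniform_frame_indices(total_frames: int, n_frames: int = 32) -> List[int]:
    """Pick n_frames evenly spaced frame indices from a video."""
    n = min(n_frames, total_frames)
    return np.linspace(0, total_frames - 1, num=n, dtype=int).tolist()

def build_validated_pipeline(spec: str,
                             llm_write_code: Callable[[str], str],
                             llm_fix_code: Callable[[str, str], str],
                             smoke_test: Callable[[Dict], None],
                             max_rounds: int = 5) -> str:
    """Ask an LLM for pipeline code, run it on a small smoke test, and feed any
    traceback back to the LLM until the code passes (or we give up)."""
    code = llm_write_code(spec)
    for _ in range(max_rounds):
        namespace: Dict = {}
        try:
            exec(code, namespace)    # define the inference/scoring functions
            smoke_test(namespace)    # e.g. run them on a handful of samples
            return code              # validated: stable across the smoke samples
        except Exception:
            code = llm_fix_code(code, traceback.format_exc())
    raise RuntimeError("Pipeline still failing after max_rounds fix attempts")
```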

04 Experiments & Results

šŸž Hook: If two judges rank runners the same way, you trust the race; if one judge also saves time and money, you love that race.

🄬 The Test:

  • What: Validate that A2Eval’s compact benchmark stays faithful to original rankings, aligns better with humans, and saves major compute.
  • How: Use 10 embodied benchmarks (e.g., COSMOS, ERQA, Where2Place, VSI-Bench, OmniSpatial, EgoSchema, BLINK, RefSpatial, RoboSpatial, EmbSpatialBench) and 13 VLMs (Qwen2.5-VL to Qwen3-VL, InternVL3.5, GPT-5 Mini). Compare original vs A2Eval’s curated suite. Measure ranking correlations (Spearman’s ρ, Kendall’s Ļ„), human alignment, and wall-clock time.
  • Why: Without these checks, a smaller test might miss important skills or mis-rank models.

šŸž Anchor: It’s like showing that a 30-minute quiz gives almost the same ranking as a 3-hour test—and people agree it feels fairer.

The Competition:

  • Baseline: The union of original, expert-defined, manually annotated benchmarks.
  • A2Eval: Agentic consolidation with balanced dimensions and diversity-aware sampling, plus auto inference/scoring.

Scoreboard with Context:

  • Compression and Speed: Suite shrinks by 85% (24,519 → 3,781); costs drop 77%; speedups 3.4×–4.6Ɨ. Example: 412.9 h → 89.4 h for Qwen3-VL-235B-A22B-Thinking. That’s like finishing in one weekend vs almost two.
  • Fidelity to Original: Ranking correlation between A2Eval and original stays high (Spearman’s ρ = 0.94; Kendall’s Ļ„ = 0.81). Like giving almost the same class ranks with a much shorter test.
  • Human Alignment: A2Eval improves agreement with human rankings (ρ = 0.85; Ļ„ = 0.72) vs original (ρ = 0.83; Ļ„ = 0.64). That’s like moving from a B+ to an A- in listening to human judgment.
  • Eval Agent Reliability: Inference + scoring pipelines achieve 96.9% average fidelity to references; dimension-wise inference fidelity ā‰ˆ 93.6% and scoring fidelity ā‰ˆ 97.9%. That’s like your auto-grader matching the teacher almost perfectly.
  • Human–Agent Agreement on Labels: Cohen’s Īŗ ā‰ˆ 0.78; inter-annotator agreement ā‰ˆ 0.80—strong consistency on which skill each item tests.
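
The fidelity and agreement statistics above are standard rank and agreement measures; a minimal sketch of how they are computed (SciPy for Spearman’s ρ and Kendall’s Ļ„, scikit-learn for Cohen’s Īŗ), using made-up numbers purely for illustration:

```python
# Sketch: computing the rank-fidelity and agreement statistics reported above.
# The numbers below are made up purely for illustration.
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

# Per-model scores on the original union suite vs. the compact A2Eval suite.
original_scores = [51.1, 48.4, 62.0, 55.3, 70.2]
a2eval_scores   = [54.8, 61.5, 63.1, 56.0, 71.9]

rho, _ = spearmanr(original_scores, a2eval_scores)    # ranking fidelity (Spearman)
tau, _ = kendalltau(original_scores, a2eval_scores)   # ranking fidelity (Kendall)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")

# Skill labels from a human annotator vs. the Data Agent for the same items.
human_labels = ["SpatGeo", "DecPlan", "PhysCaus", "SpatGeo", "PercepObj"]
agent_labels = ["SpatGeo", "DecPlan", "PhysCaus", "QuantNum", "PercepObj"]
print(f"Cohen's kappa = {cohen_kappa_score(human_labels, agent_labels):.2f}")
```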

Surprising Findings:

  • Corrected Rankings: After rebalancing dimensions (e.g., boosting under-tested PhysCaus, DecPlan; reducing over-weighted SpatGeo), some model orders flip in a way humans prefer. Case: On the original suite, Qwen2.5-VL-72B-Instruct beats Qwen3-VL-30B-A3B-Instruct (51.07 vs 48.36). In A2Eval’s balanced suite, the 30B model leads (61.47 vs 54.76), matching human preference (62.5 vs 61.3).
  • Ablation Insights: Copying source proportions keeps source agreement high (ρ = 0.98) but hurts human alignment (ρ = 0.81). Dimension-aware balancing helps (ρ = 0.82). Adding diversity-aware sampling boosts human alignment to ρ = 0.85 while keeping good source agreement (ρ = 0.94).

šŸž Anchor: When you test basketball players equally on shooting, defense, passing, and teamwork—not just layups—the team you pick is closer to what a coach would pick.

05 Discussion & Limitations

šŸž Hook: Even the best maps need good starting data and may miss new roads.

🄬 Limitations:

  • Data Dependence: A2Eval learns from existing benchmarks. If the pool lacks rare but crucial cases, the curated suite may still miss them.
  • Residual Bias: While balancing helps, upstream annotation biases can persist in the input pool.
  • Fixed K per Dimension: Using K = 500 is simple and balanced, but not always optimal for very uneven dimensions.
  • Domain Shift: New domains with little or no prior data may need bootstrapping before A2Eval shines.
  • Long-Video Nuances: Uniform 32-frame sampling is practical but may skip subtle temporal cues in very long videos.

Resources Required:

  • Strong LLMs for agent roles (e.g., Gemini 3 Pro, GPT-4o), CLIP and sentence embeddings, and a sandbox executor. Some GPU and engineering overhead are needed for initial setup.

When NOT to Use:

  • Brand-new domains with scarce examples.
  • Tasks requiring interactive, online robot trials rather than offline Q&A/video reasoning.
  • If you explicitly need every original item (e.g., historical comparability mandates the full set).

Open Questions:

  • Adaptive K: Can we choose per-dimension K based on measured diversity and difficulty?
  • Streaming/Interactive Eval: Can we extend agentic pipelines to real-time embodied interactions and safety checks?
  • Robustness to Contamination: How to detect and handle training-test overlaps automatically across mixed sources?
  • Generalization: How well does agentic consolidation transfer to other multimodal or non-embodied domains (e.g., code, audio)?

šŸž Anchor: It’s like a smart study guide that’s great at summarizing known material but still needs fresh chapters when the course changes.

06 Conclusion & Future Work

šŸž Hook: Think of turning a messy, expensive exam into a small, fair quiz that teachers trust more and students finish faster.

🄬 Three-Sentence Summary: A2Eval is the first end-to-end agentic framework that both curates and executes embodied VLM evaluations automatically. Its Data Agent discovers balanced capability dimensions and selects a compact, diverse set of examples, while its Eval Agent auto-writes reliable inference and scoring code. The result is an 85% smaller benchmark, 77% less compute, 4.6Ɨ speedups, and rankings that better match human judgment while preserving fidelity to the originals.

Main Achievement: Showing that automated, capability-aware consolidation plus auto-validated pipelines can fix redundancy, reduce cost, and improve fairness—without sacrificing accuracy.

Future Directions: Make K adaptive per dimension, extend to interactive/streaming robot evaluations, add bias/contamination detectors, and port the framework to other domains (audio, code, math).

Why Remember This: A2Eval proves we don’t need giant, skewed test piles to get trustworthy leaderboards; with smart agents, we can evaluate faster, cheaper, and more fairly—accelerating progress in embodied AI the way a great coach accelerates a team’s growth.

Practical Applications

  • Build a compact, balanced benchmark for a new embodied domain by running the Data Agent on existing datasets.
  • Periodically re-curate benchmarks as new data arrives to keep coverage balanced without manual relabeling.
  • Auto-generate standardized inference and scoring code for each new model to ensure reproducible comparisons.
  • Use per-dimension scores (e.g., PhysCaus vs DecPlan) to target model training on the weakest skills.
  • Adopt the 32-frame sampling recipe for video to cut compute while keeping temporal coverage.
  • Deploy A2Eval in CI pipelines to quickly sanity-check new model versions before large releases.
  • Run ablations (dimension-aware vs diversity-aware) to tune human alignment for your domain.
  • Leverage human–agent agreement studies to validate or refine the induced capability taxonomy.
  • Monitor ranking fidelity metrics (Spearman’s ρ, Kendall’s Ļ„) while adjusting K per dimension for efficiency.
  • Port the agentic workflow to adjacent tasks (e.g., robot affordance datasets) to reduce annotation and compute.
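
For the CI idea above, a hypothetical gate could compare a new compact-suite ranking against a trusted reference and fail the build when rank fidelity drops. The threshold and function below are assumptions for illustration, not part of A2Eval.

```python
# Hypothetical CI gate: fail a release if the compact suite's model ranking
# drifts too far from a trusted reference ranking.
from scipy.stats import spearmanr

def ranking_gate(reference_scores: dict, compact_scores: dict,
                 min_rho: float = 0.90) -> None:
    """Raise if the Spearman correlation between the two rankings is too low."""
    models = sorted(set(reference_scores) & set(compact_scores))
    ref = [reference_scores[m] for m in models]
    new = [compact_scores[m] for m in models]
    rho, _ = spearmanr(ref, new)
    if rho < min_rho:
        raise SystemExit(f"Ranking drift: Spearman rho = {rho:.2f} < {min_rho}")
    print(f"Ranking fidelity OK (rho = {rho:.2f})")
```
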
#Embodied AI #Vision-Language Models #Agentic Evaluation #Benchmark Compression #Capability Dimensions #Diversity-Aware Sampling #Ranking Fidelity #Human Alignment #Automated Inference #Automated Scoring #CLIP Embeddings #UMAP Visualization #Multi-agent Collaboration #Redundancy Reduction #Spearman Correlation
Version: 1