AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context
Key Summary
- AACR-Bench is a new test set that checks how well AI can do code reviews using the whole project, not just one file.
- It covers 50 popular repositories across 10 programming languages and includes 200 pull requests and 1,505 verified review comments.
- The ground truth was built with an AI-assisted, human expert-verified pipeline, boosting issue coverage by 285% beyond the original PR comments.
- Each review comment is labeled by the context it truly needs (Diff, File, or Repo), so we can measure how models handle cross-file reasoning.
- Results show that adding context doesn't always help; the best retrieval method depends on the model and the language.
- Agent-based systems are very precise but miss many issues (high precision, low recall), while traditional methods find more but include more noise.
- Programming language matters: models behave differently on C, C#, Python, Java, Go, Rust, etc., showing clear language-specific bias.
- AACR-Bench reveals that the level of context (local vs. repository-wide) and the choice of retrieval strategy can change scores a lot.
- The benchmark provides a fairer, more realistic way to compare automated code review systems before using them in real projects.
Why This Research Matters
This benchmark makes automated code review testing feel like real life: the models get to see whole projects and many languages, just like engineers do. That means teams can pick tools based on trustworthy, apples-to-apples results instead of guesses. It also shows when extra context helps and when it gets in the way, so companies can tune systems for their specific stacks. By labeling each issue with the context it truly requires, it shines a light on cross-file reasoning, which is where many costly bugs hide. Finally, because the answer key was built by AI plus human experts, it covers more real issues without sacrificing accuracy, making it a strong foundation for research and practical deployments.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how a teacher doesn't just grade one sentence of your essay; they read the whole thing to really understand your idea? Code review is like that: to catch real problems, you often need to see more than a tiny snippet.
Filling (The Actual Concept)
- What it is: Automated Code Review (ACR) uses computers, often large language models (LLMs), to read code changes and suggest useful review comments.
- How it works: 1) A developer opens a Pull Request (PR) showing what changed. 2) The ACR tool examines the "diff hunks" (the added/removed lines). 3) It may fetch extra context, like other files, to understand dependencies. 4) It then writes comments about bugs, security, performance, or maintainability. 5) Humans read, accept, or reject those comments.
- Why it matters: Without good ACR, mistakes slip into production, costing time, money, and sometimes safety.
Bottom Bread (Anchor) Imagine changing a function's name in one file but forgetting to update its use in another file. A smart code review tool should catch that by seeing the whole project, not just the changed file.
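To make that workflow concrete, here is a minimal sketch of the review loop in Python. It is an illustration only: the hunk fields and the `retrieve_context` / `ask_model` callables are hypothetical placeholders, not the interface of any specific ACR tool.

```python
# Minimal sketch of an ACR loop; the data shapes and callables are hypothetical.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DiffHunk:
    file: str
    new_lines: Tuple[int, int]   # line range touched in the changed file
    patch: str                   # the added/removed lines

@dataclass
class ReviewComment:
    file: str
    lines: Tuple[int, int]
    message: str

def review_pull_request(
    title: str,
    description: str,
    hunks: List[DiffHunk],
    retrieve_context: Callable[[DiffHunk], str],                # step 3: optional repo context
    ask_model: Callable[[str, str, DiffHunk, str], List[str]],  # step 4: the LLM call
) -> List[ReviewComment]:
    """Walk every diff hunk, optionally fetch extra context, and collect comments."""
    comments = []
    for hunk in hunks:                                          # step 2: examine diff hunks
        context = retrieve_context(hunk)
        for message in ask_model(title, description, hunk, context):
            comments.append(ReviewComment(hunk.file, hunk.new_lines, message))
    return comments                                             # step 5: humans accept or reject
```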
Top Bread (Hook) You know how a map app helps you plan a trip across several cities, not just one street? Code review also needs a "map" of the whole repository.
Filling (The Actual Concept)
- What it is: Repository-level context means the model can look beyond one file and consider the entire project (other files, configurations, PR descriptions, and dependencies).
- How it works: 1) Identify what changed. 2) Pull in related files, types, and functions. 3) Understand how pieces connect. 4) Judge the change with that bigger picture.
- Why it matters: Many bugs are cross-file. If you only see a small diff, you'll miss issues that live somewhere else.
Bottom Bread (Anchor) If a new function writes to a log file but the log path is configured in another file, only repo-level context reveals the mismatch.
Top Bread (Hook) Imagine a spelling checker that sometimes flags the wrong words. If your gold answers are noisy or incomplete, you can't really tell which checker is best.
Filling (The Actual Concept)
- What it is: Ground Truth is the carefully checked "answer key" used to score model outputs.
- How it works: 1) Start with real PR comments. 2) Use multiple AIs to propose more possible issues. 3) Have human experts verify and label the real issues. 4) Remove duplicates. 5) Mark what context (Diff/File/Repo) each issue truly needs.
- Why it matters: If the answer key is missing lots of real problems, models that find those problems look "wrong," and evaluation becomes unfair.
Bottom Bread (Anchor) AACR-Bench's answer key has 1,505 verified review comments, 285% more coverage than the original PR comments alone.
Top Bread (Hook) Think of detectives teaming up: one finds clues, another confirms them. That's faster and more thorough than working alone.
Filling (The Actual Concept)
- What it is: AI-assisted human expert verification is a workflow where AI suggests issues and trained engineers confirm what's real.
- How it works: 1) Several LLMs generate review comments. 2) Remove duplicates with semantic checks. 3) Two human annotators verify each comment; disagreements go to a core team to resolve. 4) Label issue type and the true context level needed.
- Why it matters: It balances coverage (AI finds more) and correctness (humans verify), creating a strong benchmark.
Bottom Bread (Anchor) In AACR-Bench, 80+ engineers verified comments proposed by six models across two frameworks, dramatically expanding reliable coverage.
Top Bread (Hook) Picture a worldwide science fair where entries come from many countries. You learn more when you see variety.
Filling (The Actual Concept)
- What it is: Multi-language support means the benchmark covers code written in many programming languages.
- How it works: 1) Select 10 popular languages (e.g., Python, Java, Go, C, C#, C++, JavaScript, TypeScript, PHP, Rust). 2) Choose 5 active repositories per language. 3) Collect PRs and comments. 4) Balance topics and sizes. 5) Verify issues.
- Why it matters: Models behave differently across languages; testing only one language can hide weaknesses.
Bottom Bread (Anchor) A rule that's great for Python might not work for C. AACR-Bench measures both, so you can trust results across many ecosystems.
Putting it all together, before AACR-Bench, many ACR tests used raw PR comments (often noisy and incomplete) and focused on one language (often Python), missing the reality that real software spans languages and cross-file dependencies. AACR-Bench fills that gap by giving models full, multilingual repository context and a carefully verified answer key, so we finally get fair, clear scores that reflect real-world needs.
02 Core Idea
Top Bread (Hook) You know how a puzzle is much easier when you can see the picture on the box and someone has checked the pieces aren't missing? That's what this paper does for code review.
Filling (The Actual Concept)
- What it is (one sentence): The key idea is a multilingual, repository-level benchmark, AACR-Bench, built with an AI-assisted, human-verified pipeline that exposes more real issues and fairly measures how well ACR systems work with full-project context.
Multiple Analogies:
- Library analogy: Instead of judging a book by a single page (diff), AACR-Bench lets you read the chapter and the index (repo-level), and librarians (experts) verify the summaries (ground truth).
- Sports analogy: Rather than timing one sprint, it measures a decathlon across 10 events (languages), with certified referees and replay footage (verified annotations and repo context).
- Cooking analogy: Don't taste just the salt; taste the whole dish (repo), in different cuisines (languages), with a chef panel confirming which recipes actually work (expert verification).
Before vs After:
- Before: Benchmarks relied on raw PR comments (noisy, incomplete), were often single-language, and lacked cross-file context, leading to misleading scores.
- After: AACR-Bench provides multilingual, repo-level context and an expanded, verified answer key with context-scope labels (Diff/File/Repo). Now we can test retrieval choices, agent strategies, and language effects with confidence.
Why It Works (intuition):
- More complete ground truth means fewer "false negatives" during scoring.
- Context-scope labels reveal whether a method can handle local vs. cross-file reasoning.
- Multilingual coverage exposes training biases and structural language differences that affect model behavior.
- Testing multiple retrieval modes (none, BM25, embeddings, agent) shows that "more context" isn't always "better context"; fit matters.
Building Blocks (each with a mini sandwich):
- Repository-level Context
- Hook: You can't judge a team's play by one screenshot; you need the whole game.
- What it is: Full access to project files, metadata, and dependencies during review.
- How it works: Index or retrieve related files, connect types/functions across files, interpret diffs with project knowledge.
- Why it matters: Cross-file bugs are common; without this, models miss real problems.
- Anchor: A new function calls another defined elsewhere; only repo context reveals a wrong parameter type.
- Context-Level Labels (Diff, File, Repo)
- Hook: Road signs say whether you need a city map, a regional map, or a country map.
- What it is: Each ground-truth comment is tagged with the smallest context needed to correctly make it.
- How it works: Experts decide whether the issue is visible from just the diff, the whole file, or across the repo.
- Why it matters: We can test whether models handle harder, cross-file reasoning.
- Anchor: A security issue that depends on a config file elsewhere is marked Repo-level.
- AI-Assisted, Human-Verified Ground Truth
- Hook: Metal detectors help treasure hunters find more coins, but humans confirm what's real.
- What it is: LLMs propose issues; experts verify and de-duplicate them.
- How it works: Multiple models generate comments; semantic de-dup filters repeats; two annotators check; a core team resolves conflicts.
- Why it matters: Greatly expands coverage while keeping accuracy high.
- Anchor: AACR-Bench increased issue coverage by 285% over raw PR comments.
- Multilingual Coverage (10 languages)
- Hook: A travel guide that only covers one country won't help you on a world tour.
- What it is: Balanced selection of 5 high-activity repositories per language.
- How it works: Filter PRs, ensure English descriptions, size limits, and meaningful inline reviews; stratify by domain and size.
- Why it matters: Different languages reveal different model strengths and weaknesses.
- Anchor: Models that do great on Python may stumble on C or Rust.
- Retrieval and Agent Strategies
- Hook: Sometimes a magnifying glass helps; sometimes it distracts.
- What it is: Compare No context, BM25, Embedding retrieval, and Agent (Claude Code) modes.
- How it works: Provide top-3 retrieved contexts (for BM25/Embedding) or let an agent decide and plan; then generate reviews and score them.
- Why it matters: The right context strategy depends on the model and the language.
- Anchor: Claude in Agent mode improves precision a lot but often lowers recall; GPT-5.2 prefers BM25 over Agent in this benchmark.
In short, the big "aha!" is that a fair, realistic test for automated code review must combine broad language coverage, full project context, and verified, context-labeled answers. With that, we finally see when, why, and for whom extra context helps or hurts.
03 Methodology
At a high level: Real PRs → Filter and augment → AI models generate more candidate comments → De-duplicate → Human experts verify and label (issue type + context scope) → Build benchmark → Evaluate models with different retrieval/agent settings → Score with precision/recall/F1.
Step-by-step (with mini sandwiches for key parts):
- Select Languages and Repositories
- Hook: If you only taste one flavor of ice cream, you can't judge the whole shop.
- What it is: Pick 10 mainstream languages; choose 5 very active repositories per language.
- How it works: Use the 2025 StackOverflow survey for the language list; filter GitHub repos ranked top-2,000 for new stars and closed PRs (Dec 2024 to Dec 2025); pick the top 5 by stars per language.
- Why it matters: Keeps data fresh, diverse, and realistic.
- Anchor: 50 repositories across C, C++, C#, Go, Java, JavaScript, Python, TypeScript, PHP, Rust.
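For illustration, the repository-selection step could look roughly like the sketch below. The metadata field names (`new_star_rank`, `closed_pr_rank`, `stars`) are assumptions, not the paper's actual schema.

```python
# Illustrative repository-selection filter; the metadata fields are assumptions.
from collections import defaultdict
from typing import Dict, List

def select_repositories(
    repo_metadata: List[Dict],
    languages: List[str],
    rank_cutoff: int = 2000,
    per_language: int = 5,
) -> Dict[str, List[Dict]]:
    """Keep repos ranked top-2,000 for new stars and closed PRs,
    then take the most-starred five per language."""
    by_language = defaultdict(list)
    for repo in repo_metadata:
        if repo["language"] not in languages:
            continue
        if repo["new_star_rank"] > rank_cutoff or repo["closed_pr_rank"] > rank_cutoff:
            continue
        by_language[repo["language"]].append(repo)
    return {
        lang: sorted(repos, key=lambda r: r["stars"], reverse=True)[:per_language]
        for lang, repos in by_language.items()
    }
```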
- Pull Request Collection and Filtering
- Hook: Not every comment in a group chat is useful; you keep the ones with real info.
- What it is: Gather PRs and review threads, then filter for quality.
- How it works: Collect 12,715 PRs with titles/descriptions, diffs, and inline comments. Apply rules: English text; ≤1,000 changed lines; language consistency with repo; ≥2 inline comments with at least one accepted; remove trivial or non-semantic changes; stratify by repository, problem domain, and change size.
- Why it matters: Ensures each PR is meaningful and reviewable.
- Anchor: Final core set: 200 PRs.
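The filtering rules can be read as a single predicate over each PR record. This is a minimal sketch under the stated criteria; the record fields and the crude English check are assumptions, not the paper's implementation.

```python
# Sketch of the PR quality filter; the PR record fields are assumptions.
from typing import Dict

def looks_english(text: str) -> bool:
    """Crude stand-in for a real language detector."""
    return all(ord(ch) < 128 for ch in text)

def keep_pull_request(pr: Dict, repo_language: str, max_changed_lines: int = 1000) -> bool:
    if not looks_english(pr["title"] + pr["description"]):
        return False                              # English title/description only
    if pr["changed_lines"] > max_changed_lines:
        return False                              # cap at 1,000 changed lines
    if pr["primary_language"] != repo_language:
        return False                              # change must match the repo's language
    inline = pr["inline_comments"]
    if len(inline) < 2 or not any(c["accepted"] for c in inline):
        return False                              # at least 2 inline comments, 1 accepted
    if pr["is_trivial_change"]:
        return False                              # drop formatting-only / non-semantic edits
    return True
```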
- Augment Human Review Threads (Deep Summaries)
- Hook: Turning a messy conversation into a clear takeaway makes it actionable.
- What it is: Use an LLM to condense multi-turn review threads into crisp, confirmed issue comments.
- How it works: Focus on the revision with the most inline comments; analyze diff + conversation; extract confirmed defects; produce "Augmented Review Comments."
- Why it matters: Raw threads are noisy; augmentation turns them into clean, checkable items.
- Anchor: Example: Identify that a function was left empty and should be removed or commented for intent.
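A rough sketch of how this augmentation step could be wired up is shown below. The prompt text and the `call_llm` helper are placeholders; the paper's actual prompts are not reproduced here.

```python
# Sketch of review-thread augmentation; `call_llm` and the prompt are placeholders.
from typing import Callable, Dict, List

def augment_review_threads(
    revisions: List[Dict],              # each: {"diff": str, "threads": List[List[str]]}
    call_llm: Callable[[str], str],
) -> List[str]:
    """Condense the most-commented revision's threads into confirmed issue comments."""
    # Focus on the revision with the most inline comments.
    target = max(revisions, key=lambda r: sum(len(t) for t in r["threads"]))
    augmented = []
    for thread in target["threads"]:
        prompt = (
            "Given this diff and review conversation, state the confirmed defect as one "
            "concise review comment, or answer NONE if no defect was confirmed.\n\n"
            f"DIFF:\n{target['diff']}\n\nCONVERSATION:\n" + "\n".join(thread)
        )
        summary = call_llm(prompt).strip()
        if summary and summary.upper() != "NONE":
            augmented.append(summary)
    return augmented
```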
- Generate More Candidate Comments with Multiple LLMs
- Hook: Several flashlights find more lost items than one.
- What it is: Use 6 different models and 2 frameworks to propose additional review comments.
- How it works: Models include Claude-4.5-Sonnet, GPT-5.2, Qwen3-Coder-480B-A35B-Instruct, DeepSeek-V3.2, GLM-4.7, Gemini-3-Pro. Run them via two systems (an internal review system and Claude Code agent). Merge results with the augmented human comments.
- Why it matters: Different models notice different issues, boosting coverage.
- Anchor: Many valid issues are found by only one model, so multi-model generation is key.
- Semantic De-duplication
- Hook: If three friends say the same thing differently, you still count it once.
- What it is: Remove near-duplicate comments that express the same concern.
- How it works: Group by repo/PR/file/diff hunk; use an LLM to compare pairs; keep one when semantically the same; repeat to stabilize decisions.
- Why it matters: Keeps the dataset tidy and avoids inflating counts.
- Anchor: Two comments both warn about a potential null pointer; keep one.
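A minimal sketch of this de-duplication pass, assuming the caller supplies a `same_concern` judge (for example, an LLM prompt that decides whether two comments raise the same issue):

```python
# Sketch of semantic de-duplication; `same_concern` is a caller-supplied judge,
# e.g. an LLM prompt that answers whether two comments raise the same issue.
from collections import defaultdict
from typing import Callable, Dict, List

def deduplicate(
    comments: List[Dict],
    same_concern: Callable[[str, str], bool],
) -> List[Dict]:
    groups = defaultdict(list)
    for c in comments:
        # Only compare comments on the same repo / PR / file / diff hunk.
        groups[(c["repo"], c["pr"], c["file"], c["hunk_id"])].append(c)

    kept: List[Dict] = []
    for group in groups.values():
        unique: List[Dict] = []
        for candidate in group:
            if not any(same_concern(candidate["text"], u["text"]) for u in unique):
                unique.append(candidate)          # first comment for each concern wins
        kept.extend(unique)
    return kept
```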
- Human Expert Verification and Labeling
- Hook: A referee team confirms the final score.
- What it is: 80+ senior engineers verify correctness, label issue types, and assign context scope.
- How it works: Double-blind annotation by two people per comment; disagreements resolved by a 6-person core team; label issue type (e.g., security, defect, performance, maintainability) and required context level (Diff, File, Repo).
- Why it matters: Balances breadth (AI proposals) with accuracy (human checks).
- Anchor: Final set: 1,505 verified review comments (391 from augmented original reviews and 1,114 from LLM+expert augmentation).
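One way to picture a verified ground-truth entry is as a small record carrying the issue type and the required context scope. The field names below are illustrative, not the released schema, and the example values are invented.

```python
# Illustrative shape of one verified ground-truth comment (field names assumed).
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class ContextScope(Enum):
    DIFF = "diff"    # visible from the changed lines alone
    FILE = "file"    # needs the whole file
    REPO = "repo"    # needs cross-file repository knowledge

@dataclass
class VerifiedComment:
    repo: str
    pr_number: int
    file: str
    line_range: Tuple[int, int]   # lines in the changed file the comment points at
    issue_type: str               # e.g. "security", "defect", "performance", "maintainability"
    scope: ContextScope
    text: str

example = VerifiedComment(
    repo="example/project", pr_number=123, file="server/handler.go",
    line_range=(42, 47), issue_type="defect", scope=ContextScope.REPO,
    text="Possible nil pointer dereference: the config loaded in another file may be absent.",
)
```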
- Context-Level Annotation
- Hook: Different microscopes for different magnifications.
- What it is: Tag each comment with the smallest context truly required: Diff, File, or Repo.
- How it works: Annotators judge whether the problem is visible from just the changed lines, needs the whole file, or depends on cross-file repo knowledge.
- Why it matters: Lets us test and compare models' local vs. global reasoning.
- Anchor: A performance issue visible in a loop (Diff) vs. a misused config defined elsewhere (Repo).
- Benchmark Structure and Scoring
- Hook: A fair race needs clear rules and finish lines.
- What it is: Each PR is an evaluation unit; models traverse its diff hunks and produce comments.
- How it works: Provide PR title/description to all methods; optionally provide retrieved code contexts (3 items for BM25/Embedding; agent decides autonomously). Score by matching generated comments to ground-truth comments on (a) overlapped line ranges and (b) semantic equivalence; report Precision, Recall, F1.
- Why it matters: Gives consistent, comparable numbers.
- Anchor: If a model flags "nil pointer risk" on the correct lines with the same reason, it scores.
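A minimal sketch of this matching and scoring logic, assuming a caller-supplied semantic-equivalence judge; it mirrors the two matching conditions (overlapping line ranges plus equivalent meaning) but is not the benchmark's exact scorer.

```python
# Sketch of precision/recall/F1 scoring with line-overlap plus semantic matching.
from typing import Callable, Dict, List, Tuple

def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def score(
    predictions: List[Dict],
    ground_truth: List[Dict],
    same_meaning: Callable[[str, str], bool],   # semantic-equivalence judge (e.g. an LLM)
) -> Dict[str, float]:
    matched = set()
    true_positives = 0
    for pred in predictions:
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            if (pred["file"] == gt["file"]
                    and overlaps(pred["lines"], gt["lines"])      # (a) line ranges overlap
                    and same_meaning(pred["text"], gt["text"])):  # (b) same underlying issue
                true_positives += 1
                matched.add(i)
                break
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The greedy one-to-one matching here is a simplification; the point is only that a prediction must hit both the right lines and the right meaning to count.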
- Retrieval and Agent Configurations
- Hook: Choosing the right tool for the job matters as much as the job itself.
- What it is: Compare four modesâNo context, BM25, Embedding (Qwen3-Embedding-8B), and Agent (Claude Code).
- How it works: No context: just diff + PR text. BM25/Embedding: retrieve top-3 code contexts. Agent: autonomously plan, navigate, and fetch whatever context it needs.
- Why it matters: The same model can behave very differently depending on how context is supplied.
- Anchor: Claude-4.5-Sonnet shines in Agent mode precision; GPT-5.2 prefers BM25 over Agent here.
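The four modes can be pictured as one context-building dispatch. The sketch below uses a crude keyword-overlap score as a stand-in for a real BM25 index and treats the embedding ranker and the agent as caller-supplied placeholders, so it illustrates the configuration differences rather than the benchmark's harness.

```python
# Sketch of the four context modes; the retrieval pieces are simplified stand-ins.
from collections import Counter
from typing import Callable, List, Optional

def keyword_score(query: str, doc: str) -> float:
    """Crude keyword-overlap score, standing in for a real BM25 index."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum(min(q[w], d[w]) for w in q if w in d))

def build_context(
    mode: str,
    hunk_text: str,
    repo_files: List[str],
    embed_rank: Optional[Callable[[str, List[str]], List[str]]] = None,
    agent_fetch: Optional[Callable[[str], str]] = None,
    top_k: int = 3,
) -> str:
    if mode == "none":                 # diff + PR title/description only
        return ""
    if mode == "bm25":                 # top-3 keyword-retrieved code snippets
        ranked = sorted(repo_files, key=lambda f: keyword_score(hunk_text, f), reverse=True)
        return "\n\n".join(ranked[:top_k])
    if mode == "embedding":            # top-3 snippets ranked by an embedding model
        return "\n\n".join(embed_rank(hunk_text, repo_files)[:top_k])
    if mode == "agent":                # the agent decides what to read on its own
        return agent_fetch(hunk_text)
    raise ValueError(f"unknown mode: {mode}")
```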
The secret sauce:
- Hybrid answer key: AI proposes; humans verify.
- Context labels: Reveal where reasoning breaks: at local diffs or across whole repos.
- Multilingual coverage: Surfaces training and structural biases across languages.
- Comparative context strategies: Proves that more context can help or hurt, depending on the setup.
Concrete examples in the dataset:
- Empty function implementation (C++): Suggest removing or documenting placeholders.
- Unnecessary shared_ptr copy (C++): Prefer const reference for performance and idiomatic style.
- Potential nil pointer dereference (Go): Add checks before accessing possibly nil fields.
- Silent failure on JSON parsing (TypeScript/Node): Throw or propagate errors instead of only logging.
- Console logging in production path (TypeScript): Use conditional or structured logging to avoid overhead.
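These examples live in Go, C++, and TypeScript in the dataset, but the underlying patterns translate across languages. As a language-neutral illustration (not taken from the benchmark), here is the "silent failure on parsing" anti-pattern and its fix, sketched in Python:

```python
import json
import logging

logger = logging.getLogger(__name__)

def load_settings_silently(raw: str) -> dict:
    """Anti-pattern: a parse failure is only logged, so callers get an empty
    config and the real error surfaces much later, somewhere else."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.error("failed to parse settings: %s", exc)
        return {}

def load_settings(raw: str) -> dict:
    """Fix: log for observability, then propagate so the caller must handle it."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logger.exception("failed to parse settings")
        raise
```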
Putting the steps together, AACR-Bench flows like a careful recipe: collect diverse ingredients (PRs), prepare them (filter), add seasoning (AI suggestions), taste-test (expert verification), label flavors (context scope), and then serve the dish to tasters (models) under different plating styles (retrieval/agent modes) to see which pairings work best.
04 Experiments & Results
The Test (what and why):
- Measure how well different LLM-based ACR methods find real issues across languages and context needs. Use standard metrics: Precision (how many found issues are correct), Recall (how many real issues are found), and F1 (balance of both). Also measure recall by context level (Diff/File/Repo) to see if models handle cross-file reasoning.
The Competition (who/what compared):
- Methods: No context, BM25 retrieval, Embedding retrieval (Qwen3-Embedding-8B), and Agent (Claude Code).
- Models: Claude-4.5-Sonnet, GPT-5.2, Qwen3-Coder-480B-A35B-Instruct, DeepSeek-V3.2, GLM-4.7.
- Data: 200 PRs, 1,505 verified review comments from 50 repos in 10 languages.
The Scoreboard (with context):
- Agent-based precision vs. recall: Claude-4.5-Sonnet in Agent mode reached about 39.90% precision but only ~10.10% recall and produced very few comments per patch (~0.08), like a careful student who only answers when sure. This is high-precision but low-coverage behavior. Other models in Agent mode often did worse (e.g., GPT-5.2 dropped to ~9.90% precision and ~2.99% recall), showing that not all models benefit from agent orchestration.
- Retrieval isn't always a boost: For Claude-4.5-Sonnet, simple BM25 or Embedding retrieval hurt F1 compared to No context (e.g., BM25 F1 ≈ 9.98 vs. No context F1 ≈ 14.46). Meanwhile, DeepSeek-V3.2 improved with BM25 (F1 ≈ 15.59), and Qwen-480B-Coder preferred Embeddings (F1 ≈ 14.36). There is no one-size-fits-all retrieval method.
- Context level matters: For non-agent methods, recall steadily fell as required context expanded: Diff > File > Repo. Example: Qwen-480B-Coder recall dipped from ~33.82% (Diff) to ~22.59% (File) to ~17.60% (Repo) in No context mode. In contrast, Agent setups sometimes did better on Repo-level issues than on Diff-level ones, suggesting agents can navigate complex context but may miss obvious local issues.
- Language differences are real: Performance varied widely by language. Claude-4.5-Sonnet (Agent) formed a top tier on Python, Java, Go, and C, but dipped on TypeScript, PHP, and Rust. GPT-5.2 in No context mode did great on C# (F1 ≈ 0.309) but poorly on C (≈ 0.085), hinting that language structure (e.g., strong typing, explicit namespaces) and training data richness impact results.
Make the numbers meaningful:
- Think of 39.90% precision as getting 4 out of 10 flags right, much better than 1 out of 10, but with only a few total flags raised, you'll still miss many real problems (low recall ≈ 10.10%).
- An F1 drop from No context to BM25 for Claude means the extra context acted like noise, not help, like reading too many unrelated footnotes during a test.
- DeepSeek's improvement with BM25 shows that, for some models, classic keyword-style retrieval still works better than embeddings.
Surprising findings:
- More context can hurt: Adding top-3 retrieved files sometimes diluted focus and lowered scores.
- Agents are picky: Agent benefits depended strongly on the base model; a great chat model didn't automatically become a great agent reviewer.
- Local vs. global trade-off: Agents sometimes excel at repo-level reasoning but underperform on simple diff-level catches, likely due to "context tunnel vision."
- Language structure matters: C#'s explicit types and namespaces may make it easier for models to reason; C's pointers and macros can conceal dependencies and confuse retrieval.
Overall, AACR-Bench reveals that ACR success depends on three knobs: the model, the language, and how you feed context (none/BM25/embeddings/agent). Tuning these wisely makes a bigger difference than just picking the biggest model.
05 Discussion & Limitations
Limitations (be specific):
- Ground truth is strong but not perfect: Real software is complex and subjective. Even with 80+ experts and multi-model proposals, some valid issues may remain unlabeled or debatable.
- Not all languages or domains: Ten popular languages and 50 repos are broad, but specialized stacks (embedded, DSLs) aren't covered.
- PR-size constraint: Capped at ≤1,000 changed lines to keep review realistic; super-large refactors are out of scope.
- Agent choice: Only one public agent framework (Claude Code) was benchmarked; other agents may behave differently.
- Retrieval hyperparameters: Top-3 contexts for similarity methods is a reasonable default, but other settings could change outcomes.
Required resources:
- Repo access and tooling: Need to clone repositories, compute diffs, and optionally index code for retrieval.
- Model access: Commercial APIs or powerful open-source models; embedding model for vector retrieval.
- Human time if extending: Expanding the benchmark or re-annotating requires expert reviewers.
When NOT to use:
- Ultra-large monorepos or massive PRs (over the size cap) where context volume overwhelms current retrieval or agent strategies.
- Niche languages or frameworks not represented; results may not transfer.
- High-stakes security reviews where specialized static/dynamic analyzers are required; ACR findings should complement, not replace, formal tools.
Open questions:
- Adaptive context: How can a system decide, per hunk, whether to stay local or explore repo-wide context?
- Unifying local and global: Can we blend precise diff-level checks with robust cross-file reasoning without adding noise?
- Precision-recall balance: What training or orchestration best raises recall while maintaining high precision?
- Language-aware retrieval: How do we tailor retrieval for pointer-heavy C vs. decorator-rich Python vs. trait-based Rust?
- Better matching metrics: Can we score semantic matches more robustly across paraphrases and line shifts without over-crediting vague comments?
06 Conclusion & Future Work
Three-sentence summary: AACR-Bench is a multilingual, repository-level benchmark for automated code review that uses an AI-assisted, human-verified pipeline to create a richer, more accurate ground truth. It labels each comment by the true context needed (Diff/File/Repo) and evaluates multiple retrieval and agent strategies, revealing that context level, retrieval choice, model, and language all interact in complex ways. The results show that more context isn't always better, agents are precise but can be narrow, and language structure strongly affects performance.
Main achievement: It establishes a realistic, high-coverage, and context-aware standard for evaluating LLM-based code review systems across 10 programming languages and full repositories.
Future directions:
- Build adaptive context systems that decide when to zoom in (diff) or zoom out (repo).
- Design hybrid pipelines that keep agent-level precision while lifting recall.
- Develop language-specific retrieval recipes and noise-robust reasoning.
- Expand dataset scale and refine ground truth with semi-automated methods.
Why remember this: AACR-Bench changes how we test code review AIs: it measures what truly matters in real development (cross-file reasoning, language diversity, and reliable, verified answers) so teams can pick the right model and the right context strategy with confidence.
Practical Applications
- Compare different ACR tools on your tech stack using AACR-Bench's multilingual, repo-level scenarios before buying or integrating.
- Choose the right retrieval strategy (No context, BM25, Embedding, Agent) per language to boost performance without adding noise.
- Set CI policies that accept only high-precision agent comments for blocking, while logging broader non-agent findings for triage.
- Train or fine-tune your in-house ACR model using AACR-Bench-style context labels to improve cross-file reasoning.
- Create language-specific playbooks (e.g., different configs for C vs. Python) based on the benchmark's language-wise insights.
- Diagnose failure modes (local vs. repo-level misses) by analyzing performance across Diff/File/Repo labels.
- Use the dataset as a curriculum for onboarding engineers, demonstrating real review issues and their required context.
- Benchmark agent orchestration tweaks (e.g., retrieval depth, validation prompts) to balance precision and recall.
- Stress-test embedding models and indexes by measuring whether top-3 retrievals help or hinder for your codebase.
- Adopt a hybrid pipeline (agent gate + retrieval scan) to reduce false positives while improving coverage.