Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases
Key Summary
- This paper introduces the Confucius Code Agent (CCA), a coding helper built to handle huge real-world codebases with long tasks and many tools.
- It separates three views: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). That way the AI thinks clearly, people can see what it's doing, and builders can easily improve it.
- CCA uses smart context management with hierarchical memory and context compression so it remembers important steps without overflowing the model's limits.
- A built-in note-taking system writes reusable, human-readable Markdown notes, including lessons from failures, so the agent gets better across sessions.
- Modular extensions act like safe, pluggable tools for searching files, editing code, and running commands, making behavior reliable and easy to audit.
- A meta-agent automatically builds, tests, and improves the agent's prompts and tool rules, speeding up development and raising reliability.
- On the tough SWE-Bench-Pro benchmark, CCA reaches a Resolve@1 of 59%, beating prior research scaffolds and reported commercial setups under the same conditions.
- Ablations show that both advanced context management and meta-learned tool use are big drivers of the gains, not just the backbone model.
- Persistent notes reduce tokens and turns on second runs while slightly improving success, showing true cross-session learning.
- Overall, the paper demonstrates that smart scaffolding around the model can matter as much as the model itself for real-world software engineering.
Why This Research Matters
Big software powers everyday life, from phone apps to school websites and hospitals. Fixing and improving that software faster and more safely helps everyone. This work shows that carefully organizing how an AI thinks, remembers, and uses tools can beat simply buying a bigger model. The approach also makes agents more understandable for people and more controllable for builders, which builds trust. Over time, reusable notes and smarter tools mean cheaper runs and fewer mistakes. That's a path toward AI teammates that truly help professional developers deliver reliable software at scale.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how cleaning your whole room is harder than just picking up one toy, because there are many places to look and lots of steps to keep track of? Big software projects are like giant rooms with thousands of toys.
🥬 Filling (The Actual Concept): Coding agents are AI helpers that read, write, and run code to fix bugs or add features.
- What it is: A coding agent is an AI that can search files, edit code, run tests, and decide what to try next.
- How it works: 1) Read the problem. 2) Find the right files. 3) Propose code edits. 4) Run commands/tests. 5) Learn from results. 6) Repeat until tests pass.
- Why it matters: Without an agent, a human has to do all steps by hand and may forget details or repeat mistakes on huge codebases.
🍞 Bottom Bread (Anchor): Imagine asking an AI to fix a bug in a library like PyTorch; it has to hunt across many files, change a few lines, run tests, and try again until green.
🍞 Top Bread (Hook): Imagine trying to follow a very long comic book story where the clues are spread across many issues; you can forget what happened earlier.
🥬 Filling (The Actual Concept): Long-context reasoning means keeping track of lots of scattered information over many steps.
- What it is: The agent must connect details from many files, tool outputs, and past choices over a long time.
- How it works: It collects clues, summarizes the key parts, and reuses them later as it plans the next move.
- Why it matters: Without it, the agent loses track, repeats work, or breaks earlier decisions.
🍞 Bottom Bread (Anchor): To fix a bug that touches ten files, the agent needs to remember earlier edits when changing the last file so everything still fits together.
🍞 Top Bread (Hook): Think about studying for a math test: if you don't keep notes, you might re-learn the same rule over and over.
🥬 Filling (The Actual Concept): Long-term memory lets an agent remember lessons across different tasks and days.
- What it is: A place to store patterns, known errors, and fixes so the agent improves over time.
- How it works: After a session, the agent writes down what worked, what failed, and why. Next time a similar error appears, it fetches the note and acts faster.
- Why it matters: Without it, the agent repeats old mistakes and wastes tokens and time.
🍞 Bottom Bread (Anchor): If the agent once learned that a certain error message needs escaping asterisks, it can instantly apply the same fix months later.
🍞 Top Bread (Hook): Picture three friends planning a trip: one focuses on packing smart, one keeps everyone updated, and one handles the bookings.
🥬 Filling (The Actual Concept): The paper argues success needs a system view with three separate priorities: Agent Experience (AX), User Experience (UX), and Developer Experience (DX).
- What it is: AX is the agent's clean thinking space, UX is what humans see and control, and DX is how builders observe and improve the agent.
- How it works: Keep the agent's inputs structured and compact (AX), show humans rich but readable logs (UX), and give devs modular, testable parts (DX).
- Why it matters: Without this separation, prompts get messy, logs confuse the model, and developers can't easily swap or test parts.
🍞 Bottom Bread (Anchor): Users watch clear diffs and progress, the agent sees only the essential summary, and developers can change the file-edit tool without rewriting the whole system.
The world before: Early LLMs wrote short code snippets, then got better at completing functions, and now attempt full real-world fixes. But big repositories overwhelm flat chat histories and simple tool scripts.
The problem: Agents hit two hard walls: (C1) long-context reasoning and (C2) long-term memory.
Failed attempts: Just making context windows bigger or throwing in more heuristic prompts isn't enough; important details are dropped or logs bloat the prompt.
The gap: A principled scaffold (how the agent organizes context, remembers across sessions, and uses tools) was missing.
Real stakes: Faster bug fixes and safer changes matter for your favorite apps, school websites, and even hospitals' software; getting this right saves time, money, and headaches for everyone.
02 Core Idea
🍞 Top Bread (Hook): Imagine building a treehouse that stays sturdy because you separate the blueprint, the toolbox, and the daily to-do list.
🥬 Filling (The Actual Concept): The key insight in one sentence: Separating Agent Experience, User Experience, and Developer Experience, and backing them with structured memory, modular tools, and an auto-improving meta-agent, makes coding agents reliably scale to real codebases.
- How it works: 1) Keep AX clean with hierarchical memory and context compression. 2) Give UX clear, human-friendly traces and notes. 3) Power DX with modular extensions and a meta-agent that builds, tests, and improves the agent. 4) Reuse persistent notes across sessions for compounding gains.
- Why it matters: Without this split, the model reads human logs like noise, tools get tangled, and teams can't iterate fast.
🍞 Bottom Bread (Anchor): The agent sees a tidy summary of goals and errors; people see full diffs; developers quickly swap in a safer file-edit tool and retest, all without breaking each other's workflows.
🍞 Top Bread (Hook): You know how three different maps (bird's-eye, street view, and step-by-step directions) help you navigate a city better?
🥬 Filling (The Actual Concept): AX/UX/DX are three maps for the same journey.
- What it is: AX is the model's map, UX is the traveler's view, DX is the engineer's dashboard.
- How it works: Each map shows only what its user needs, with bridges between them.
- Why it matters: Mixing them blurs details and causes wrong turns.
🍞 Bottom Bread (Anchor): Instead of dumping the entire travel blog into the GPS, you give the GPS a clean route, keep the blog for readers, and let mechanics tune the engine.
🍞 Top Bread (Hook): Imagine compressing a long movie into a tight trailer that still tells the story.
🥬 Filling (The Actual Concept): Context management keeps long reasoning on track.
- What it is: A hierarchical working memory plus an Architect Agent that compresses history into structured summaries when it gets too long.
- How it works: Detect when the context is near limits, summarize goals, decisions, errors, and TODOs, replace old spans with the summary, and keep a small recent window raw.
- Why it matters: Without it, the agent forgets early choices or overflows the context.
🍞 Bottom Bread (Anchor): A 50-step debugging session still remembers the early design decision because it's preserved in a compact "plan + errors + TODOs" block.
🍞 Top Bread (Hook): Think of a science notebook that records not just answers but failed experiments and why they failed.
🥬 Filling (The Actual Concept): Note-taking creates durable, human-readable memory across sessions.
- What it is: An async note-taking agent distills full runs into Markdown notes, including hindsight on failures.
- How it works: After each session, it writes tagged notes into a folder tree, so future runs can retrieve exact fixes.
- Why it matters: Without it, the same error wastes time again.
🍞 Bottom Bread (Anchor): A future bug with the same stack trace instantly pulls up the note with the best-known fix.
🍞 Top Bread (Hook): Picture a plug-and-play robot arm that can swap tools without rebuilding the robot.
🥬 Filling (The Actual Concept): Extensions modularize tool use, parsing, and prompt shaping.
- What it is: Typed callbacks that parse model outputs, run tools (like bash and file edit), and summarize results into memory.
- How it works: The orchestrator stays tiny; extensions handle capabilities; developers enable/disable or refine them independently.
- Why it matters: Without modularity, small tool bugs ripple through the whole agent.
🍞 Bottom Bread (Anchor): Upgrading the file editor to safer matching rules improves reliability everywhere with no loop changes.
🍞 Top Bread (Hook): Imagine a coach who watches the game tape, updates the playbook, and trains the team before the next match.
🥬 Filling (The Actual Concept): The Meta-agent automates a build-test-improve loop.
- What it is: An agent that constructs, evaluates, and refines other agents' prompts, tool wiring, and guardrails.
- How it works: Generate config → run tests → spot failures → patch prompts/tools → retest until stable.
- Why it matters: Without it, improving agents is slow, manual, and brittle.
🍞 Bottom Bread (Anchor): A confusing file-match error is turned into an actionable message that forces safe edits, because the meta-agent saw failures and rewrote the prompt accordingly.
Before vs After:
- Before: Flat chat logs, brittle prompts, and tool spaghetti made agents stall on big repos.
- After: Compact plans, persistent notes, and modular tools let the same backbone model punch above its weight.
Why it works (intuition): Structure filters noise. By feeding the model only what it needs (AX), showing humans rich but separate traces (UX), and giving builders legos (DX), the system protects reasoning and speeds iteration. Notes and compression keep the long story coherent; the meta-agent keeps the toolkit sharp.
Building blocks: AX (hierarchical memory + compression), UX (transparent logs + artifacts), DX (extensions + meta-agent), all tied by a simple orchestrator.
03 Methodology
At a high level: Input (repo + issue) → Orchestrator loop → Extensions execute actions → Memory updated (working + notes) → Context compressed when needed → Patch produced and validated.
🍞 Top Bread (Hook): Think of a careful chef who tastes, adjusts seasoning, and tries again until the dish is perfect.
🥬 Filling (The Actual Concept): The Confucius Orchestrator is a tiny loop that reads the model's plan, runs tools, saves results, and repeats.
- What it is: A reusable control loop that calls the LLM, parses actions, routes them to extensions, and logs outcomes.
- How it works: 1) Initialize memory and extensions. 2) Prompt the LLM. 3) Parse tool calls (native or XML tags). 4) Execute via extensions (bash, file edit, search). 5) Store observations. 6) Check stop conditions or continue. (A code sketch of this loop follows this block.)
- Why it matters: Without a clean loop, complex behavior becomes untestable and unsafe.
🍞 Bottom Bread (Anchor): The LLM says "open file X and replace lines Y–Z," the file-edit extension applies a diff, the bash tool runs tests, results go into memory, and the loop repeats.
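To make that loop concrete, here is a minimal Python sketch. Everything in it is illustrative: `call_llm`, `parse_actions`, the `Action` shape, and the extension registry are assumptions standing in for the paper's actual interfaces, not CCA's real API.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str   # e.g., "bash", "file_edit", "search" (illustrative names)
    args: dict

@dataclass
class Memory:
    observations: list = field(default_factory=list)

    def add(self, entry: str) -> None:
        self.observations.append(entry)

def call_llm(memory: Memory) -> str:
    """Placeholder for the backbone model call (prompt built from memory)."""
    raise NotImplementedError

def parse_actions(llm_output: str) -> list[Action]:
    """Placeholder for parsing native tool calls or XML-tagged actions."""
    raise NotImplementedError

def run_orchestrator(extensions: dict, max_turns: int = 50) -> Memory:
    memory = Memory()                               # 1) initialize memory
    for _ in range(max_turns):
        output = call_llm(memory)                   # 2) prompt the LLM
        actions = parse_actions(output)             # 3) parse tool calls
        if not actions:                             # 6) stop: nothing left to do
            break
        for action in actions:
            handler = extensions[action.tool]       # 4) route to an extension
            result = handler(action.args)           # run bash / edit / search
            memory.add(f"{action.tool}: {result}")  # 5) store the observation
    return memory
```

The point of the sketch is the shape: the loop itself stays tiny and testable, and every capability lives behind the `extensions` lookup.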
🍞 Top Bread (Hook): Imagine a backpack with labeled pockets so you can always find your pencil, snacks, and notebook.
🥬 Filling (The Actual Concept): Hierarchical working memory organizes the agent's short-term thoughts.
- What it is: A tree of Markdown nodes (e.g., analysis.md, todo.md) with tags and scopes.
- How it works: The agent writes and searches these nodes during runs; important insights survive truncation. (See the sketch after this block.)
- Why it matters: Without structure, you either overflow context or lose key decisions.
🍞 Bottom Bread (Anchor): A "qutebrowser_process_cleanup" folder keeps analysis, summary, and TODOs, so later steps fetch exactly what they need.
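Here is a hedged sketch of how such a memory tree could be stored on disk, one Markdown node per file with a tag line up top. The folder layout, front-matter format, and tag scheme are my assumptions for illustration, not the paper's exact format.

```python
from pathlib import Path

class WorkingMemory:
    """Hierarchical working memory as a tree of tagged Markdown nodes."""

    def __init__(self, root: Path):
        self.root = root

    def write(self, scope: str, name: str, text: str, tags: list[str]) -> None:
        # e.g., scope="qutebrowser_process_cleanup", name="todo.md"
        node = self.root / scope / name
        node.parent.mkdir(parents=True, exist_ok=True)
        node.write_text(f"tags: {', '.join(tags)}\n\n{text}")

    def search(self, tag: str) -> list[Path]:
        # Find every node whose first (tag) line mentions the tag.
        return [p for p in self.root.rglob("*.md")
                if tag in p.read_text().splitlines()[0]]

# Usage: later steps fetch exactly the node they need.
mem = WorkingMemory(Path("agent_memory_demo"))
mem.write("qutebrowser_process_cleanup", "analysis.md",
          "Child processes leak when tabs close early.",
          tags=["bug", "process-lifecycle"])
print(mem.search("process-lifecycle"))
```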
🍞 Top Bread (Hook): Think of turning a long class discussion into a one-page study guide before exams.
🥬 Filling (The Actual Concept): Context compression triggers when history gets long.
- What it is: An Architect Agent summarizes goals, decisions, errors, and TODOs.
- How it works: When near limits, it replaces big chunks with a compact plan while keeping recent raw turns (sketched in code below).
- Why it matters: Without it, the model either forgets or can't fit all steps.
🍞 Bottom Bread (Anchor): After 30 tool calls and several diffs, the agent still remembers the agreed strategy and open TODOs because they're in the compressed summary.
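Below is a minimal sketch of that trigger-and-replace behavior. The rough token estimate, the budget and window sizes, and the `summarize` stand-in for the Architect Agent are all made-up illustrations, not the paper's actual thresholds.

```python
def estimate_tokens(turns: list[str]) -> int:
    # Rough chars-to-tokens heuristic; a real system would use a tokenizer.
    return sum(len(t) // 4 for t in turns)

def summarize(turns: list[str]) -> str:
    """Stand-in for the Architect Agent: distill goals, decisions,
    errors, and TODOs from old turns into one structured block."""
    return f"## Plan + errors + TODOs\n(compressed from {len(turns)} turns)"

def maybe_compress(turns: list[str], budget: int = 100_000,
                   keep_recent: int = 8) -> list[str]:
    if estimate_tokens(turns) < budget:
        return turns                          # far from the limit: keep raw
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent          # compact plan + raw recent window
```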
🍞 Top Bread (Hook): Picture a diary that captures what worked and what didn't so future you avoids the same mistakes.
🥬 Filling (The Actual Concept): The note-taking agent writes persistent, tagged Markdown after runs.
- What it is: Cross-session, human-readable notes with success paths and hindsight on failures.
- How it works: It distills the trajectory into reusable patterns and indexes by messages, traces, and components. (A small sketch follows this block.)
- Why it matters: Without notes, second attempts cost the same tokens and time as the first.
🍞 Bottom Bread (Anchor): Next time an asterisk-as-wildcard bug appears, the agent finds the exact note and applies the fix immediately.
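A small sketch of what the post-run note writer could look like. The folder layout, filename scheme, and note fields here are hypothetical; only the idea (persistent Markdown notes with hindsight on failures) comes from the paper.

```python
from datetime import date
from pathlib import Path

def write_session_note(notes_root: Path, component: str,
                       error_signature: str, fix: str, hindsight: str) -> Path:
    """Distill one run into a persistent, human-readable Markdown note."""
    slug = error_signature[:40].replace(" ", "_").replace("/", "-")
    note = notes_root / component / f"{date.today()}_{slug}.md"
    note.parent.mkdir(parents=True, exist_ok=True)
    note.write_text(
        f"# {error_signature}\n\n"
        f"## Fix that worked\n{fix}\n\n"
        f"## Hindsight (what failed and why)\n{hindsight}\n"
    )
    return note

# A later session that hits the same signature can search this note tree
# and apply the recorded fix instead of rediscovering it from scratch.
```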
🍞 Top Bread (Hook): Think of snap-on tools for a power drill: you attach what you need and keep the handle the same.
🥬 Filling (The Actual Concept): Extensions provide modular tool use, parsing, and prompt shaping.
- What it is: Typed callbacks like on_llm_output and on_input_messages that parse actions, run tools, and summarize outcomes.
- How it works: The orchestrator stays stable; developers swap extensions (file search/edit, CLI) or tweak rules safely (see the sketch below).
- Why it matters: Without modularity, new tools can break everything.
🍞 Bottom Bread (Anchor): Replacing naive file edits with a safer matcher (and actionable failure prompts) cuts retries and raises pass rates.
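Here is one way the typed-callback idea could look in code. The hook names `on_input_messages` and `on_llm_output` come from the text, but their signatures and the `<edit>` action format are my assumptions.

```python
from typing import Protocol

class Extension(Protocol):
    """Typed hook points the orchestrator invokes at fixed places."""

    def on_input_messages(self, messages: list[str]) -> list[str]:
        """Shape the prompt before the LLM sees it."""
        ...

    def on_llm_output(self, output: str) -> str | None:
        """Parse the model's output; return a tool observation, or None."""
        ...

class FileEditExtension:
    def on_input_messages(self, messages: list[str]) -> list[str]:
        # Prompt shaping: inject the safe-editing rule.
        return messages + ["Rule: edits must match existing lines exactly."]

    def on_llm_output(self, output: str) -> str | None:
        if "<edit>" not in output:
            return None     # not our action; another extension may parse it
        # A safer matcher verifies the target lines before applying, and on
        # a mismatch returns an actionable message instead of a vague error,
        # which is the upgrade the text credits with cutting retries.
        return "edit applied (or an actionable mismatch message)"
```

Because the orchestrator only knows the hook interface, swapping `FileEditExtension` for a stricter variant requires no changes to the loop itself.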
🍞 Top Bread (Hook): Imagine a sports coach who keeps refining plays after every scrimmage.
🥬 Filling (The Actual Concept): The Meta-agent runs a build-test-improve loop to tune the agent itself.
- What it is: An agent that designs configs, wires tools, sets prompts, tests on tasks, and patches weaknesses.
- How it works: Synthesize agent → run on a test set → detect brittle patterns → rewrite prompts/tool configs → rerun; repeat until stable. (Shown in code after this block.)
- Why it matters: Without automated iteration, improving complex scaffolds is slow and ad hoc.
🍞 Bottom Bread (Anchor): The meta-agent upgraded a vague file-edit error into a precise instruction that forces safe, exact-line fixes, reducing derailments.
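The build-test-improve loop, written out as a sketch. All helper names here (`synthesize_config`, `run_eval`, `patch_config`) are hypothetical stubs I invented to make the control flow concrete; they stand in for the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    pass_rate: float
    failures: list   # failing trajectories to mine for brittle patterns

def synthesize_config() -> dict:
    """Stub: generate an initial agent config (prompts, tool wiring)."""
    return {"prompt": "v0", "tools": ["bash", "file_edit", "search"]}

def run_eval(config: dict, task_set: list) -> EvalResult:
    """Stub: run the candidate agent on a held-out task set."""
    return EvalResult(pass_rate=0.0, failures=list(task_set))

def patch_config(config: dict, failures: list) -> dict:
    """Stub: rewrite the prompts/tool rules that the failures implicate."""
    return {**config, "prompt": config["prompt"] + "+patch"}

def meta_improve(task_set: list, max_rounds: int = 10,
                 target: float = 0.9) -> dict:
    config = synthesize_config()                        # build
    for _ in range(max_rounds):
        result = run_eval(config, task_set)             # test
        if result.pass_rate >= target:
            break                                       # stable: stop
        config = patch_config(config, result.failures)  # improve, then retest
    return config
```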
Secret sauce:
- Separation of concerns (AX/UX/DX) protects reasoning quality while keeping humans informed and developers agile.
- Structured memory and triggered compression keep long-horizon plans intact.
- Meta-learned tool-use policies fix the "last mile" errors that otherwise sink success on real repos.
Example flow with actual data:
- Input: "Fix function F so archiveDataType can be null."
- Steps: search TypeRefs.ts → edit types to allow null → edit BlobAccessTokenFacade.ts signatures → edit EntityRestClient.ts call site → remove unused imports → run reproduction script → commit if green.
- What breaks without steps: If you skip type edits, the compiler fails; if you forget the call site change, tests fail; if you don't run the script, regressions slip through.
04 Experiments & Results
🍞 Top Bread (Hook): You know how a spelling bee proves who can spell best under pressure? Benchmarks do that for coding agents.
🥬 Filling (The Actual Concept): SWE-Bench-Pro is a tough test with real GitHub issues where an agent wins by making a patch that passes all tests.
- What it is: A benchmark of 700+ real tasks from real repos, designed for long, multi-file fixes.
- How it works: The agent edits the repo; success means the project's own tests pass with no human help.
- Why it matters: Without realistic tests, we might reward agents that only look smart but canāt ship working code.
🍞 Bottom Bread (Anchor): It's like fixing a real bug in a famous library; only green tests count as a win.
🍞 Top Bread (Hook): Imagine a report card that shows not just your grade but also how your classmates did on the same exam.
🥬 Filling (The Actual Concept): Resolve Rate (Pass@1) is the percentage of tasks fixed on the first try.
- What it is: A clean metric: higher is better, like a higher batting average.
- How it works: Run each task once per seed; count how many passes; average across runs. (See the tiny sketch after this block.)
- Why it matters: Without a strict metric, we can't compare fairly.
🍞 Bottom Bread (Anchor): Scoring 59% means CCA gets an A when many others are around a B.
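The metric in code form, exactly as described above: one attempt per task per seeded run, count passes, then average across runs. The example data is made up for illustration.

```python
def pass_at_1(runs: list[list[bool]]) -> float:
    """Each inner list is one seeded run; True means the task's own test
    suite passed on the first (and only) attempt."""
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)

# Two seeds over five tasks: 3/5 and 2/5 resolved -> 0.5, i.e., 50% Pass@1.
print(pass_at_1([[True, True, False, True, False],
                 [True, False, False, True, False]]))
```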
Main results on SWE-Bench-Pro (public split):
- Same environments, same backbones, different scaffolds.
- CCA beats SWE-Agent across models; with Claude 4.5 Sonnet it hits 52.7%, above Live-SWE-Agent's 45.8%.
- With GPT-5.2, CCA reaches 59.0%, surpassing the reported OpenAI scaffold result and posting the best number under identical conditions.
- With Claude 4.5 Opus, CCA hits 54.3%, above Anthropic's reported number.
Ablations that make numbers meaningful:
- Context management on/off: On Claude 4 Sonnet, advanced context management lifts a subset from 42.0% to 48.6% (like turning a D+ into a solid C+ on hard problems).
- Tool-use sophistication: Disabling the meta-learned tool rules drops performance notably even when memory is strong, showing both parts matter.
Surprising and useful findings:
- Strong scaffolding + mid-tier model can beat a stronger model with weaker scaffolding (e.g., Claude 4.5 Sonnet + CCA > Claude 4.5 Opus + proprietary scaffold).
- Persistent notes help: A second run with notes used fewer tokens (about 11k fewer), fewer turns (about 3 fewer), and a slightly higher resolve rate (+1.4 points), proving practical cross-session learning.
- Robustness vs. number of edited files: Performance degrades only moderately as edits increase; structured memory helps multi-file refactors.
Other evaluations:
- SWE-Bench-Verified: With Claude 4 Sonnet, CCA hits 74.6%, above leading open-source scaffolds under the same backbone, and above a mini-SWE-Agent even when that variant used a stronger backbone.
Takeaway with context: The big jumps don't come from swapping in a fancier LLM; they come from better orchestration, memory, and tool rules that let the same LLM do more, more reliably.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best backpack can't carry everything, and even the best map can miss a street.
🥬 Filling (The Actual Concept): Limitations, resources, and open questions help us use CCA wisely.
- What it is: An honest look at where CCA shines and where it needs care.
- How it works: Identify constraints (compute, model APIs), tricky scenarios (non-deterministic builds), and research gaps (full RL integration, richer verification).
- Why it matters: Knowing edges prevents misuse and guides the next improvements.
🍞 Bottom Bread (Anchor): You wouldn't take a scooter on a rocky mountain trail; similarly, don't throw CCA at flaky environments without guardrails.
Limitations:
- Dependent on the underlying LLM's reasoning and tool-use compliance; weaker models may yield weaker summaries and edits.
- Very flaky or non-reproducible repos (e.g., networked tests, hidden secrets) can mislead the loop.
- Notes quality depends on summarization fidelity; poor hindsight notes help less.
- Multi-language or polyglot repos beyond the current tools may require extra extensions and tuning.
Required resources:
- Access to long-context, tool-capable LLMs; containerized execution with file and shell tools.
- Storage for hierarchical working memory and note repositories.
- CI-like runners for testing patches at scale; observability UI for traces and evals.
When NOT to use:
- One-shot toy tasks where a single prompt or code-completion suffices.
- Locked-down environments with no safe way to run tests or apply diffs.
- Highly stateful systems where reliable reproduction isn't possible.
Open questions:
- How far can reinforcement learning push agent policies given Confucius-style trajectories and rewards?
- What's the best way to measure long-term memory quality across sessions and repos?
- Can we auto-detect when to switch tool strategies (e.g., single-file vs. multi-file refactor modes)?
- How to guarantee safety and rollback across complex edits while preserving speed?
- Can meta-agents co-design new tools autonomously while proving gains in controlled A/Bs?
06 Conclusion & Future Work
Three-sentence summary: CCA shows that the structure around an LLM (its memory, tools, and development workflow) can be as important as the LLM itself for real-world software engineering. By separating Agent Experience, User Experience, and Developer Experience, and adding hierarchical memory, context compression, persistent notes, modular extensions, and a meta-agent, CCA scales to large codebases with clarity and control. Across benchmarks like SWE-Bench-Pro, these choices lift Resolve@1 to 59% under identical conditions, surpassing prior research and reported commercial setups.
Main achievement: Turning agent scaffolding into a principled, modular system that reliably boosts long-horizon reasoning and tool robustness, demonstrating that orchestration and memory design can beat simply swapping in a bigger model.
Future directions: Integrate RL on rich trajectories, design stronger verification and rollback tools for multi-file edits, expand cross-repo memory sharing while preserving privacy, and explore adaptive budgets that tune thinking and compression on the fly.
Why remember this: It reframes progress in coding agents from "get a bigger model" to "build a better brain around the model," showing that careful structure (like study notes, clean summaries, and swappable tools) turns raw intelligence into dependable engineering power.
Practical Applications
- Automated bug fixing on large repositories with safe diffs and test verification.
- Refactoring multi-file features while preserving earlier design decisions.
- Creating a team knowledge base of past failures and fixes that the agent reuses.
- Rapidly adapting an agent to a new toolchain or CI environment via extensions.
- Onboarding new projects by having the meta-agent synthesize prompts and wiring.
- Reducing developer toil by running reproduced failures and proposing minimal patches.
- Stabilizing long debugging sessions through structured context compression.
- Auditing agent behavior with readable traces and artifact previews for code review.
- Benchmarking and A/B testing agent configurations to pick the best scaffold.
- Teaching junior developers with human-readable notes that explain common error patterns.