MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences
Key Summary
- •MemGovern teaches code agents to learn from past human fixes on GitHub by turning messy discussions into clean, reusable 'experience cards.'
- •It builds a big, organized memory (135,000 cards) that agents can search like a library and then read deeply when a card looks promising.
- •The memory has two parts: an Index Layer for finding similar problems and a Resolution Layer for explaining how people actually fixed them.
- •Instead of stuffing everything into the agent at once (like basic RAG), MemGovern lets the agent first search widely and then browse selectively.
- •This governed memory boosts bug-fixing on the SWE-bench Verified benchmark by an average of 4.65% across many different language models.
- •The improvements are strongest for weaker models, showing that governed human experience can level up a broad range of agents.
- •Bigger, better-quality memory helps more: performance rises as more experience cards are added and stays stable thanks to quality controls.
- •A checklist-based quality gate and a refine loop keep the cards accurate and useful, cutting noise from social chatter and irrelevant logs.
- •MemGovern is plug-and-play with existing agents (like SWE-Agent), adding a practical 'memory infrastructure' to real debugging workflows.
- •Main trade-off: a bit more token usage for searching, but the accuracy gains generally make the cost worthwhile.
Why This Research Matters
Software we use every day—games, school apps, banking tools—breaks less and gets fixed faster when developers and agents can reuse proven solutions. MemGovern turns scattered GitHub conversations into clean, reusable experience cards, so agents can find the right idea quickly and apply it safely. This saves developer time, lowers engineering costs, and reduces risky patches that cause new bugs. By helping weaker models improve the most, MemGovern makes AI debugging more accessible across teams and budgets. Over time, a governed memory also becomes a living knowledge base that organizations can maintain and trust. The result is a practical path to more reliable, maintainable software for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re trying to fix a tricky Lego model. You could start from scratch every time, or you could look up how other kids solved the same problem before. Most of us would peek at those past solutions—because they save time and prevent mistakes.
🥬 The Concept (Large Language Models and Code Agents):
- What it is: Large Language Models (LLMs) are smart text tools, and when we give them the right computer tools, they become code agents that can read code, run tests, and try to fix bugs.
- How it works: (1) The agent reads the issue description, (2) searches the code, (3) tries a fix, and (4) runs tests to check if the bug is gone.
- Why it matters: Without help, these agents often act like students who refuse to check the answer key. They only look at the local code and miss the huge treasure trove of past human fixes on GitHub.
🍞 Anchor: Just like you’d search online for how others fixed a flat Lego wheel, developers search GitHub to see how someone else solved the same crash last year. That’s normal human practice—and agents should do it too.
🍞 Hook: You know how group chats can get noisy—lots of greetings, jokes, and side talk? GitHub issue threads are like that: plenty of useful clues mixed with a lot of chatter.
🥬 The Concept (Raw GitHub Experience):
- What it is: GitHub is full of real bug reports, discussions, and fixes that hide expert reasoning, but it’s messy and inconsistent.
- How it works: People open issues, discuss symptoms, link pull requests (PRs), and merge patches. The gold is there—but buried under social messages, long logs, and project-specific details.
- Why it matters: If an agent reads this raw data directly, it gets confused. It can’t easily find the root cause or the repair logic, and cross-project reuse becomes very hard.
🍞 Anchor: It’s like trying to learn a recipe from a 200-message chat where the actual steps are scattered between emojis and small talk.
🍞 Hook: Imagine two kinds of worlds. In a closed world, you only use what’s inside your backpack. In an open world, you can borrow from the whole library.
🥬 The Concept (Closed-world vs. Open-world Debugging):
- What it is: Closed-world agents only use local files and immediate context; open-world agents also use outside knowledge, like GitHub history.
- How it works: Closed-world: search locally, guess, patch, test. Open-world: search broadly, compare with past cases, transfer the winning idea, then patch and test.
- Why it matters: Without open-world experience, agents repeat old mistakes and reinvent the wheel, especially on tricky bugs.
🍞 Anchor: If you’re learning to skateboard, it’s faster to watch a how-to video than to fall 50 times figuring it out alone.
🍞 Hook: Think of a giant binder of winning plays for a sports team: the plays exist, but without tabs, summaries, and categories, you’ll never find the one you need in time.
🥬 The Concept (The Pre-MemGovern Problem):
- What it is: Previous systems tried to retrieve raw documents or simple patches, or only looked within one repository.
- How it works: Single-shot retrieval (RAG) often grabbed text that looked similar on the surface but wasn’t the same problem underneath.
- Why it matters: Surface-matching leads to bad advice (like mixing up front-end formatting with back-end logic), wasting tokens and harming patch quality.
🍞 Anchor: It’s like Googling “how to fix a squeak” and getting advice about a violin when you meant your bike.
🍞 Hook: You know how recipe cards make cooking easier because they show the dish name on top and the steps on the back?
🥬 The Concept (What Was Missing):
- What it is: A governance system to turn noisy GitHub history into clean, searchable, reusable experience that agents can safely use.
- How it works: (1) Select good sources, (2) standardize into a common format, (3) keep only high-quality, verified content, then (4) let agents search and browse like humans do.
- Why it matters: Without governance, memory turns into junk drawers; with governance, it becomes a well-labeled toolbox you can trust.
🍞 Anchor: Think of MemGovern as the librarian who organizes all the best bug-fix playbooks so you can quickly grab the right one.
🍞 Hook: Why should you care? Because every time software crashes—on your phone, in a game, or in your homework app—faster, safer fixes make your life smoother.
🥬 The Concept (Real Stakes):
- What it is: Faster, more accurate debugging means fewer app crashes, quicker updates, and safer systems.
- How it works: Agents reuse proven human fixes, cutting trial-and-error and reducing risky patches.
- Why it matters: This saves developer time, lowers costs, and gives users more reliable apps.
🍞 Anchor: It’s like having the world’s best help desk on speed dial whenever something breaks.
02 Core Idea
🍞 Hook: Imagine a giant library of recipe cards where each card has a short title up front (so you can find it fast) and the cooking steps on the back (so you can do it right). Now imagine your cooking robot can search that library itself.
🥬 The Concept (The Aha! Moment):
- What it is: MemGovern’s key insight is to govern messy human debugging history into clean, two-layer “experience cards,” and to let agents search broadly first and read deeply later—just like skilled human engineers.
- How it works: (1) Curate and standardize GitHub issue–PR–patch history into cards with an Index Layer (symptoms for retrieval) and a Resolution Layer (root cause and fix strategy), (2) search across the Index Layers, (3) browse the most promising Resolution Layers, and (4) transfer the logic to the current code.
- Why it matters: Without this structure and search style, agents either drown in noise or miss the right lesson; with it, they borrow the exact reasoning they need.
🍞 Anchor: It’s like finding the right cooking card by its title (“chocolate chip cookies, chewy”) and then following the steps on the back to bake the perfect batch.
Multiple Analogies:
- Library Analogy: Index cards (book titles and tags) help you find likely books; then you open the book to learn the method. MemGovern gives agents both parts—searchable titles and the actual know-how.
- Sports Playbook: The cover shows the play’s situation (“3rd-and-long, blitz expected”); inside is the diagram and coaching tips. Agents pick the matching situation, then run the play.
- Detective Files: The folder tab lists the crime pattern; inside are the clues and the solution logic. The agent scans tabs first, then studies the most relevant file to solve today’s case.
Before vs. After:
- Before: Agents tried to fix bugs with only local clues or grabbed long, messy threads that “kind of matched,” leading to shallow or wrong patches.
- After: Agents retrieve by symptoms (Index Layer), then learn root-cause patterns and fix strategies (Resolution Layer), producing smarter, contract-respecting patches.
Why It Works (Intuition, not equations):
- Separate finding from fixing: searching benefits from short, general signals; fixing needs precise reasoning. Mixing them causes either missed matches or overload.
- Standardize the language: removing repo-specific names and chatter makes cross-project generalization possible.
- Govern quality: a checklist and refine loop keep the memory trustworthy, so agents don’t inherit bad habits.
- Agentic control: letting the agent alternate between “search wide” and “read deep” mirrors expert human workflows.
Building Blocks (each explained with the Sandwich pattern):
🍞 Hook: You know how you first scan a table of contents before reading a chapter? 🥬 The Concept (Index Layer):
- What it is: The front side of the card—normalized problem summary and reusable diagnostic signals (like error types or failing tests) for matching.
- How it works: Extract symptom-level terms, remove repo-specific details, make it easy to embed and compare.
- Why it matters: If the index is messy, you won’t find the right past case. 🍞 Anchor: Searching for “NullPointer on save in parser” is much better than matching a random line number in someone else’s project.
🍞 Hook: After picking the right chapter, you read the steps to actually do the thing. 🥬 The Concept (Resolution Layer):
- What it is: The back side—root cause, abstract fix strategy, and a patch digest that explains what changed and why.
- How it works: Boil down the human reasoning so it transfers across repositories.
- Why it matters: Without the reasoning, you can copy a patch but miss the principle. 🍞 Anchor: “Add boundary check for null inputs” is a strategy you can apply in any parser, not just the original one.
🍞 Hook: Think of two tools on your backpack: a flashlight to scan the room and a magnifying glass to inspect a clue. 🥬 The Concept (Dual-Primitive Interface: Searching and Browsing):
- What it is: Two actions for agents—Search (scan many Index Layers quickly) and Browse (open a chosen card’s Resolution Layer deeply).
- How it works: Search ranks candidates by symptom similarity; Browse reveals the human logic when the agent needs it.
- Why it matters: Mixing the two floods the agent with irrelevant text; separating them improves both speed and accuracy. 🍞 Anchor: You Google titles first, then open only the tabs that look right.
🍞 Hook: When solving a puzzle, you often try a guess, check it, and adjust. 🥬 The Concept (Progressive Agentic Search):
- What it is: A loop where the agent refines its query as it learns more and decides when to search again or browse deeper.
- How it works: Start with current symptoms → Search → skim candidates → Browse the best → map the strategy to your code → if unsure, refine the query and repeat.
- Why it matters: One-shot retrieval often fetches lookalikes that aren’t truly relevant. 🍞 Anchor: It’s like narrowing down from “sports shoes” to “trail running shoes, size 6, waterproof” until you find the perfect pair.
🍞 Hook: Not all advice online is good; you need fact-checking. 🥬 The Concept (Checklist-Based Quality Control):
- What it is: An LLM evaluator scores each card on key dimensions, gives feedback, and triggers fixes if needed.
- How it works: Score → if below threshold, refine only weak parts → recheck (up to 3 times).
- Why it matters: Prevents memory pollution and keeps trust high. 🍞 Anchor: It’s like a teacher marking a draft and asking you to fix just the confusing paragraph.
03 Methodology
At a high level: Raw GitHub issues/PRs/patches → Experience Governance (select, standardize, quality control) → Experience Cards Library → Agentic Experience Search (search, browse, analogical transfer) → Patch and Verify.
Step-by-step with Sandwich explanations along the way:
- Input Collection
- What happens: Gather GitHub repositories with enough activity (stars, issues, PRs) and extract linked issue–PR–patch triplets.
- Why this exists: We need real, grounded human debugging history—not synthetic examples.
- Example: A Django issue describing a crash, a PR that fixes it, and the patch diff.
🍞 Hook: You wouldn’t learn from random notes taped on a fridge; you’d pick a well-kept binder. 🥬 The Concept (Repository Selection):
- What it is: Choose active, well-maintained repos using a balanced score from stars, issues, and PRs.
- How it works: Compute a weighted score (log-scaled) and pick top-M repositories.
- Why it matters: Better sources → denser, higher-quality experiences. 🍞 Anchor: Popular, busy projects are like classrooms with lots of good questions and solid answers.
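For the curious, here is a minimal sketch of what such a log-scaled selection score could look like in Python. The equal weights, the cutoff M, and the example repository numbers are illustrative assumptions, not values from the paper.

```python
import math

def repo_score(stars: int, issues: int, prs: int,
               w_stars: float = 1.0, w_issues: float = 1.0, w_prs: float = 1.0) -> float:
    """Weighted, log-scaled activity score. Equal default weights are an assumption."""
    return (w_stars * math.log1p(stars)
            + w_issues * math.log1p(issues)
            + w_prs * math.log1p(prs))

# Rank candidate repositories and keep the top-M (example numbers are made up).
repos = [
    {"name": "big/active-project", "stars": 45000, "issues": 8000, "prs": 12000},
    {"name": "tiny/side-project", "stars": 12, "issues": 3, "prs": 1},
]
M = 1
top_m = sorted(repos, key=lambda r: repo_score(r["stars"], r["issues"], r["prs"]),
               reverse=True)[:M]
print([r["name"] for r in top_m])  # -> ['big/active-project']
```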
🍞 Hook: A good mystery story has all the key scenes; if half the pages are missing, you can’t solve it. 🥬 The Concept (Instance Purification – “Closed-loop” records):
- What it is: Keep only issue–PR–patch triplets with clear links, parsable diffs, and diagnostic anchors (like stack traces), and filter out low technical-content threads.
- How it works: Require explicit linkage, validated diffs, and a minimum technical ratio; drop social chatter and procedural noise.
- Why it matters: Incomplete or chatty threads confuse retrieval and learning. 🍞 Anchor: Keep the full recipe with ingredients and steps; throw away the party chatter around it.
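As a rough sketch, the purification step can be pictured as a filter like the one below. The field names and the 0.5 technical-content threshold are illustrative assumptions, not the paper's exact criteria.

```python
def is_closed_loop(record: dict, min_technical_ratio: float = 0.5) -> bool:
    """Keep only issue-PR-patch triplets that form a complete, technically dense record."""
    has_linkage = bool(record.get("issue_id")) and bool(record.get("pr_id"))
    has_valid_diff = record.get("patch", "").lstrip().startswith("diff --git")
    has_anchor = bool(record.get("stack_trace") or record.get("failing_tests"))
    tech_ratio = record.get("technical_tokens", 0) / max(record.get("total_tokens", 1), 1)
    return has_linkage and has_valid_diff and has_anchor and tech_ratio >= min_technical_ratio

# A complete record with a parsable diff and a stack trace passes;
# a chatty thread with no linked patch gets dropped.
```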
- Standardization into Experience Cards
🍞 Hook: Imagine turning messy chat logs into tidy recipe cards with a clear title and steps. 🥬 The Concept (Experience Cards):
- What it is: A unified schema with two layers: Index (for retrieval) and Resolution (for reasoning and action).
- How it works: Content purification removes non-technical bits; then we fill fields like Problem Summary, Signals, Root Cause, Fix Strategy, Patch Digest, and Verification.
- Why it matters: Standardization enables cross-repo matching and safe transfer of logic. 🍞 Anchor: Every card looks the same, so you can file, find, and reuse them.
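Below is a minimal sketch of the two-layer card as a Python data structure, using the fields just listed. The exact schema and field names in MemGovern may differ; this is only meant to make the Index/Resolution split concrete.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexLayer:
    """Front of the card: normalized, repo-agnostic signals used for retrieval."""
    problem_summary: str
    signals: List[str] = field(default_factory=list)  # e.g. error types, failing tests

@dataclass
class ResolutionLayer:
    """Back of the card: the transferable human reasoning."""
    root_cause: str
    fix_strategy: str
    patch_digest: str
    verification: str

@dataclass
class ExperienceCard:
    card_id: str
    index: IndexLayer
    resolution: ResolutionLayer

card = ExperienceCard(
    card_id="example-001",
    index=IndexLayer("TypeError on None input in parser", ["TypeError", "test_parse_none"]),
    resolution=ResolutionLayer(
        root_cause="Parser assumes the input is always a string",
        fix_strategy="Add a guard for None before parsing",
        patch_digest="Insert early validation at the parser entry point",
        verification="Run the parser unit tests, including the None-input case",
    ),
)
```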
🍞 Hook: First, read the label on the drawer; then, use what’s inside. 🥬 The Concept (Index Layer vs. Resolution Layer):
- What it is: Index Layer = how we find similar problems; Resolution Layer = how we fix them.
- How it works: Index stores generalizable symptoms; Resolution stores root-causes and strategies.
- Why it matters: Search needs broad signals; fixing needs precise reasoning. 🍞 Anchor: The drawer label says “screwdrivers,” and inside are the instructions for tricky screws.
🍞 Hook: Quality beats quantity when you’re building a first-aid kit. 🥬 The Concept (Checklist-Based Quality Control):
- What it is: An LLM evaluator uses a checklist to score each card and triggers a refine loop if needed (up to 3 iterations).
- How it works: Score → feedback → targeted regeneration → rescore.
- Why it matters: Prevents hallucinations and missing steps from entering the memory. 🍞 Anchor: Like a safety inspector signing off each emergency plan.
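A minimal sketch of that refine loop is shown below. The 0.8 scoring threshold and the `evaluate`/`refine` callables (stand-ins for LLM calls) are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

def quality_gate(card: Dict,
                 evaluate: Callable[[Dict], Tuple[float, str]],
                 refine: Callable[[Dict, str], Dict],
                 threshold: float = 0.8,
                 max_rounds: int = 3) -> Tuple[Dict, bool]:
    """Score the card; if it falls below threshold, regenerate only the weak parts
    based on the evaluator's feedback and rescore, up to max_rounds times."""
    for _ in range(max_rounds):
        score, feedback = evaluate(card)
        if score >= threshold:
            return card, True           # card passes the checklist and enters memory
        card = refine(card, feedback)   # targeted regeneration of the flagged fields
    return card, False                  # still failing after the last round: keep it out
```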
- Building the Memory Bank
- What happens: After deduplication and checks, we keep 135K high-quality experience cards.
- Why this exists: Scale increases the chance of finding a good analog, while governance preserves signal.
- Example: Cards covering null checks, boundary cases, API contracts, serialization, threading, etc.
- Agentic Experience Search at Run Time
🍞 Hook: Use a flashlight first (scan widely), then a magnifying glass (read deeply). 🥬 The Concept (Dual-Primitive Interface: Searching and Browsing):
- What it is: Searching ranks candidates by Index similarity; Browsing opens a card’s Resolution Layer to see the human reasoning.
- How it works: Embedding-based similarity for Search; detail reveal for Browse.
- Why it matters: Keeps context small until the agent is confident a card is relevant. 🍞 Anchor: Skim headlines; read only the best-fit articles.
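Here is a toy sketch of the two primitives. The real system uses a learned embedding model; in this sketch `embed` is supplied by the caller, and card storage is just a Python list.

```python
import numpy as np

class ExperienceMemory:
    """Toy Search/Browse interface over governed experience cards."""

    def __init__(self, cards, embed):
        # cards: list of dicts with an "index" (text) and a "resolution" (dict) field
        self.cards = cards
        self.embed = embed
        self.vectors = np.stack([embed(c["index"]) for c in cards])

    def search(self, query: str, top_k: int = 5):
        """SEARCH: rank cards by cosine similarity between the query and Index Layers."""
        q = self.embed(query)
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9)
        order = np.argsort(-sims)[:top_k]
        return [(int(i), float(sims[i]), self.cards[i]["index"]) for i in order]

    def browse(self, card_id: int) -> dict:
        """BROWSE: reveal the full Resolution Layer only once the agent commits to a card."""
        return self.cards[card_id]["resolution"]
```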
🍞 Hook: Solving puzzles means guess-check-iterate. 🥬 The Concept (Progressive Agentic Search):
- What it is: A loop of query formulation, retrieval, selective browsing, and analogical transfer.
- How it works: Extract keywords (symptoms, failing tests, stack traces), search Top-K, browse top hits, map Root Cause → Fix Strategy → Validation to the local code; refine queries if needed.
- Why it matters: Avoids being trapped by a bad first guess; steadily homes in on the right logic. 🍞 Anchor: From “weird crash” to “TypeError on None input in parser when option flag X is true.”
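Sketched as code, the loop might look like the function below (building on the toy `ExperienceMemory` above). `judge_relevant` and `refine_query` stand in for the agent's own LLM reasoning; none of these names come from the paper.

```python
def progressive_search(initial_query, memory, judge_relevant, refine_query,
                       top_k=5, max_rounds=3):
    """Search wide, browse deep, and reformulate the query when nothing fits."""
    query = initial_query
    for _ in range(max_rounds):
        hits = memory.search(query, top_k=top_k)          # breadth: skim Index Layers
        for card_id, _score, index_summary in hits:
            if judge_relevant(query, index_summary):
                return memory.browse(card_id)             # depth: read the chosen Resolution Layer
        query = refine_query(query, hits)                 # e.g. add the error type or failing test
    return None                                           # no good analog; fall back to the base agent
```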
🍞 Hook: Copying a solution is easy; transferring the idea is harder—but more powerful. 🥬 The Concept (Analogical Transfer):
- What it is: Map the abstract pattern from the card (root cause → modification → verification) to your repository’s files, APIs, and names.
- How it works: Keep the principle (e.g., add guard for null) but adapt specifics (variable names, function signatures, tests).
- Why it matters: Prevents brittle copy-paste and respects local API contracts. 🍞 Anchor: Different ovens, same cookie recipe—adjust the temperature and tray position, not the ingredients.
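A tiny before/after example of such a transfer (all names invented for illustration): the card's principle, "guard the null input," is kept, while the local function name, dictionary keys, and default value are adapted to the current repository.

```python
# Before: crashes with TypeError when `config` is None.
def load_timeout_before(config):
    return config["timeout"]

# After: same principle as the card's fix, expressed in this repo's own terms.
def load_timeout_after(config, default_timeout=30):
    if config is None:                       # transferred strategy: guard the null input
        return default_timeout
    return config.get("timeout", default_timeout)
```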
- Output: Patch and Verification
- What happens: The agent proposes a patch aligned with the strategy and runs tests as described in the Verification field.
- Why this exists: Verification guards against shallow fixes and ensures correctness.
- Example: Instead of bypassing types, add the specific guard the card recommends and confirm with unit tests.
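If the Verification field names concrete tests, the final check can be as simple as running them in the patched repository, as in the sketch below. Treating the field as a list of pytest-style test IDs is an assumption for illustration, not the paper's actual format.

```python
import subprocess
from typing import List

def run_card_verification(repo_path: str, verification_tests: List[str]) -> bool:
    """Run the tests named in the card's Verification field inside the patched repo."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *verification_tests],
        cwd=repo_path, capture_output=True, text=True,
    )
    return result.returncode == 0  # 0 means every verification test passed

# Example: run_card_verification("/path/to/repo", ["tests/test_parser.py::test_none_input"])
```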
The Secret Sauce:
- Two-layer cards separate search signals from reasoning, improving both retrieval and transfer.
- Governance filters noise and standardizes language so cross-repo reuse works.
- Agentic search decouples breadth (search many) and depth (browse a few), cutting context bloat while lifting accuracy.
- A large, high-quality memory (135K cards) increases coverage without sacrificing trust.
04 Experiments & Results
🍞 Hook: If two basketball teams both practice, the real test is game day. For code agents, game day is SWE-bench Verified.
🥬 The Concept (SWE-bench Verified):
- What it is: A benchmark of 500 real GitHub issues where the agent must fix bugs and pass developer-written tests.
- How it works: Each task gives an issue description and a self-contained repo; the agent proposes a patch; tests decide pass/fail.
- Why it matters: It’s a fair, repeatable way to measure real-world bug fixing. 🍞 Anchor: Like a science fair where judges run your experiment to see if it really works.
The Test
- What they measured: Resolution Rate—the percentage of issues the agent truly fixes (tests pass).
- Why this metric: It captures end-to-end success: understanding the bug, editing code, and verifying with tests.
The Competition
- Baseline: SWE-Agent (strong open-source agent backbone) without governed memory.
- Other baselines: AutoCodeRover, CodeAct, SWESynInfer, and more, plus many different LLM backbones (open and closed).
The Scoreboard (with context)
- Main result: MemGovern improved resolution rates by an average of 4.65% over SWE-Agent across many LLMs.
- Context: A +4.65% average here is like moving a whole letter grade up when everyone else is stuck; on hard, real-world issues, that’s significant.
- Notable boosts: Weaker backbones gained the most (e.g., +9.4% with GPT-4o), showing governed experience helps level up models that struggle.
- Token/cost trade-off: There’s extra token use for searching/browsing, but the accuracy gains generally outweigh the added cost.
Surprising Findings
🍞 Hook: Sometimes, getting more books doesn’t help if they’re messy. But when they’re well-organized, more is better. 🥬 The Concept (Memory Size Effect):
- What it is: Testing 10% to 100% of the 135K-card memory.
- How it works: Larger memory → higher chance of finding relevant analogs.
- Why it matters: Gains rose steadily (not just from a few “magic cards”), proving broad, governed coverage helps. 🍞 Anchor: A bigger, well-labeled library beats a tiny shelf every time.
🍞 Hook: Reading raw chat logs vs. curated notes—guess which one helps on test day? 🥬 The Concept (Memory Quality Effect):
- What it is: Compare raw PR+patch data vs. fully governed cards at the same size.
- How it works: Raw data sometimes helps but is unstable; governed cards give consistent improvements.
- Why it matters: The win comes from governance, not just feeding the agent more text. 🍞 Anchor: Neat class notes trump messy transcripts when you’re studying.
🍞 Hook: One big scoop vs. small smart bites. 🥬 The Concept (RAG vs. Agentic Search):
- What it is: Compare Single-shot RAG, Adaptive RAG (triggered mid-run), and Agentic Search (search wide, then browse selectively).
- How it works: Agentic Search outperformed both RAG styles across models.
- Why it matters: Decoupling breadth (candidate discovery) from depth (evidence grounding) filters noise and boosts patch quality. 🍞 Anchor: Skimming headlines first, then reading the two best articles, beats dumping ten random pages into your notebook.
🍞 Hook: How many search results should you open? Ten tabs or two? 🥬 The Concept (Top-K Sensitivity):
- What it is: Vary the number of retrieved candidates.
- How it works: Gains grow as K increases from small values; beyond a moderate K, the benefits plateau, while performance stays stable thanks to selective browsing.
- Why it matters: MemGovern is robust and efficient—no need to over-open tabs. 🍞 Anchor: After the first few good leads, extra tabs are mostly repeats.
Behavioral Insights
- With MemGovern, agents spent less time wandering (info gathering dropped) and more time testing wisely, making fewer blind edits.
- Case study: Instead of a “defensive bypass” that breaks API contracts, MemGovern guided a principled fix (e.g., validating inputs properly), leading to correct, maintainable patches.
Bottom line: Turning noisy human history into governed, searchable experience cards—and letting agents search then browse—consistently lifts real bug-fixing performance.
05 Discussion & Limitations
Limitations
- Token overhead: Searching and browsing memory adds tokens. Although accuracy rises, some scenarios may be highly cost-sensitive.
- Upstream dependency: Card quality depends on the governance pipeline (selection, purification, checklist refinement). If these steps weaken, memory reliability can drop.
- Domain shift: The memory is built from open-source GitHub projects. Very niche or proprietary codebases may see fewer close analogs.
- Freshness: As projects evolve, old fixes may become outdated; periodic re-governance is needed to stay current.
Required Resources
- Compute for building memory: Running LLMs to purify content, standardize cards, and perform quality checks.
- Storage and indexing: Holding and embedding 135K cards with fast similarity search.
- Runtime budget: Enough tokens and latency tolerance for search→browse loops.
When NOT to Use
- Ultra-simple bugs: If a quick local fix is obvious, memory search may be unnecessary overhead.
- Highly proprietary patterns: If APIs or architectures are unique and rarely mirrored in open source, analogies may be thin.
- Strict cost/latency caps: If your environment cannot afford any token overhead, a minimal pipeline may be preferable.
Open Questions
- Compression: How can we shrink Index/Resolution layers further without losing transferability?
- Continual updates: What’s the best way to auto-refresh cards as repos evolve?
- Learning from outcomes: Can we close the loop so agents add new, verified experience back into the memory—safely?
- Personalization: How can we adapt retrieval/browsing depth to a specific codebase’s style and constraints?
- Training synergy: Beyond plug-in retrieval, how well does fine-tuning on governed cards improve base models’ repair priors?
06 Conclusion & Future Work
Three-Sentence Summary
MemGovern turns messy GitHub histories into clean, two-layer experience cards and gives agents a smart way to search widely and then read deeply. This governed memory infrastructure raises real bug-fixing accuracy on SWE-bench Verified by an average of 4.65% across many language models. By separating how we find similar issues (Index) from how we fix them (Resolution), agents transfer human debugging logic reliably across repositories.
Main Achievement
A scalable, plug-and-play memory governance system (135K experience cards) plus an agentic search workflow that consistently outperforms traditional retrieval approaches and baseline agents.
Future Directions
Improve card compression to further cut token costs; automate continual refresh of cards as code evolves; integrate outcome-aware feedback so agents can safely contribute new experience; explore fine-tuning models on governed cards to internalize robust repair strategies.
Why Remember This
MemGovern shows that agents get much smarter when they learn from the world’s collective debugging experience—if that experience is governed, standardized, and searched agentically. It’s a blueprint for turning unstructured human history into practical, reliable memory that makes software safer, faster, and more maintainable.
Practical Applications
- •Integrate MemGovern as a plug-in to an existing SWE-Agent setup to boost bug-fixing accuracy on real repositories.
- •Use the repository selection score to curate high-signal sources when building internal debugging memories.
- •Adopt the two-layer card schema (Index/Resolution) to standardize incident reports inside an engineering org.
- •Enable agentic search (search wide, browse deep) in your dev tools to reduce prompt bloat and improve precision.
- •Run the checklist-based quality gate to keep postmortems and fix logs consistent, verifiable, and reusable.
- •Tune Top-K retrieval for your codebase to balance breadth of candidates with browsing cost.
- •Create team-specific memories by adding private, governed cards from internal issue trackers.
- •Use cards’ Verification fields to auto-generate regression tests and strengthen CI pipelines.
- •Train junior engineers by browsing cards that explain root causes and trade-offs behind common fixes.
- •Leverage analogical transfer guidance to enforce API contracts and avoid “defensive bypass” anti-patterns.