
Closing the Loop: Universal Repository Representation with RPG-Encoder

Intermediate
Jane Luo, Chengyu Yin, Xin Zhang et al. · 2/2/2026
arXiv · PDF

Key Summary

  • The paper introduces RPG-Encoder, a way to turn a whole code repository into one clear map that mixes meaning (semantics) with structure (dependencies).
  • It closes a 'reasoning loop' by letting AI compress code back into intent and also expand intent into working code.
  • RPG-Encoder lifts function and file meanings, builds a tidy hierarchy, and anchors it to real files and call/import links.
  • It updates fast by only touching parts that changed in a commit, cutting maintenance cost by 95.7%.
  • On SWE-bench Verified, it reaches 93.7% Acc@5 for function-level localization, beating strong baselines by big margins.
  • On RepoCraft, it reconstructs repositories with 98.5% coverage and much higher tested correctness than documentation alone.
  • An agent can search, fetch exact code, and traverse the graph to follow call chains using one unified interface.
  • Ablations show both meaning (features) and structure (dependencies) are necessary; removing either hurts accuracy.
  • It’s efficient: fewer steps and lower cost per correct localization than prior methods.
  • This unified map helps real teams find bugs faster, keep docs in sync, and rebuild or refactor large codebases more safely.

Why This Research Matters

Codebases keep growing, and teams need to find the right place to change without reading thousands of lines first. RPG-Encoder gives AI the same kind of map an expert engineer holds in their head: what each part is for and how it connects. That leads to faster bug fixes, safer refactors, and clearer onboarding for new teammates. Because it updates incrementally, companies can keep the map fresh without huge costs every week. It also enables reconstruction: if you need to rebuild modules in order or validate architecture after a refactor, the map becomes a blueprint. In practice, this saves engineering time, reduces production risk, and helps keep documentation aligned with reality.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine your school has a giant library with books scattered everywhere, some sorted by topic and some by who wrote them. Finding the exact page you need is slow and confusing.

🥬 The Concept: Code repositories are like those giant libraries. They are folders full of files, classes, and functions that all connect in many ways. How it works:

  1. Developers write code that lives in files and folders.
  2. Functions call each other, and files import each other, forming a web of connections.
  3. To fix a bug or add a feature, you must find the right spot in that web. Why it matters: Without a good map, even smart tools get lost and waste time reading irrelevant pieces. 🍞 Anchor: If a calculator app shows the wrong average, you need to find the exact function computing it, not just skim every math file.

🍞 Hook: You know how a recipe card explains the goal ('bake cookies') without showing every tiny stir? That’s like API documentation.

🥬 The Concept: API documentation is a text guide describing what parts of a codebase do. How it works:

  1. It explains functions and classes in human-friendly language.
  2. It lists parameters and return values.
  3. It helps you understand intent. Why it matters: Without docs, you might not know which tool to use, but with only docs, you don’t see how pieces truly connect in the code. 🍞 Anchor: A doc might say 'validate token' exists, but not where it’s used throughout the login flow.

🍞 Hook: Imagine a subway map that shows which stations connect but doesn’t say what’s interesting at each stop.

🥬 The Concept: A dependency graph shows structural connections like imports and calls between code pieces. How it works:

  1. Parse code to find 'who calls whom' and 'who imports what'.
  2. Draw nodes (files/functions) and edges (relationships).
  3. Let you follow execution paths. Why it matters: Without semantics, you may know paths but not purposes, making it hard to choose the right path. 🍞 Anchor: You can see function A calls function B, but you may not know B’s job—logging? math? parsing?

🍞 Hook: Think of trying to solve a maze with two half-maps: one shows landmarks, the other shows only paths. You need both on one sheet.

🥬 The Concept: The reasoning gap is the disconnect when tools use either docs (semantics) or graphs (structure), but not both together. How it works:

  1. Documentation gives intent but weak global navigability.
  2. Dependency graphs give structure but weak meaning.
  3. Tools guess the missing half and make errors. Why it matters: Without alignment, agents wander or misread code, slowing fixes and risking wrong changes. 🍞 Anchor: A bug report says 'normalize SVM votes'; the graph shows many SVM functions, but without semantics you don’t know which one’s about normalization.

🍞 Hook: Imagine breathing in and out. Writing code is like breathing out (expanding ideas), and understanding code is like breathing in (compressing details).

🥬 The Concept: The closed loop says generation expands intent into code, while comprehension compresses code back into intent. How it works:

  1. Generation: Start with goals, produce files and functions that match.
  2. Comprehension: Read code, infer goals and roles.
  3. A shared representation should support both directions. Why it matters: Without a closed loop, docs drift, graphs go stale, and reasoning breaks. 🍞 Anchor: If you can rebuild a repo from its map and also make the map from the repo, your map is probably faithful.

🍞 Hook: Think of a city map that labels neighborhoods by purpose (shops, parks) and also shows all the roads between them.

🥬 The Concept: The Repository Planning Graph (RPG) is a dual-view map where each node has meaning (what it does) and metadata (where it lives), and edges are both functional (hierarchy) and dependency (calls/imports). How it works:

  1. Nodes pair a short purpose description with code attributes (like file path).
  2. Functional edges build an intent-based hierarchy (e.g., 'Preprocessing' > 'Normalization').
  3. Dependency edges record who calls and who imports whom (a minimal sketch follows below). Why it matters: With only docs or only structure, agents get lost; RPG aligns both so navigation is precise and meaningful. 🍞 Anchor: In scikit-learn, 'Algorithms/Classification' sits above specific files; dependency edges then show which helpers those files call.
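To make the dual view concrete, here is a minimal Python sketch of one possible node layout, where each node pairs a semantic feature with grounding metadata and carries both edge types. The class name, fields, and the scikit-learn-flavored example IDs are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RPGNode:
    """One node in the dual-view map (class and field names are illustrative)."""
    node_id: str
    feature: str                    # short semantic tag, e.g. "compute and normalize SVM votes"
    kind: str                       # "area" | "category" | "file" | "function"
    file_path: str | None = None    # grounding metadata; None for abstract hierarchy nodes
    children: list[str] = field(default_factory=list)    # functional (hierarchy) edges
    depends_on: list[str] = field(default_factory=list)  # dependency (call/import) edges

# A tiny slice of a scikit-learn-style repository:
svm_file = RPGNode("Algorithms/Classification/SVM/_classes.py",
                   "support vector classification", "file",
                   file_path="sklearn/svm/_classes.py")
decision_fn = RPGNode("Algorithms/Classification/SVM/decision_function",
                      "compute and normalize SVM votes", "function",
                      file_path="sklearn/svm/_base.py",
                      depends_on=["Utilities/Validation/check_array"])
svm_file.children.append(decision_fn.node_id)
```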

  • The World Before: Tools were split; documentation-heavy approaches knew 'what' but lost their way, while graph-heavy approaches knew 'how' pieces touched but not 'why'.
  • The Problem: Agents couldn’t reliably jump from a bug description to the exact function to edit without reading mountains of code.
  • Failed Attempts: Long-context summaries missed fine details; plain graphs ignored intent; keeping either up-to-date was costly and drifted.
  • The Gap: We needed one evolving map that fuses meaning with structure and supports both understanding and building.
  • Real Stakes: Faster bug fixes, safer refactors, better onboarding, and reliable auto-generation affect apps we use daily, from web logins to data analysis tools.

02 Core Idea

🍞 Hook: You know how the best treasure maps show both the landmarks ('big oak tree') and the paths between them? That’s how you find the chest fast.

🥬 The Concept: Aha! Unify meaning and structure in one evolving map so AI can go from intent to code and back again. How it works:

  1. Lift code into concise semantic features for functions/files (the 'landmarks').
  2. Organize those into a functional hierarchy (the 'neighborhoods').
  3. Ground them to real files and dependency edges (the 'roads').
  4. Keep the map fresh with small updates when commits land (no full rebuild).
  5. Let agents search, fetch, and traverse this map as one interface. Why it matters: Without a unified, up-to-date map, agents read too much, miss key spots, or follow the wrong trail. 🍞 Anchor: A bug that says 'fix SVM vote normalization' quickly narrows to the exact function, its helpers, and where to change logic.

Three Analogies:

  • Museum Guide: Docents (semantics) tell stories; floor plans (structure) show rooms. RPG-Encoder merges both into an audio-guide map that tells you what’s in each room and how to get there.
  • GPS + Points of Interest: GPS shows streets (dependencies); POIs show what places are (semantics). Together they route you to the right cafe quickly.
  • Lego Instructions: Steps show assembly order (dependencies); callouts say what each sub-build is for (semantics). Combining both prevents wrong builds.

🍞 Hook: Think of stacking boxes from big to small so you can always find the right size quickly.

🥬 The Concept: Dual-view alignment means every piece of code lives in an intent hierarchy and a dependency web at the same time. How it works:

  1. Each node has a short 'what it does' feature and code metadata.
  2. Functional edges form categories and subcategories.
  3. Dependency edges record calls/imports for execution flow. Why it matters: If you lose either view, you lose either purpose or pathway—both are needed for precise fixes. 🍞 Anchor: 'Data Preprocessing/Normalization' leads to files; dependency edges then reveal which math utilities those files call.

🍞 Hook: You know how you can sum up a movie plot in a few sentences to remember it later?

🥬 The Concept: Semantic lifting turns verbose code into short, normalized behaviors like 'validate token' or 'compute average'. How it works:

  1. Parse functions/classes as units.
  2. Extract verbs + objects that state purpose, not implementation.
  3. Summarize files from their functions. Why it matters: Short, consistent behavior tags let AI match natural-language bug reports to exact code units. 🍞 Anchor: From 'def check_increasing(x, y):', we store 'check monotonic trend' so a query about 'monotonic check' finds it fast.
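A minimal sketch of the lifting step, assuming a summarizer (an LLM in the paper's setting) sits behind a hypothetical `summarize_behavior` callable; the actual prompts and normalization rules are not shown here.

```python
import ast

def lift_function_features(source: str, summarize_behavior) -> dict[str, str]:
    """Parse a module and produce a short verb-object tag per function.

    `summarize_behavior` is a stand-in for whatever model call produces the
    normalized tag (e.g. 'check monotonic trend'); it is an assumption, not
    an API from the paper.
    """
    tree = ast.parse(source)
    features = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            snippet = ast.get_source_segment(source, node) or node.name
            features[node.name] = summarize_behavior(snippet)
    return features

# Example with a trivial placeholder summarizer:
src = "def check_increasing(x, y):\n    'Check if y is monotonically correlated with x.'\n"
print(lift_function_features(src, lambda code: "check monotonic trend"))
# -> {'check_increasing': 'check monotonic trend'}
```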

🍞 Hook: Imagine books sorted by what they’re about, not just by shelf number.

🥬 The Concept: Hierarchical aggregation groups file features into a tidy tree of functional areas, categories, and subcategories. How it works:

  1. Discover top-level areas (e.g., 'Preprocessing', 'Algorithms').
  2. Route features under the best-fitting branches.
  3. Insert intermediate nodes when needed for granularity. Why it matters: A clean tree shrinks search from thousands of functions to the right small corner. 🍞 Anchor: 'Algorithms/Classification/SVM' collects the exact files and functions for SVM classification.
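A sketch of the routing idea behind hierarchical aggregation. A toy keyword-overlap score stands in for whatever semantic matcher (LLM-based or embedding-based) the real system uses; the category names are illustrative.

```python
def route_to_category(file_feature: str, categories: dict[str, set[str]]) -> str:
    """Route a file's semantic feature to the best-fitting category branch.

    `categories` maps a category path to a keyword set; the overlap score is a
    toy stand-in for the paper's actual matching step.
    """
    words = set(file_feature.lower().split())
    return max(categories, key=lambda cat: len(words & categories[cat]))

categories = {
    "Preprocessing/Normalization": {"scale", "normalize", "standardize"},
    "Algorithms/Classification/SVM": {"svm", "support", "vector", "margin"},
}
print(route_to_category("normalize feature columns before training", categories))
# -> "Preprocessing/Normalization"
```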

🍞 Hook: A map label like 'Playground' is only useful if you know which park it sits in.

🥬 The Concept: Artifact grounding anchors abstract nodes to real directories/files using lowest-common-ancestor paths and AST-based dependencies. How it works:

  1. Compute minimal directory scopes covering a group.
  2. Attach metadata like types and file paths.
  3. Add import/call edges from code parsing. Why it matters: Without grounding, the hierarchy floats above the real code; with it, you can jump straight to files and navigate call chains. 🍞 Anchor: 'Preprocessing/Scaling' maps to 'sklearn/preprocessing' and shows which scaler functions import numpy helpers.
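A minimal sketch of the grounding step under two simplifying assumptions: the directory scope is approximated with `os.path.commonpath` instead of the paper's trie-checked lowest-common-ancestor routine, and dependency edges are limited to top-level imports read from the AST.

```python
import ast
import os

def directory_scope(file_paths: list[str]) -> str:
    """Smallest directory covering a group of grounded files."""
    return os.path.commonpath(file_paths)

def import_edges(file_path: str) -> list[tuple[str, str]]:
    """Extract (this_file -> imported_module) edges from one file's AST."""
    with open(file_path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read(), filename=file_path)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            edges.extend((file_path, alias.name) for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.append((file_path, node.module))
    return edges

# e.g. directory_scope(["sklearn/preprocessing/_data.py",
#                       "sklearn/preprocessing/_encoders.py"]) -> "sklearn/preprocessing"
```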

🍞 Hook: When you fix a small typo in a report, you don’t reprint the whole book.

🥬 The Concept: Incremental evolution updates only the changed parts of the graph when commits land. How it works:

  1. Parse diffs to detect additions, deletions, or modifications.
  2. Regenerate features only for touched entities.
  3. Re-route nodes only if their purpose truly shifts (a minimal update loop is sketched below). Why it matters: Full rebuilds are costly; small, targeted updates keep the map fresh at roughly 23× lower cost. 🍞 Anchor: Changing 'spearmanr' to 'np.asarray' updates just that function node and nearby edges, not the whole repo.
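A sketch of that update loop, assuming commit diffs arrive as simple (path, change_type) pairs and that a hypothetical `lift_file` helper re-runs semantic lifting on a single file. The real system works at function-level granularity and re-routes nodes on intent drift, which is only hinted at here.

```python
def apply_commit(rpg: dict, changed_files: list[tuple[str, str]], lift_file) -> dict:
    """Update only the graph entries touched by a commit.

    `rpg` maps file paths to node records; `lift_file` is a stand-in for
    re-running semantic lifting on one file. Unchanged files are never
    re-processed, which is where the cost savings come from.
    """
    for path, change in changed_files:
        if change == "deleted":
            rpg.pop(path, None)            # drop the node (and, in practice, its edges)
        elif change in ("added", "modified"):
            rpg[path] = lift_file(path)    # regenerate features for this file only
        # a real implementation would also re-route the node in the hierarchy
        # if its new feature no longer fits the current category (intent drift)
    return rpg
```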

🍞 Hook: Think of a Swiss Army knife with three tools you always use first.

🥬 The Concept: A unified interface exposes three tools—SearchNode (find by behavior), FetchNode (get exact code), ExploreRPG (follow edges). How it works:

  1. Search by behavior phrases to get candidates.
  2. Fetch precise file paths and line ranges.
  3. Traverse dependencies up/down to see impact. Why it matters: One interface means less guessing, fewer wrong turns, and faster, safer edits. 🍞 Anchor: 'Normalize SVM votes' → SearchNode finds the SVM decision function → FetchNode shows its lines → ExploreRPG reveals helper calls to tweak.
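Here is one way the three tools could look as a small Python interface over an already-encoded graph. The class, method signatures, and dictionary layout are assumptions made for illustration; the paper specifies the tools' behavior, not this exact API.

```python
class RPGInterface:
    """Illustrative wrapper exposing SearchNode / FetchNode / ExploreRPG."""

    def __init__(self, nodes: dict[str, dict]):
        # nodes: node_id -> {"feature": str, "file_path": str,
        #                    "lines": (start, end), "code": str,
        #                    "depends_on": [node_id, ...]}
        self.nodes = nodes

    def search_node(self, behavior: str, k: int = 5) -> list[str]:
        """Rank nodes by overlap between a behavior phrase and their features."""
        words = set(behavior.lower().split())
        ranked = sorted(self.nodes, key=lambda n: -len(
            words & set(self.nodes[n]["feature"].lower().split())))
        return ranked[:k]

    def fetch_node(self, node_id: str) -> dict:
        """Return exact location and a short code preview for one node."""
        rec = self.nodes[node_id]
        return {"file_path": rec["file_path"], "lines": rec["lines"],
                "preview": rec["code"][:200]}

    def explore_rpg(self, node_id: str) -> list[str]:
        """Follow dependency edges one hop outward from an anchor node."""
        return self.nodes[node_id]["depends_on"]
```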

03 Methodology

At a high level: Input (Raw Code + Commits) → Encoding (Lift semantics → Build hierarchy → Ground artifacts) → Evolution (Incremental updates) → Operation (Search/Fetch/Explore) → Output (A unified, navigable RPG).

🍞 Hook: Imagine turning a messy garage into labeled shelves, then keeping it tidy with quick nightly cleanups, and finally using a flashlight and a path map to grab exactly what you need.

🥬 The Concept: RPG-Encoder is a three-part recipe to build, maintain, and use a dual-view map of a repository. How it works:

  1. Encoding: Convert code into features, assemble a functional tree, and attach real file/dependency links.
  2. Evolution: On commits, only adjust the parts that changed.
  3. Operation: Provide tools to search by behavior, fetch exact code, and traverse dependencies. Why it matters: Without this pipeline, agents drown in text or wander a structure with no meaning. 🍞 Anchor: A scikit-learn bug fix goes from report → matched function → exact file lines → related helpers, all inside one map.

Step A: Encoding

  • Phase 1: Semantic Lifting 🍞 Hook: You know how you label moving boxes 'kitchen' or 'books' so movers know where they go? 🥬 The Concept: Extract short behavior tags for each function/class and summarize files. How it works:

    1. Batch-parse code entities (functions/classes).
    2. Generate concise, normalized verb-object features.
    3. Summarize file-level purpose from its entities. Why it matters: These tags line up natural-language queries with the exact code units. 🍞 Anchor: 'def check_increasing' → 'check monotonic trend'.
  • Phase 2: Semantic Structure Reorganization 🍞 Hook: Shelving books by topic makes finding one way faster than sorting by spine color. 🥬 The Concept: Build a three-level functional hierarchy from features. How it works:

    1. Discover top-level functional areas.
    2. Route files under best-fit categories/subcategories.
    3. Insert intermediate nodes when direct links are too coarse. Why it matters: The tree prunes the search space from entire repos to highly relevant branches. 🍞 Anchor: 'Algorithms/Classification/SVM' contains the SVM decision function nodes.
  • Phase 3: Artifact Grounding 🍞 Hook: A mall directory matters only if it tells you which floor and store number. 🥬 The Concept: Tie abstract nodes to real paths and add dependency edges. How it works:

    1. Compute minimal directory scopes via lowest common ancestors with trie checks.
    2. Attach metadata (type, file path) to nodes.
    3. Parse AST to add imports/calls edges. Why it matters: You can jump from an intent category to exact files and follow call chains. 🍞 Anchor: 'Preprocessing/Scaling' anchors to 'sklearn/preprocessing' and shows which scalers call numpy.

Step B: Evolution (Incremental Maintenance) 🍞 Hook: Patch a hole in a fence; don’t rebuild the yard. 🥬 The Concept: Update only changed nodes and nearby edges when commits arrive. How it works:

  1. Detect deletes, inserts, and modifies at function/file granularity.
  2. For modification, update features in place if intent is stable; re-route only if intent drifts.
  3. Refresh local dependency edges by re-parsing just affected ASTs. Why it matters: 95.7% token cost reduction across histories means sustainable long-term syncing. 🍞 Anchor: A renamed helper triggers a small local re-parse of edges, not a full graph rebuild.

Step C: Operation (Unified Reasoning Substrate)

  • Tool 1: SearchNode 🍞 Hook: Ask a librarian, 'Where are the baking books about sourdough?' 🥬 The Concept: Find code by behavior phrases or symbols and optionally restrict to subtrees. How it works:

    1. Feature-based matching maps intent to nodes.
    2. Snippet search can use identifiers/paths if needed.
    3. Scoping keeps searches precise. Why it matters: This avoids scanning entire repos for a single behavior. 🍞 Anchor: Query 'normalize SVM votes' returns the decision_function node.
  • Tool 2: FetchNode 🍞 Hook: Before you buy, you read the exact page to be sure it’s right. 🥬 The Concept: Retrieve exact file paths, line ranges, and previews for candidates. How it works:

    1. Input candidate nodes.
    2. Return precise metadata and code snippet.
    3. Confirm semantic fit. Why it matters: Prevents reasoning on guesses; grounds edits to real code. 🍞 Anchor: Fetch shows lines 768–798 of the SVM decision function for inspection.
  • Tool 3: ExploreRPG 🍞 Hook: Follow hallway arrows to see which rooms connect next. 🥬 The Concept: Traverse calls/imports and hierarchy up or down from anchors. How it works:

    1. Start from validated nodes.
    2. Walk upstream (dependencies) or downstream (dependents).
    3. Filter by entity types and edge kinds. Why it matters: Reveals impact surfaces and root causes without guesswork. 🍞 Anchor: From decision_function, walk to normalization helpers and vote aggregation.
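A sketch of the traversal behind ExploreRPG: a breadth-first walk over dependency edges, in either direction, starting from a validated anchor node. The dictionary layout matches the illustrative interface sketched earlier and is an assumption, not the paper's schema.

```python
from collections import deque

def traverse(nodes: dict[str, dict], start: str, direction: str = "upstream",
             max_hops: int = 3) -> list[str]:
    """Walk dependency edges from `start`, up to `max_hops` hops away.

    'upstream' follows what `start` depends on; 'downstream' follows who
    depends on `start` (its dependents).
    """
    if direction == "downstream":  # invert edges to find dependents
        edges = {n: [m for m, rec in nodes.items() if n in rec["depends_on"]]
                 for n in nodes}
    else:
        edges = {n: rec["depends_on"] for n, rec in nodes.items()}

    seen, order, queue = {start}, [], deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, hops + 1))
    return order
```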

Secret Sauce 🍞 Hook: Good smoothies blend fruits that taste great together; bad ones mix random things. 🥬 The Concept: The clever bit is coupling semantic features and dependency edges inside one hierarchy, then evolving it incrementally. How it works:

  1. Meaning narrows where to look.
  2. Structure shows how to change it safely.
  3. Incremental updates keep it fresh cheaply. Why it matters: Any one part alone underperforms; the combo is what unlocks big accuracy and efficiency gains. 🍞 Anchor: That’s why function-level Acc@5 jumps to 93.7% on SWE-bench Verified while using fewer steps and dollars.

04 Experiments & Results

🍞 Hook: When you race, you don’t just say you finished—you compare lap times and who you beat.

🥬 The Concept: The team tested RPG-Encoder on two fronts: finding the right code (localization) and rebuilding a repo (reconstruction). How it works:

  1. Repository Understanding: SWE-bench Verified and SWE-bench Live Lite measure how well the system pinpoints files/functions from issue texts.
  2. Repository Reconstruction: RepoCraft checks if the map is complete enough to rebuild projects in correct order and pass tests.
  3. Metrics: Acc@k (is the answer in top-k?), Precision/Recall (how clean and complete are picks?), Coverage and Pass Rate (how much was rebuilt and how correct?). Why it matters: Strong numbers on both finding and building show the map is both navigable and faithful. 🍞 Anchor: It’s like a city map that both guides you to a cafe fast and also lets you reconstruct the whole city layout.
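To pin down the headline metric: Acc@k counts an issue as solved when any of the top-k predicted locations matches a gold location. The snippet below is a generic implementation for illustration, not the benchmark's official scorer.

```python
def acc_at_k(predictions: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    """Fraction of issues whose top-k predicted locations include a gold location."""
    hits = sum(1 for preds, answers in zip(predictions, gold)
               if any(p in answers for p in preds[:k]))
    return hits / len(gold)

# Toy example: 2 of 3 issues have the right function in their top-5 list.
preds = [["svm.decision_function", "svm.fit"], ["utils.check_array"], ["io.loader"]]
gold  = [{"svm.decision_function"}, {"preprocessing.scale"}, {"io.loader"}]
print(round(acc_at_k(preds, gold, k=5), 3))   # -> 0.667
```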

The Test

  • SWE-bench Verified: Human-validated issues on well-known repos.
  • SWE-bench Live Lite: Newer issues to avoid training contamination.
  • RepoCraft: Rebuild repositories like Requests or scikit-learn from the representation, not from docs alone.

The Competition

  • Agentless: Text narrowing without graph priors.
  • LocAgent: Graph-guided traversal using dependency schemas.
  • CoSIL: Iterative search on static code graphs.
  • OrcaLoca: Adds dynamic execution signals and planning.

The Scoreboard (with context)

  • Function-level localization on SWE-bench Verified: With Claude-4.5, RPG-Encoder hits 93.7% Acc@5—a solid A+, while strong baselines sit about a letter grade lower.
  • On SWE-bench Live Lite, function Acc@5 improves by 10–15 points over best baselines depending on backbone, showing robustness across fresh issues.
  • Reconstruction on RepoCraft: 98.5% coverage and 86.0% unit-test pass rate with GPT-5-mini. Docs-only baselines recover roughly 17% of code volume and far fewer passing tests—like trying to rebuild a city from tourist brochures.
  • Efficiency: Fewer steps and lower cost per correct hit. Example: on GPT-5, highest Acc@5 per dollar (about 4.15) compared to ~1–3 for others.

Surprising/Notable Findings

  • Dual-view is necessary: Ablations show removing semantic features hurts function-level hits the most; removing dependencies hurts file-level retrieval and traversal.
  • Incremental updates preserve fidelity: Despite 95.7% cost reduction, accuracy stays essentially on par with full rebuilds across commit histories.
  • Behavior pattern: Tools induce a 'Search-then-Zoom' habit—agents first scan the topology, then dive deep, showing the interface shapes smarter exploration.

🍞 Hook: Think of grading not just by the final answer but by how few hints you used to get there.

🥬 The Concept: RPG-Encoder makes agents both more accurate and more frugal. How it works:

  1. Intent tags quickly shortlist true candidates.
  2. Grounded edges reveal exactly what to read next.
  3. Shorter paths mean lower token and time costs. Why it matters: Real teams need speed and budgets to hold. 🍞 Anchor: Fixing a regression goes from a dozen blind file reads to one targeted function fetch and a couple of edge hops.

05 Discussion & Limitations

🍞 Hook: Even a great map can be less helpful in a tiny village or during a sudden earthquake.

🥬 The Concept: RPG-Encoder is powerful but not magic; it has limits and best-use conditions. How it works:

  1. Limitations: Very small repos may not benefit from a full hierarchy; highly dynamic behavior (runtime-generated code) can elude static edges; nuanced intent shifts can be misjudged in diffs; non-Python or polyglot repos may need specialized parsers.
  2. Required Resources: LLM access for feature extraction/routing; AST/static analysis tools; storage for the graph; CI integration for commit hooks.
  3. When Not to Use: One-file scripts; projects where behavior depends mostly on runtime/dynamic import tricks; secret-heavy repos where code cannot be indexed.
  4. Open Questions: How to blend runtime traces with static edges safely? How to generalize to multi-language monorepos at scale? How to auto-detect subtle semantic drift more reliably? Can models learn to predict optimal traversal policies directly from RPGs? Why it matters: Knowing boundaries prevents overpromising and guides the next research steps. 🍞 Anchor: For a tiny utility with two functions, a plain read may beat building a full RPG; for a huge ML library, RPG-Encoder shines.

06 Conclusion & Future Work

🍞 Hook: Imagine a map that helps you find a cafe fast, fixes the street names when the city updates, and could even help rebuild the city if it vanished.

🥬 The Concept: RPG-Encoder turns codebases into a unified, evolving map that joins meaning with structure and supports both understanding and generation. How it works:

  1. Lift semantics from functions/files.
  2. Organize them into a functional hierarchy.
  3. Ground to real files and dependency edges, then update incrementally and operate with search/fetch/traverse tools. Why it matters: It closes the loop—compress code into intent and expand intent into code—with high accuracy and efficiency. 🍞 Anchor: That’s why it reaches 93.7% Acc@5 for function localization and 98.5% reconstruction coverage with strong pass rates.

  • 3-Sentence Summary: RPG-Encoder unifies documentation-like meaning and graph-like structure into one Repository Planning Graph that stays in sync with commits. It lets agents find the right spot to change and also rebuild repositories in topological order. Experiments show big accuracy gains, strong fidelity, and far lower maintenance costs.
  • Main Achievement: Demonstrating that a single, dual-view, incrementally updated representation can outperform fragmented approaches on both navigation and reconstruction.
  • Future Directions: Polyglot support, blending runtime signals with static edges, automated drift detection, and learned traversal strategies on top of RPGs.
  • Why Remember This: It’s a blueprint for closed-loop software engineering: one representation that helps you understand, change, and rebuild complex codebases confidently.

Practical Applications

  • Speed up bug localization by matching issue text to function-level nodes and traversing only relevant dependencies.
  • Guide safe refactors by visualizing impacted modules along dependency edges from the edited node.
  • Automate repository reconstruction in controlled environments using the RPG’s topological order.
  • Keep architectural docs in sync by generating up-to-date functional hierarchies from code changes.
  • Improve code review by attaching semantic features and dependency context to diffs.
  • Onboard new engineers with a functional map that links high-level areas to exact files and call chains.
  • Prioritize tests by following dependency paths from a change to likely affected modules.
  • Harden CI pipelines with incremental RPG updates on every commit to detect semantic drift.
  • Enhance code search by scoping queries to functional subtrees for higher precision.
  • Support design audits by checking whether functional areas map cleanly to intended directory scopes.
#Repository Planning Graph #RPG-Encoder #semantic lifting #dependency graph #functional hierarchy #artifact grounding #incremental updates #repository understanding #repository reconstruction #SWE-bench #RepoCraft #code localization #AST analysis #dual-view representation #software engineering agents