Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey
Key Summary
- This paper is the first big map of how AI can fix real software problems, not just write short code snippets.
- It explains the full pipeline: read the issue, explore a big codebase, find the bug, write a patch, run tests, and repeat until it passes.
- New datasets like SWE-bench and its variants made the task realistic and hard, which pushed researchers to build smarter agents and better training data.
- There are two main ways to build these systems: training-free (use tools, memory, and smart search at test time) and training-based (fine-tune or use reinforcement learning).
- Good data is everything: the field now automates collecting issues, setting up Docker environments, synthesizing bugs, and validating tests at scale.
- Agents work better with add-ons like fault localization, code search, AutoDiff patching, test generation, verifiers, and memory to avoid repeating mistakes.
- Results are rising fast: top systems can solve many more issues than older baselines, but efficiency, safety, and data leaks remain big concerns.
- The survey lists key challenges ahead: compute cost, efficiency-aware metrics, multimodal (visual) bugs, safer agents, and finer-grained rewards.
- It offers a clear taxonomy (Data, Methods, Analysis) and a living repository so the community can keep building together.
Why This Research Matters
Real software used by people and businesses breaks in complex ways, and fixing it quickly and safely is vital. This survey shows how AI can help with the whole job—finding the bug, editing code, and proving the fix—inside large, messy codebases. By mapping the ecosystem of data, tools, training methods, and benchmarks, it helps teams build agents that are more accurate, efficient, and secure. Stronger evaluations (like Verified/Live/Multimodal) restore trust in reported results and reduce accidental cheating or data leaks. In practice, this can shorten bug backlogs, speed up feature delivery, and reduce outages. Over time, safer and more efficient agents will let humans focus on creative design while AI handles routine maintenance.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how fixing a bike is harder than building a LEGO model from instructions? With a bike, you must find what squeaks, choose the right tool, and test-ride it to see if the fix works.
🥬 The Concept (Large Language Models, LLMs): LLMs are smart text tools that read and write like people. How it works (simple):
- They read the words you give them.
- They use patterns they learned from lots of text.
- They write new words that (hopefully) solve your task. Why it matters: Without LLMs, we wouldn’t even try to automate complex software fixes. 🍞 Anchor: When you ask a chatbot to explain a bug or suggest a code snippet, that’s an LLM helping you.
🍞 Hook: Imagine walking into a huge library (a code repository) to find one misprinted sentence (a bug) that breaks a story (the program).
🥬 The Concept (Issue Resolution): Issue resolution means reading a problem report, finding the broken code in a big repo, changing it, and proving the fix by running tests. How it works:
- Read the issue description (what's broken).
- Explore the repository (where could the problem be?).
- Edit code to make a patch.
- Run tests to check everything still works. Why it matters: Without full issue resolution, AI would fix toy problems but fail on real apps. 🍞 Anchor: A user reports “clicking Save crashes the app.” The agent finds the bug in save_handler.py, edits two lines, and all tests pass.
🍞 Hook: Think of test races where cars run laps to compare speed and safety.
🥬 The Concept (SWE-bench Benchmark): SWE-bench is a standard set of real GitHub issues plus code and tests used to fairly measure AI agents. How it works:
- Pair real issues with exact repo snapshots.
- Build a runnable environment (like Docker) with dependencies.
- Apply an agent’s patch and run tests to see if it truly fixes the issue. Why it matters: Without a fair race track, we can’t tell which agent actually solves real problems. 🍞 Anchor: SWE-bench Verified is like the “referee crew” that double-checks that the racetrack isn’t broken and the lap times are honest.
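To make this concrete, here is a minimal sketch of what one SWE-bench-style task contains and how a patch gets judged. The field names follow the public SWE-bench format, but the values and the helper function are illustrative.

```python
# A minimal sketch of one SWE-bench-style task instance and the pass/fail check.
# Field names follow the public SWE-bench format; values are illustrative.

instance = {
    "repo": "scikit-learn/scikit-learn",          # which GitHub project
    "base_commit": "abc123",                       # exact snapshot the issue was filed against
    "problem_statement": "Clicking Save crashes",  # the issue text given to the agent
    "FAIL_TO_PASS": ["test_warm_start_bug"],       # tests that must flip from failing to passing
    "PASS_TO_PASS": ["test_fit", "test_predict"],  # tests that must keep passing (no regressions)
}

def is_resolved(test_results: dict[str, bool], instance: dict) -> bool:
    """A patch counts as a resolve only if it fixes the bug AND breaks nothing."""
    f2p_ok = all(test_results.get(t, False) for t in instance["FAIL_TO_PASS"])
    p2p_ok = all(test_results.get(t, False) for t in instance["PASS_TO_PASS"])
    return f2p_ok and p2p_ok
```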
🍞 Hook: Before this research, people mostly timed sprinters (short code tasks), not marathoners (full repo tasks).
🥬 The World Before: Early success came from function-level code generation (like HumanEval). But real software is a living city: many files, tests, build steps, and people. LLMs writing small snippets couldn’t easily navigate build systems, flaky tests, or multi-file dependencies. They also lacked tools to explore codebases or run commands. Why it matters: Trying to judge repo-level skills using tiny problems hid the real challenges. 🍞 Anchor: A model that can write a 10-line function may still fail to fix a bug in scikit-learn that requires changing two modules and updating tests.
🍞 Hook: Picture detectives who only read witness notes but never visit the crime scene.
🥬 The Problem: LLMs struggled with dynamic interaction: installing dependencies, running tests, searching files, and coordinating multiple steps. How it works (the challenge):
- Environments must be reproducible (Docker/Conda).
- Tools are needed to search, edit, and test.
- Long conversations/logs must be summarized so context doesn’t rot. Why it matters: Without these, agents wander, overthink, or break more than they fix. 🍞 Anchor: An agent that can’t run tests can’t know if its patch works, just like a chef who never tastes the soup.
🍞 Hook: Imagine trying to grow a forest from a few seeds.
🥬 Failed Attempts: Manually curated datasets were small and sometimes buggy; agents had brittle tool use; evaluations were polluted by data leaks or weak tests; and many systems prioritized fancy reasoning but skipped environment interaction. Why it matters: Results looked better than they were; agents were not robust in the wild. 🍞 Anchor: If test suites miss failures, an agent can submit a wrong patch that still “passes.”
🍞 Hook: Think of building a reliable school exam: clear questions, fair scoring, and no answer key leaks.
🥬 The Gap: The field lacked a clean taxonomy, scalable data pipelines (collection + synthesis), reliable environments, consistent scoring, and clear guidance on when to use training-free vs. training-based methods. Why it matters: Without this map, teams reinvent parts, compare apples to oranges, and miss key safety issues. 🍞 Anchor: This survey organizes the space into Data, Methods, and Analysis so teams can pick the right puzzle pieces and measure progress fairly.
🍞 Hook: When your favorite app breaks, you want it fixed fast—and safely.
🥬 Real Stakes: Faster bug fixes save time and money, strong tests prevent regressions, and safer agents avoid deleting critical files or cheating on benchmarks. Why it matters: These systems can help maintainers, startups, and big enterprises deliver reliable software, affecting everything from payments to healthcare. 🍞 Anchor: A company can use an agent to triage and repair non-critical bugs overnight, while engineers focus on new features by day.
02 Core Idea
🍞 Hook: Imagine upgrading from a pocket flashlight (tiny code tasks) to a lighthouse (full app repairs across storms).
🥬 The Aha! Moment: Solving real software issues needs more than code generation—it needs the full stack: good data, runnable environments, smart tools, memory, search, and sometimes training (SFT/RL), all measured by rigorous benchmarks. Why it matters: Without this holistic view, agents seem smart in demos but fail in real repos. 🍞 Anchor: Agents that combine file search, AutoDiff edits, test execution, and verifier checks outperform plain prompt-only models.
🍞 Hook: Picture three different ways to understand the same idea.
🥬 Multiple Analogies:
- City Planner: The issue is a traffic jam; the agent studies the map (repo graph), tests detours (run tests), adjusts a road (patch), and checks city flow (regressions).
- Kitchen Brigade: The issue is a salty soup; the agent tastes (reproduce), identifies the pot (localize), adds water (patch), and re-tastes (tests) before serving.
- Orchestra Conductor: The issue is off-beat percussion; the agent listens (logs), pinpoints drums (localize), changes tempo (patch), and rehearses (tests) till the ensemble is in sync. 🍞 Anchor: In each analogy, success comes from tools, feedback, and coordination—not just guessing a fix.
🍞 Hook: What changes when we adopt this idea?
🥬 Before vs. After:
- Before: Short snippets, static context, optimistic scoring, limited tools.
- After: Repository-wide reasoning, automated environments, richer tools (localization, code search, verifiers), memory, and stronger evaluation (Verified/Live/Multilingual/Multimodal). Why it matters: Performance goes up and results are more trustworthy across languages and visual tasks. 🍞 Anchor: A model that once solved 1-in-10 Python issues can now, with tools and better data, solve many more across Python, Java, and JS.
🍞 Hook: Why does this approach work logically?
🥬 Why It Works (intuition without math):
- Narrowing the search: Localization and retrieval shrink the haystack around the needle.
- Feedback loops: Running tests gives immediate truth signals that guide better edits.
- Structure over free-form: AutoDiff and patch formats reduce silly editing mistakes.
- Memory: Don’t repeat errors—reuse strategies that worked before.
- Test-time scaling: Explore multiple promising paths and keep the best. Why it matters: Each part acts like a superpower; together they compound. 🍞 Anchor: Monte Carlo Tree Search (MCTS) plus verifiers helps the agent avoid getting stuck and backtrack to better choices.
🍞 Hook: Let’s break the big idea into bite-sized blocks.
🥬 Building Blocks (with mini sandwich intros):
- Tools 🍞 Hook: Like a toolbox with a screwdriver, wrench, and tester. 🥬 What: Bug reproduction, fault localization (e.g., SBFL), code search (BM25, AST), patching (AutoDiff), validation (tests, LSP), and test generation. How: Each tool handles a stage in the pipeline. Why: Without tools, the agent acts blindly. 🍞 Anchor: Issue2Test writes a failing test first; then the agent knows exactly what to fix.
- Memory 🍞 Hook: Think of a notebook of past puzzles and tricks. 🥬 What: Stores useful traces, strategies, and summaries. How: Save successful/failed attempts; retrieve relevant ones next time. Why: Without memory, the agent repeats mistakes and wastes tokens. 🍞 Anchor: SWE-Exp/ReasoningBank recall a localization trick that worked on a similar repo (a toy sketch of such an experience bank appears right after this list).
- Inference-time Scaling 🍞 Hook: Like trying multiple routes to school and picking the fastest. 🥬 What: Run several reasoning paths (or tree search) in parallel, then select using verifiers. How: MCTS, parallel rollouts, strong selection heuristics. Why: One shot is risky; many shots plus checking boosts reliability. 🍞 Anchor: SWE-Search explores alternative edits; the best candidate passes all tests.
- Training (SFT/RL) 🍞 Hook: Practice with answer keys (SFT) and practice with rewards (RL). 🥬 What: SFT learns from curated examples; RL improves policy via test feedback and process rewards. How: Scale diverse data; use curricula; reward steps, not just final success. Why: Without training, base models plateau on complex repos. 🍞 Anchor: RL with process rewards teaches the agent to localize before editing, lifting success rates.
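Here is the promised toy sketch of the Memory building block: an experience bank that stores past attempts and recalls similar ones. The class names and the word-overlap similarity are purely illustrative, not the actual API of SWE-Exp or ReasoningBank.

```python
# A hypothetical sketch of an agent "experience bank": store summaries of past
# attempts and retrieve the most similar ones when a new issue arrives.

from dataclasses import dataclass, field

@dataclass
class Experience:
    issue_summary: str   # what the bug looked like
    strategy: str        # e.g. "reproduce first, then SBFL on the failing module"
    outcome: str         # "resolved" or "failed", so failures are remembered too

@dataclass
class ExperienceBank:
    entries: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.entries.append(exp)

    def recall(self, new_issue: str, k: int = 3) -> list[Experience]:
        # Toy similarity: shared-word overlap. Real systems embed and do vector search.
        query = set(new_issue.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(query & set(e.issue_summary.lower().split())),
            reverse=True,
        )
        return scored[:k]
```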
03 Methodology
🍞 Hook: Imagine a treasure hunt where the map (issue), the island (repo), and the compass (tests) guide you to the prize (a correct patch).
🥬 Overview Recipe: Input (Issue description + Repository snapshot) → Explore and Reproduce → Localize Fault → Retrieve Context → Generate Patch → Validate with Tests → Iterate or Submit → Output (Patch file) Why it matters: Skipping steps makes agents prone to guessing and unsafe. 🍞 Anchor: On scikit-learn, the agent reads the bug, finds the related file, edits carefully via AutoDiff, and runs pytest to confirm the fix.
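As a rough skeleton, the recipe above can be written as a loop. Every helper name below (reproduce, localize, retrieve, propose_patch, apply_patch, run_tests, revert_patch) is a hypothetical placeholder standing in for the components described in the following steps, not a real library call.

```python
# A high-level skeleton of the issue-resolution loop described above.
# All helpers are hypothetical placeholders for the pipeline stages below.

def resolve_issue(issue: str, repo_path: str, max_rounds: int = 5) -> str | None:
    failing_test = reproduce(issue, repo_path)            # make the bug happen
    for _ in range(max_rounds):
        suspects = localize(failing_test, repo_path)      # shrink the haystack
        context  = retrieve(issue, suspects, repo_path)   # pull only relevant code
        patch    = propose_patch(issue, context)          # structured edit (e.g. a diff)
        apply_patch(repo_path, patch)
        results  = run_tests(repo_path)                   # ground-truth feedback signal
        if results.all_green():
            return patch                                  # submit
        revert_patch(repo_path, patch)                    # back off and try another path
    return None                                           # budget exhausted
```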
- Input and Setup 🍞 Hook: Before baking a cake, you gather ingredients and preheat the oven. 🥬 What: The system loads the issue text and checks out the exact repo commit; a Docker/Conda environment ensures reproducibility. How:
- Parse the issue for symptoms, versions, and hints.
- Clone repo at the specified commit.
- Build environment using CI-configured steps. Why: Without a stable environment, results are flaky and irreproducible. 🍞 Anchor: SWE-bench-Live and RepoForge automatically build per-issue images and verify they run tests.
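A minimal sketch of that setup step, assuming the harness has already produced a Dockerfile encoding CI-derived install steps; paths, tags, and commands are illustrative.

```python
# A minimal sketch of reproducible per-issue setup: check out the exact commit,
# then build and run commands inside a container.

import subprocess

def setup_instance(repo_url: str, base_commit: str, workdir: str, image_tag: str) -> None:
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    subprocess.run(["git", "-C", workdir, "checkout", base_commit], check=True)
    # Assumes the harness generated a Dockerfile with CI-derived install steps.
    subprocess.run(["docker", "build", "-t", image_tag, workdir], check=True)

def run_in_env(image_tag: str, command: str) -> subprocess.CompletedProcess:
    # Each command runs in a fresh container so results stay reproducible.
    return subprocess.run(
        ["docker", "run", "--rm", image_tag, "bash", "-lc", command],
        capture_output=True, text=True,
    )
```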
- Explore and Reproduce the Bug 🍞 Hook: Detectives first try to make the problem happen again. 🥬 What: Run commands or tests to trigger the failure. How:
- Use bug reproduction scripts or generate failing tests (Issue2Test/Otter).
- Capture stack traces and logs. Why: Without reproduction, you can’t be sure you’re fixing the right thing. 🍞 Anchor: A failing test that asserts a wrong gradient output gives a clear target signal.
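A small sketch of reproduction with pytest: run a candidate failing test and keep the traceback as the target signal. The test id is hypothetical.

```python
# Run one suspected failing test and capture its traceback so later steps
# (localization, patching) have a concrete target.

import subprocess

def reproduce(test_id: str = "tests/test_save.py::test_save_crash") -> str | None:
    proc = subprocess.run(
        ["python", "-m", "pytest", test_id, "-x", "--tb=long"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return proc.stdout + proc.stderr   # the stack trace becomes a localization hint
    return None                            # could not reproduce: fixing would be guesswork
```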
- Fault Localization 🍞 Hook: When a light goes out, you check the bulb, switch, or fuse box. 🥬 What: Narrow down suspicious files/lines. How:
- Spectrum-based localization (SBFL) and coverage.
- Graph-based methods (dependency/call graphs) to trace fault flow.
- Heuristics from error logs and commit history. Why: Without localization, search space explodes and the agent wastes tokens. 🍞 Anchor: SBFL ranks gradient_boosting.py near the top; the agent examines it first.
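Here is a minimal SBFL sketch using the Ochiai formula: a line covered mostly by failing tests and rarely by passing ones ranks as most suspicious. The coverage data shown is illustrative.

```python
# Spectrum-based fault localization (SBFL) with the Ochiai formula.

import math

def ochiai(exec_failed: int, exec_passed: int, total_failed: int) -> float:
    # Suspiciousness is high when a line is executed by many failing tests
    # and few passing ones.
    denom = math.sqrt(total_failed * (exec_failed + exec_passed))
    return exec_failed / denom if denom else 0.0

def rank_lines(coverage: dict[str, dict[str, int]], total_failed: int) -> list[tuple[str, float]]:
    # coverage maps "file.py:line" -> {"failed": n, "passed": m}
    scores = {
        loc: ochiai(c["failed"], c["passed"], total_failed)
        for loc, c in coverage.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example (illustrative numbers):
# rank_lines({"gradient_boosting.py:412": {"failed": 3, "passed": 1},
#             "utils.py:88": {"failed": 0, "passed": 40}}, total_failed=3)
```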
- Retrieve Context (Code Search) 🍞 Hook: Finding a sentence is easier if you first pick the right chapter. 🥬 What: Pull only the most relevant files, functions, and docs into the prompt. How:
- BM25, AST-aware chunking, LSP-based navigation.
- Knowledge graphs for cross-file relationships.
- Iterative retrieval that expands/zooms as needed. Why: Without smart retrieval, the model hallucinates or misses critical dependencies. 🍞 Anchor: The agent fetches the target class, its tests, and one helper module referenced in the stack trace.
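A small retrieval sketch using lexical BM25 (here via the rank_bm25 package) over function-sized chunks; real systems layer AST-aware chunking, LSP navigation, and knowledge graphs on top, so treat this as a baseline illustration.

```python
# Lexical code retrieval with BM25: score each code chunk against the issue text
# and keep only the top matches for the prompt.

from rank_bm25 import BM25Okapi

def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    tokenized = [c.lower().split() for c in chunks]   # real systems chunk by AST node
    bm25 = BM25Okapi(tokenized)
    return bm25.get_top_n(query.lower().split(), chunks, n=top_k)

# Usage: retrieve("warm_start crashes in GradientBoosting.fit", function_chunks)
```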
- Generate Patch (Editing) 🍞 Hook: Use a careful eraser and a fine pen, not a paint bucket. 🥬 What: Produce an exact code edit using stable formats (AutoDiff) and style rules. How:
- Constrain edits (no wild multi-file changes on first try).
- Follow inferred specifications (SpecRover) and repository conventions.
- Consider multiple candidates when uncertain. Why: Without structure, agents break syntax or patch the wrong lines. 🍞 Anchor: The patch changes 2 lines, adds a guard for warm starts, and updates a docstring.
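The sketch below shows one structured edit format in the spirit of AutoDiff-style patching: the model emits exact search/replace blocks and the harness rejects ambiguous edits. The format is illustrative, not the survey's exact specification.

```python
# A structured edit: the "search" block must match the file verbatim, exactly
# once, before the "replace" block is applied. This blocks sloppy free-form edits.

from dataclasses import dataclass

@dataclass
class Edit:
    path: str
    search: str    # must appear in the file exactly once
    replace: str

def apply_edit(edit: Edit) -> None:
    with open(edit.path, encoding="utf-8") as f:
        text = f.read()
    if text.count(edit.search) != 1:
        raise ValueError(f"search block must match exactly once in {edit.path}")
    with open(edit.path, "w", encoding="utf-8") as f:
        f.write(text.replace(edit.search, edit.replace, 1))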
- Validate with Tests (Verification) 🍞 Hook: Chefs always taste the soup before serving guests. 🥬 What: Run unit/system tests in a sandbox; optionally do static checks (LSP) and QA agent review. How:
- Execute the repo’s test suite (fail-to-pass tests, F2P, must now pass; pass-to-pass tests, P2P, must stay green).
- Parse logs robustly across frameworks.
- If failure, classify error and guide the next edit. Why: Without verification, agents can fix one thing and break three others. 🍞 Anchor: After patching, all tests pass and no new failures appear.
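A minimal verification sketch: run the fail-to-pass tests plus a regression subset, then either accept the patch or return an error class for the next round. Test ids and failure messages are illustrative.

```python
# Verification: the patch must make the failing tests pass (F2P) without
# breaking any previously passing tests (P2P).

import subprocess

def run_suite(test_ids: list[str]) -> dict[str, bool]:
    results = {}
    for tid in test_ids:
        proc = subprocess.run(["python", "-m", "pytest", tid, "-q"],
                              capture_output=True, text=True)
        results[tid] = (proc.returncode == 0)
    return results

def verify(f2p: list[str], p2p: list[str]) -> tuple[bool, str]:
    f2p_results, p2p_results = run_suite(f2p), run_suite(p2p)
    if not all(f2p_results.values()):
        return False, "bug not fixed: a fail-to-pass test still fails"
    if not all(p2p_results.values()):
        return False, "regression: a previously passing test now fails"
    return True, "all green"
```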
- Iterate or Submit (Search Strategy) 🍞 Hook: If a route is blocked, try a different street. 🥬 What: Use non-linear search (MCTS), parallel rollouts, and verifiers to choose the best candidate. How:
- Branch on different patches or retrieval scopes.
- Keep memory of what didn’t work to avoid loops.
- Stop when tests fully pass or budget is reached. Why: Without smart search, agents overthink or get stuck. 🍞 Anchor: Three candidate patches run; the one that passes smoke tests and full suite is kept.
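Stripped to its simplest form, this search step is best-of-n with a verifier; MCTS-style systems add tree expansion and backtracking on top of the same idea. The helpers below are the hypothetical placeholders from the earlier sketches.

```python
# Best-of-n candidate search: sample several patches, check each with the
# verifier, keep the first one that passes. propose_patch / apply_patch /
# revert_patch are the hypothetical placeholders from the pipeline skeleton;
# verify is the validation sketch above. Test ids are illustrative.

def best_of_n(issue: str, context: str, repo_path: str, n: int = 3):
    best = None
    for _ in range(n):
        patch = propose_patch(issue, context)          # sampled with some temperature
        apply_patch(repo_path, patch)
        ok, reason = verify(f2p=["tests/test_save.py::test_save_crash"],
                            p2p=["tests/test_core.py"])
        revert_patch(repo_path, patch)                 # judge candidates independently
        if ok:
            best = patch                               # keep the verified candidate
            break
    return best
```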
- Training Paths (optional but powerful) A) Supervised Fine-Tuning (SFT) 🍞 Hook: Practice with worked examples. 🥬 What: Train on curated issue→trajectory→patch datasets (real + synthetic). How:
- Scale data (SWE-Smith/SWE-Fixer/SWE-Flow, etc.).
- Use curricula (from broad trajectories to clean, high-quality subsets).
- Rejection-sample on only verified good traces. Why: Without SFT, base models lack repo-specific instincts. 🍞 Anchor: An SFT model learns to prefer AutoDiff edits over free-form code blocks.
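A small sketch of rejection sampling for SFT data: keep only trajectories whose final patch passed verification, then turn them into prompt/response pairs. The trajectory fields are assumptions for illustration.

```python
# Rejection sampling for SFT: drop trajectories that did not end in a verified
# fix, then format the survivors as training examples.

def build_sft_dataset(trajectories: list[dict]) -> list[dict]:
    dataset = []
    for traj in trajectories:
        if not traj.get("resolved"):          # keep only verified good traces
            continue
        dataset.append({
            "prompt": traj["issue"] + "\n\n" + traj["retrieved_context"],
            "response": traj["patch"],        # structured edits, not free-form code blocks
        })
    return dataset
```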
B) Reinforcement Learning (RL) 🍞 Hook: Learn by trying and getting scored. 🥬 What: Optimize policies using outcome rewards (tests pass) and process rewards (good steps like correct localization). How:
- Algorithms: GRPO, PPO, DPO variants.
- Scaffolds: OpenHands, Agentless, SWE-Gym/R2E-Gym.
- Rewards: combine pass/fail with shaped signals for intermediate skills. Why: Outcome-only rewards are too sparse; process rewards speed learning. 🍞 Anchor: With process rewards, a 7–14B agent improves localization-first behavior and final success.
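A toy sketch of a shaped reward that mixes the sparse outcome signal with small process terms; the weights are made up for illustration, not values from the survey.

```python
# Shaped RL reward: outcome (tests pass) plus small bonuses for good
# intermediate steps such as correct localization. Weights are illustrative.

def shaped_reward(resolved: bool, localized_correct_file: bool,
                  patch_applied_cleanly: bool, wrote_failing_test: bool) -> float:
    reward = 1.0 if resolved else 0.0                   # sparse outcome reward
    reward += 0.2 if localized_correct_file else 0.0    # process: found the right place
    reward += 0.1 if patch_applied_cleanly else 0.0     # process: valid, well-formed edit
    reward += 0.1 if wrote_failing_test else 0.0        # process: reproduced before fixing
    return reward
```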
Secret Sauce (why this survey’s recipe works)
- Unified Taxonomy: Data (collection/synthesis), Methods (training-free/training-based), Analysis (data/methods). Teams can plug pieces in confidently.
- Automated Data Pipelines: RepoLaunch/Factory/Forge build and verify environments at scale.
- Robust Verification: SWE-bench Verified/SPICE curb data leaks and weak tests.
- Practical Modules: Tools + Memory + Test-time Scaling close the gap between demos and production.
04 Experiments & Results
🍞 Hook: Think of a science fair where each robot tries to fix real gizmos while judges watch carefully.
🥬 The Test: What did they measure and why?
- Resolved Rate: Did the agent’s patch make failing tests pass without breaking others? This is the main scoreboard.
- Patch Validity and Code Quality: Is the patch correctly applied, minimal, and aligned with project style?
- Efficiency and Safety: How much compute/token cost per solve? Any unsafe behavior or cheating? Why it matters: Winning isn’t just fixing; it’s fixing reliably, efficiently, and safely. 🍞 Anchor: SWE-bench Verified is like pro judges ensuring the gizmo really works and no rules were broken.
🍞 Hook: Who were the competitors?
🥬 The Competition:
- Training-free agents with strong tools (SWE-agent, AutoCodeRover, OpenHands variants, Agentless workflows).
- SFT-trained models specialized for SWE tasks (SWE-Lego, SWE-Swiss, Devstral, Co-PatcheR).
- RL-enhanced models with outcome and/or process rewards (SWE-RL, DeepSWE, SeamlessFlow, Satori-SWE, SEAlign).
- General foundation models with powerful inference scaffolds (MiMo-V2-Flash, DeepSeek V3.2, Qwen3-Coder, GLM-4.6), serving as baselines for what big models can do with the right tooling. Why it matters: Comparing across these groups shows the gains from tools, data, SFT, and RL. 🍞 Anchor: Big general models with a good scaffold can be strong, but SWE-specialized training plus tools often narrows or beats that gap at lower sizes.
🍞 Hook: What does the scoreboard look like with context?
🥬 The Scoreboard (with meaning):
- Early agents on raw SWE-bench struggled, sometimes below 10–20% on strict subsets; after better data and tools, mid-range systems achieved results comparable to a solid B grade.
- Top-tier setups on curated/verified tracks report much higher rates—like jumping from a class average of B− to A/A+—but often with significant compute budgets and careful scaffolds.
- Multilingual and multimodal tracks remain tougher; performance there is more like a C+ to B right now, showing headroom. Why it matters: Scores improve with cleaner data, verified environments, tool-rich scaffolds, and thoughtful training, but gains aren’t uniform across domains. 🍞 Anchor: An agent might score high on Python but drop on TypeScript or on visual UI fixes, signaling uneven strengths.
🍞 Hook: Any surprises?
🥬 Surprising Findings:
- Data quality matters as much as algorithms: when weak tests or leaks are removed, some “wins” vanish.
- Small/medium models with RL and process rewards can rival much larger models, especially when tools and verifiers are strong.
- Overthinking is real: agents sometimes spend tokens “thinking” instead of running a simple test that would quickly guide them.
- Memory helps: experience banks reduce repeated mistakes and shorten solve times across related repos. Why it matters: Smart engineering (data, rewards, tools) can beat raw parameter count in many cases. 🍞 Anchor: A 14–32B agent trained with process rewards and MCTS search sometimes catches up to or surpasses much larger base models on verified tracks.
🍞 Hook: So what’s the takeaway?
🥬 Takeaway: Benchmarks evolved (Verified, Live, Multilingual/Multimodal), methods matured (tools, memory, search, SFT/RL), and results climbed—but efficiency, safety, and generalization remain open fronts. 🍞 Anchor: Think of a team that went from barely finishing the race to running near the front pack, now training to sprint up hills (multimodal) without using extra fuel (efficiency).
05 Discussion & Limitations
🍞 Hook: Even great teams have weak spots and training needs.
🥬 Limitations:
- Compute and Storage: Training and evaluation require many parallel containers; test-time scaling can be costly.
- Efficiency Blind Spots: Many leaderboards emphasize success rate over time/cost, hiding practical trade-offs.
- Multimodal Gaps: Visual UI issues are under-tested; simply turning images into text loses crucial alignment.
- Safety: Agents have, in rare cases, deleted codebases or gamed evaluations; more safeguards and audits are needed.
- Data Leaks and Weak Tests: Some benchmarks inadvertently reveal solutions or accept wrong patches. 🍞 Anchor: A robot that wins in a lab with spare parts and no budget limits might struggle in a busy factory.
🥬 Required Resources:
- Reliable environment builders (Docker/Conda, CI workflow parsing), GPU/CPU clusters for parallel rollouts, and robust log parsers.
- Datasets at scale (collection + synthesis) and verifiers (SPICE-like) to ensure clean signals. 🍞 Anchor: Think of needing both a clean kitchen and fresh ingredients before cooking reliably.
🥬 When NOT to Use:
- Highly sensitive production repos without strict sandboxes and human oversight.
- Projects lacking runnable tests or with flaky, long-running suites; agents will thrash.
- Tight budgets where token/compute costs must be minimal; start with targeted tools, not full autonomy. 🍞 Anchor: Don’t let a rookie driver test a race car on a busy highway.
🥬 Open Questions:
- Can we design fine-grained, trustworthy process rewards that transfer across repos and languages?
- How to standardize efficiency and safety metrics across leaderboards?
- What are best practices for multimodal code-UI alignment (beyond textifying images)?
- How to automate decontamination and benchmark upkeep at scale?
- Can autonomous context management prevent “context rot” and slash token costs? 🍞 Anchor: The next playbook needs rules for speed, fairness, and safe driving—across many tracks and weather conditions.
06 Conclusion & Future Work
🍞 Hook: Imagine a field guide that finally labels every trail, tool, and hazard sign in a giant coding forest.
🥬 3-Sentence Summary: This survey organizes LLM-based issue resolution into a clear framework of Data (collection/synthesis), Methods (training-free with tools/memory/search vs. training-based with SFT/RL), and Analysis (data/method audits). By connecting realistic benchmarks (like SWE-bench Verified, Multilingual, and Multimodal) to practical agent designs, it explains why success needs runnable environments, robust tools, and trustworthy evaluation. It also surfaces key challenges—efficiency, safety, multimodal gaps, and data hygiene—and points to concrete solutions.
🥬 Main Achievement: A comprehensive, accurate taxonomy plus a living repository that lets researchers and practitioners pick the right components, compare fairly, and build stronger, safer agents.
🥬 Future Directions: Standardize efficiency/safety metrics; design finer process rewards; build richer multimodal datasets; develop autonomous context management; and advance automated decontamination/verification pipelines. Expect smaller, specialized models with strong tools and RL to compete with giants in many settings.
🍞 Anchor: If you remember one thing, remember this: real bug fixing is a team sport—data, tools, memory, search, and training must play together on a verified field for wins to truly count.
Practical Applications
- Automated triage and fixing of low-risk bugs overnight with human review in the morning.
- Continuous maintenance agents that detect regressions and propose patches after dependency updates.
- Cross-language bug localization for polyglot repos (Python, Java, JS/TS, Go, Rust, etc.).
- Frontend/UI issue fixing with multimodal agents that align screenshots and code changes.
- Enterprise codebase onboarding: agents build runnable environments and generate reproduction tests.
- Security hardening: agents propose safer patches and run verifiers/static checks before PRs.
- Developer coaching: agents explain failing tests, suggest minimal diffs, and cite related code.
- Benchmarking and vendor evaluation using SWE-bench Verified to choose reliable tooling.
- Cost-aware pipelines that balance test-time scaling with token/compute budgets.
- Building in-house datasets (collection + synthesis) to fine-tune agents on company-specific patterns.