FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation
Key Summary
- This paper builds an AI team that can make real full-stack websites (frontend, backend, and database) from plain English instructions.
- Most earlier systems faked interactivity with pretty pages but no real data flow; this work fixes that with planning, coding, and strong debugging tools.
- FullStack-Dev is a multi-agent framework with a planner plus separate frontend and backend coders, each armed with smart debugging tools.
- FullStack-Learn teaches the AI to be a better developer by turning real GitHub repos into step-by-step building lessons using back-translation and augmentation.
- FullStack-Bench fairly tests websites at three levels (frontend, backend, and database) and only gives credit when database actions truly happen.
- On the new benchmark, FullStack-Dev beats the previous best by 8.7% (frontend), 38.2% (backend), and 15.9% (database).
- Self-improvement training boosts a 30B model by 9.7% (frontend), 9.5% (backend), and 2.8% (database) without using a bigger teacher model.
- Special debugging tools (a Postman-like backend tester and a GUI-agent with error watching) cut mistakes and speed up coding.
- The approach scales by creating more high-quality training data through repository augmentation and careful filtering.
- Human checks show the benchmark's judging is reliable (over 90% agreement).
Why This Research Matters
Websites run our lives: shopping, learning, donating, scheduling, and more. If an AI builds a site that looks right but doesn't really save data, people lose trust and businesses lose money. This work shows how to make AI-generated sites genuinely full-stack by planning carefully, testing the backend and database, and learning from real code. The result is fewer fake successes, quicker bug fixes, and more production-like behavior. It also sets a fairer standard for judging website builders by checking actual data flow. Over time, this can make software creation faster, safer, and more dependable for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you ask a friend to build a lemonade stand website. It looks beautiful: buttons, pictures, animations. You click "Submit Order," see a friendly "Success!" message… but no lemons get counted and no orders are stored. It's all looks, no lemonade.
🥬 Filling (The Actual Concept: Agentic Coding) • What it is: Agentic coding means an AI doesn't just write code once; it plans, decides, runs tools, tests, and fixes bugs like a real developer. • How it works:
- Read your instruction (what you want built).
- Make a plan for parts to build.
- Write some code.
- Run the code and tools.
- See errors and fix them.
- Repeat until it works. • Why it matters: Without this, the AI often makes a single pretty page and stops, leaving out the hard parts (like servers and databases) that make real websites actually function. 🍞 Bottom Bread (Anchor): When you ask for "a recipe app that saves favorites," agentic coding builds both the page you see and the server that truly saves your picks.
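The plan-run-fix cycle above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `runTests` and `fixBug` are stand-ins for real tool calls and LLM edits.

```typescript
// Toy agentic coding loop: plan -> run -> observe errors -> fix -> repeat.

type TestResult = { passed: boolean; error?: string };

function runTests(code: string): TestResult {
  // Stand-in check: the code "passes" once the known bug marker is gone.
  return code.includes("BUG")
    ? { passed: false, error: "found BUG marker" }
    : { passed: true };
}

function fixBug(code: string, _error: string): string {
  // Stand-in for an LLM edit addressing the reported error.
  return code.replace("BUG", "FIXED");
}

function agenticLoop(initialCode: string, maxIters = 5): string {
  let code = initialCode;
  for (let i = 0; i < maxIters; i++) {
    const result = runTests(code);      // run the code and tools
    if (result.passed) return code;     // done when it works
    code = fixBug(code, result.error!); // see errors and fix them
  }
  return code;
}

console.log(agenticLoop("let x = 1; // BUG: off by one"));
```

The point is the loop structure itself: the agent keeps iterating until the checks pass, rather than emitting code once and stopping.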
The World Before: LLM-powered coding agents could spin up nice-looking pages fast. Many products and papers showed clickable components, forms that "submitted," and charts that "updated." But under the hood, lots of these demos used mock data or no data at all. Benchmarks mostly judged by watching screens: if it looked right, it passed. This rewarded paint over plumbing.
The Problem: Real full-stack apps are hard. You need a frontend (what users see), a backend (APIs that do the work), and a database (where data lives). The tricky parts include:
- Managing data flow end-to-end (form → API → database → API → page reload),
- Handling huge codebases with complex folders and ever-changing dependencies,
- Locating obscure, messy bugs quickly.
🍞 Top Bread (Hook): You know how a restaurant works: the waiter takes orders, the kitchen cooks, and the pantry stores ingredients. 🥬 Filling (The Actual Concept: Database Interaction) • What it is: Database interaction is how apps store and fetch information (like orders) so it's not lost. • How it works:
- Frontend sends a request (e.g., save new order) to the backend.
- Backend talks to the database to insert or read rows.
- Backend returns results, which the frontend shows. • Why it matters: Without real database calls, the site can pretend to work, but nothing is actually saved. 🍞 Bottom Bread (Anchor): When you add a book to your wishlist, the database keeps it there so it still shows up next time you visit.
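Here is a minimal sketch of that waiter-kitchen-pantry loop. Everything is simulated in memory (no real server or SQL), and names like `handleRequest` and `ordersTable` are made up for illustration.

```typescript
// In-memory simulation of the frontend -> backend -> database flow.

type Order = { id: number; item: string };
const ordersTable: Order[] = []; // stands in for a database table

// Backend: receives the request and writes a row to the "database".
function handleRequest(_method: string, body: { item: string }): Order {
  const row: Order = { id: ordersTable.length + 1, item: body.item };
  ordersTable.push(row); // the real INSERT would happen here
  return row;            // response the frontend will render
}

// Frontend: sends the order, then shows what the backend returned.
const saved = handleRequest("POST", { item: "lemonade" });
console.log(`Saved order #${saved.id}: ${saved.item}`);
console.log(`Rows in table: ${ordersTable.length}`); // data truly persisted
```

A "fake-it" frontend would skip `handleRequest` entirely and just print the success message; the table would stay empty, which is exactly what database-validated testing catches.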
Failed Attempts: Previous systems often:
- Generated only HTML/CSS (no server, no DB),
- Stuffed all code into one huge context instead of navigating files smartly,
- Used GUI agents that clicked around blindly, missing backend mistakes,
- "Passed" tests that never checked if the database was touched.
The Gap: The field needed:
- A developer-like process (planning, coding, targeted debugging),
- A way to train models on real, working codebases (not just toy prompts),
- A benchmark that demands real backend and database activity, not just pretty screens.
🍞 Top Bread (Hook): Picture two halves of a zipper: they only work when they interlock. 🥬 Filling (The Actual Concept: Frontend and Backend Integration) • What it is: Making the visible page and the hidden server work together seamlessly. • How it works:
- Frontend reads the plan for what data it needs.
- It calls backend APIs for that data.
- Backend talks to the database and returns real results.
- Frontend shows the live data and sends updates back. • Why it matters: Without this, pages show fake results or nothing at all. 🍞 Bottom Bread (Anchor): A signup page that truly creates an account in the database and then logs you in.
Real Stakes (Why You Should Care):
- If a charity site says "Donation received!" but never records it, people lose trust.
- If a store shows inventory that isnāt in the database, customers get angry.
- If student grades appear updated on screen but never save, schools suffer. We need websites that are not just beautiful but truthful: driven by real data and verified by strong tests. This paper's system (FullStack-Agent) aims to deliver exactly that: real plumbing, not just paint.
02 Core Idea
The "Aha!" Moment in One Sentence: Treat the AI like a real dev team that plans, codes, debugs with the right tools, learns from real repos, and is graded by tests that only pass when the database truly changes.
Three Analogies for the Same Idea:
- Cooking Show: Don't just plate pretty food; the meal must be fully cooked through. Use thermometers (debug tools), recipes (plans), taste tests (benchmark), and learn by recreating famous dishes (back-translation from real repos).
- Sports Team: Assign roles: coach (planner), offense (frontend), defense (backend); review plays on video (debug logs). Practice drills on real game tapes (repos). Score only counts when the ball crosses the goal line (DB writes).
- House Building: Architect (plan), interior designer (frontend), electrician/plumber (backend), inspectors (benchmark). Learn by studying existing houses and rebuilding them (back-translation) and trying variations (augmentation).
🍞 Top Bread (Hook): You know how a group project works better when each person has a clear job? 🥬 Filling (The Actual Concept: Multi-Agent System) • What it is: A team of cooperating AIs, each with a role (planner, frontend coder, backend coder), sharing information to build faster and better. • How it works:
- Planner turns your idea into precise designs (what pages, what APIs, what data types).
- Backend agent implements APIs and connects the database.
- Frontend agent builds pages that call those APIs.
- They use tools to run, test, and fix issues. • Why it matters: One giant agent gets overwhelmed; a team mirrors real software development and reduces mistakes. 🍞 Bottom Bread (Anchor): When you ask for a "task manager with tags," the planner defines endpoints, the backend builds them, and the frontend consumes them, like teammates passing the ball.
Before vs After:
- Before: Systems often faked success on the screen without real data changes. Benchmarks couldn't reliably catch that.
- After: The AI builds full-stack apps with real data flow. Tests check not just the UI, but also whether APIs work and the database was actually touched.
Why It Works (Intuition, No Equations):
- Structure reduces confusion: A planner's typed JSON blueprint tells everyone exactly what to build.
- Tools target the hard parts: A GUI-agent that watches error logs and a Postman-like API tester zero in on real bugs fast.
- Learning from the real world: Turning existing repos into step-by-step building lessons (back-translation) teaches practical patterns and edge cases.
- Honest grading: Tests only count when database logs show the right operations, so "fake-it" frontends don't skate by.
Building Blocks (The Key Pieces):
- FullStack-Dev: A multi-agent development line with specialized debugging tools.
- FullStack-Learn: A self-improvement loop that studies existing repos by back-translation and creates more data through augmentation.
- FullStack-Bench: A three-part exam (frontend, backend, database) that validates true functionality.
🍞 Top Bread (Hook): Imagine reading a great novel and then rewriting it, chapter by chapter, to learn how the author built it. 🥬 Filling (The Actual Concept: Back-Translation) • What it is: Convert a finished repository into a step-by-step "how to build it from scratch" trajectory for training. • How it works:
- An agent reads the repo to summarize features and quality.
- Another agent rebuilds it inside a clean template, following high-level plans.
- A rule-based cleaner removes any direct copying traces and replays tool outputs. • Why it matters: This turns real, working codebases into clear lessons the AI can learn from, at scale. 🍞 Bottom Bread (Anchor): It's like watching a chef deconstruct a cake to teach you the exact steps to bake it yourself.
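As a rough picture of what one back-translated lesson might look like as data, here is an assumed trajectory shape. The field names are ours, not the paper's: a recovered user instruction, a high-level plan, and ordered steps with replayed tool calls.

```typescript
// Hypothetical shape of a back-translated training trajectory.

type ToolCall = { tool: string; args: Record<string, string> };
type TrajectoryStep = { thought: string; call: ToolCall; observation: string };

interface Trajectory {
  instruction: string;     // plausible user request recovered from the repo
  plan: string[];          // high-level build plan
  steps: TrajectoryStep[]; // replayed tool calls with consistent outputs
}

const example: Trajectory = {
  instruction: "Build a blog with posts stored in a database",
  plan: ["Define Post entity", "Implement /api/posts", "Build post list page"],
  steps: [
    {
      thought: "Create the Post entity first",
      call: { tool: "write_file", args: { path: "src/post.entity.ts" } },
      observation: "file written",
    },
  ],
};

console.log(`${example.steps.length} step(s), ${example.plan.length} plan item(s)`);
```

The key property is ordering: the steps read as a coherent build-from-scratch session, not as a dump of the finished repo.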
🍞 Top Bread (Hook): Think of practicing a song in different keys to get better faster. 🥬 Filling (The Actual Concept: Iteration & Self-Improvement via FullStack-Learn) • What it is: The model improves itself in rounds by generating training data from real and augmented repos, then retraining. • How it works:
- Round 1: Back-translate real repos; train.
- Augment repos (simplify, extend, parallel apps); back-translate more; combine with round 1.
- Final training produces a stronger developer model. • Why it matters: No need for a bigger teacher model; just smart data generation from real projects. 🍞 Bottom Bread (Anchor): Like practicing from real songs, then remixing them to learn more patterns, and performing better each time.
🍞 Top Bread (Hook): You know how exams can be easy or hard depending on what they ask? 🥬 Filling (The Actual Concept: FullStack-Bench) • What it is: A test suite that checks frontend actions, backend APIs, and database contents together. • How it works:
- GUI-agent runs UI tasks; results only count if database logs show correct operations.
- Backend judge discovers endpoints and calls them; checks responses and DB logs.
- Database judge inspects tables/rows to verify requirements. • Why it matters: If you don't touch the database when you should, you don't pass, even if the UI looked good. 🍞 Bottom Bread (Anchor): A contact form only passes when the test sees both the "message sent" on screen and an INSERT in the messages table.
03 Methodology
At a high level: Instruction → Planning Agent (frontend/backend designs) → Backend Coding Agent (APIs + DB) → Frontend Coding Agent (UI + API calls) → Debugging loops (frontend_test, backend_test) → FullStack-Bench evaluation.
Step-by-Step Details
- Planning Agent 🍞 Hook: Imagine a blueprint before building a house so the plumber and painter don't bump into each other. 🥬 The Concept: The Planning Agent produces a precise, typed JSON plan for backend (entities, API endpoints, schemas) and frontend (pages, components, data flows).
- How it works: a. Reads your instruction. b. Lists pages, sections, and data needs (frontendPlan). c. Lists entities, API routes, request/response shapes (backendPlan). d. Uses granular types (e.g., array<string>, object<{id:number}>).
- Why it matters: If the plan is fuzzy, the frontend and backend won't match; data won't flow. 🍞 Anchor: For "a navy-themed recipe app," it defines /api/recipes, request/response fields, and pages like /, /add, /recipe/{id}.
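A plan for that recipe app might look roughly like the following typed JSON. The object and its field names are our guess at the shape; only the granular type strings like `array<string>` and `object<{id:number}>` follow the description above.

```typescript
// Illustrative typed plan for the navy-themed recipe app example.

const plan = {
  backendPlan: {
    entities: [
      { name: "Recipe", fields: { id: "number", title: "string", tags: "array<string>" } },
    ],
    apis: [
      {
        route: "/api/recipes",
        method: "POST",
        request: { title: "string", tags: "array<string>" },
        response: "object<{id:number}>",
      },
    ],
  },
  frontendPlan: {
    theme: "navy",
    pages: ["/", "/add", "/recipe/{id}"],
  },
};

console.log(JSON.stringify(plan, null, 2));
```

Because both coding agents read the same blueprint, the frontend knows exactly which route to call and what shape to expect back.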
- Backend Coding Agent
- Implements the backend using tools: read_file, write_file, search_file_content, glob, run_shell_command.
- Uses the Backend Debugging Tool (backend_test): starts the server, sends a method+URL+payload, and returns the response plus logs. 🍞 Hook: Think of a chef's taste-test spoon: quick, direct feedback. 🥬 The Concept: The Backend Debugging Tool is a one-step API tester.
- How it works: a. Start backend. b. Make a request (e.g., POST /api/todos with JSON). c. Return response and console logs.
- Why it matters: Without it, testing needs many manual shell steps and is error-prone. 🍞 Anchor: Testing POST /api/signup shows 201 Created and a DB INSERT; if not, fix and retest.
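The one-step tester can be imagined like this minimal sketch: a single call takes method, URL, and payload, and hands back the response together with captured logs. The in-memory `routes` map and the log format are stand-ins for a real running backend process.

```typescript
// Sketch of a Postman-like one-step API tester: request in, response + logs out.

type Handler = (body: unknown) => { status: number; body: unknown };
const logs: string[] = [];

const routes = new Map<string, Handler>([
  ["POST /api/todos", (body) => {
    logs.push(`INSERT INTO todos: ${JSON.stringify(body)}`); // simulated DB log
    return { status: 201, body: { ok: true } };
  }],
]);

function backendTest(method: string, url: string, payload: unknown) {
  const handler = routes.get(`${method} ${url}`);
  if (!handler) return { response: { status: 404, body: null }, logs: [...logs] };
  const response = handler(payload);
  return { response, logs: [...logs] }; // one call: response plus console logs
}

const result = backendTest("POST", "/api/todos", { title: "buy lemons" });
console.log(result.response.status, result.logs[0]);
```

One call replaces the multi-step routine of starting the server in a shell, crafting a curl command, and tailing the logs separately.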
- Frontend Coding Agent
- Builds pages and components that call backend APIs; no mock data allowed when dynamic data is required.
- Uses the Frontend Debugging Tool (frontend_test): launches the app and drives a GUI-agent with a natural-language instruction. 🍞 Hook: Like a game tester playing through levels while watching the console for hidden errors. 🥬 The Concept: The Frontend Debugging Tool is a smart UI test that watches both the screen and the error logs.
- How it works: a. Start dev server(s). b. A GUI-agent clicks/inputs as instructed. c. Tool monitors browser console and terminal; on error, asks the GUI-agent which action caused it and reports details back. d. Dynamically generates atomic test cases for current progress.
- Why it matters: Without error-aware tests, the agent may chase symptoms instead of causes, wasting time. 🍞 Anchor: If clicking "Add to Cart" throws an exception in the console, the tool flags that step so the agent can fix that component.
- Dynamic Code Navigation 🍞 Hook: It's like using a live map to explore a big city instead of carrying one giant paper map. 🥬 The Concept: Dynamic code navigation means using tools (glob, search_file_content, read_file) to locate and edit only what's needed.
- How it works: a. Search file names and content by patterns. b. Read only relevant files or ranges. c. Edit the right spots with write_file/replace.
- Why it matters: Without it, agents stuff everything into context, get lost, and make mistakes. 🍞 Anchor: To change how /api/orders validates data, the agent searches for "orders controller" and edits only that file.
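A toy version of this navigate-then-edit flow, over an in-memory "repo" (the file paths and the `searchFileContent` helper are illustrative, not the real tool API):

```typescript
// Dynamic code navigation in miniature: search by pattern, touch only the hit.

const repo = new Map<string, string>([
  ["src/orders/orders.controller.ts", "function validate(order) { /* old rule */ }"],
  ["src/users/users.controller.ts", "function validate(user) { /* ... */ }"],
]);

// search_file_content analogue: paths whose content matches a pattern.
function searchFileContent(pattern: RegExp): string[] {
  return [...repo.entries()]
    .filter(([, src]) => pattern.test(src))
    .map(([path]) => path);
}

// Narrow to the orders controller instead of loading the whole repo.
const hits = searchFileContent(/validate\(order\)/);
const target = hits[0];
repo.set(target, repo.get(target)!.replace("/* old rule */", "/* stricter rule */"));

console.log(target, "->", repo.get(target));
```

The users controller is never read or rewritten; only the file the search pinpointed changes, which keeps the agent's context small and its edits safe.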
- FullStack-Learn: Repository Back-Translation 🍞 Hook: Rebuilding a finished LEGO model step-by-step teaches you the true assembly order. 🥬 The Concept: Convert existing repos into build-from-scratch trajectories.
- How it works: a. Information Gathering Agent summarizes repo purpose, quality, plus backend/frontend plans and a plausible user instruction. b. Trajectory Back-Translation Agent reproduces the repo into a clean template, guided by the plans. c. A rule-based transformer cleans references to the original repo and replays tool calls for consistent outputs.
- Why it matters: It creates high-quality, logically ordered training data grounded in real-world projects. 🍞 Anchor: A Next.js + NestJS blog is read, then faithfully re-implemented into a fresh template with the same features.
- FullStack-Learn: Repository Augmentation 🍞 Hook: Practice scales by playing the same tune in a new key. 🥬 The Concept: Scale data by modifying repos in meaningful ways.
- How it works: a. Augmentation Planning Agent proposes five plans: 1 simplify, 1 extend, 3 parallel apps. b. Augmentation Implementing Agent applies each plan, runs debugging tools, and self-verifies changes; only successful samples are kept.
- Why it matters: You get 5× more diverse, validated training trajectories without writing everything from scratch. 🍞 Anchor: Turn an event app into a recipe catalog while keeping the same folder structure and routing style.
- Iterative Self-Improvement Training 🍞 Hook: Practice, analyze, practice more, but smarter. 🥬 The Concept: Train in two rounds to get better using your own generated data.
- How it works: a. Round 1: Back-translate real GitHub repos (about 2K trajectories), train. b. Round 2: Augment repos (about 8K more trajectories), back-translate, merge with round 1 (10K total), train again.
- Why it matters: The model learns from accurate, multi-step, end-to-end examples and becomes a stronger full-stack coder without a larger teacher model. 🍞 Anchor: After round 2, the 30B model's frontend/backend/database scores jump by 9.7/9.5/2.8 points.
- FullStack-Bench (Evaluation) 🍞 Hook: A triathlon tests swimming, biking, and running, not just one. 🥬 The Concept: A benchmark that tests frontend, backend, and database separately and together.
- How it works: a. Frontend: GUI-agent interacts with pages; results count only if DB logs match expected operations. b. Backend: Judge discovers APIs, then sends requests; checks responses and DB logs. c. Database: Judge inspects schema and sample rows to verify requirements.
- Why it matters: UI-only scoring can be fooled; DB-validated scoring can't. 🍞 Anchor: A "create task" test is only YES if the UI shows success AND a tasks INSERT appears in logs.
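That DB-validated scoring rule boils down to a conjunction, sketched here (the function name and log format are ours, not the benchmark's):

```typescript
// DB-validated scoring in miniature: pass only if the UI reported success AND
// the database log contains the expected operation.

function frontendTestPasses(
  uiSaysSuccess: boolean,
  dbLogs: string[],
  expectedOp: RegExp,
): boolean {
  const dbTouched = dbLogs.some((line) => expectedOp.test(line));
  return uiSaysSuccess && dbTouched; // "fake-it" frontends fail the second check
}

// Fake success: UI shows "message sent" but nothing was inserted.
console.log(frontendTestPasses(true, [], /INSERT INTO messages/)); // false
// Real success: UI success plus a matching INSERT in the logs.
console.log(frontendTestPasses(true, ["INSERT INTO messages VALUES (...)"], /INSERT INTO messages/)); // true
```

A UI-only judge would accept the first case; requiring database evidence is what makes the grading honest.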
Secret Sauce (What's Clever)
- Typed planning that locks down data shapes end-to-end.
- Two specialized debugging tools that watch logs and streamline API testing.
- Back-translation + augmentation that turn real repos into high-quality training lessons.
- Evaluation that refuses to pass "fake-it" frontends by checking database evidence.
04 Experiments & Results
The Test: The authors propose FullStack-Bench, which reuses 101 website instructions and adds 647 frontend, 604 backend, and 389 database test cases. Frontend checks are driven by a GUI-agent and only count when database logs confirm proper operations. Backend tests discover and call APIs, judging YES/NO. Database tests check table schemas and first rows to confirm data requirements.
The Competition: Baselines include WebGen-Agent, TDDev, OpenHands, Bolt.diy, and Qwen-Code. Because many baselines default to frontend-only code, the authors explicitly prompt them to include backends and database usage when needed. Models used include Qwen3-Coder-30B-A3B-Instruct and Qwen3-Coder-480B-A35B-Instruct.
The Scoreboard (with context):
- FullStack-Dev + 480B model achieved 64.7% (frontend), 77.8% (backend), and 77.9% (database). That's like getting an A in backend and database and a solid B in frontend on a tough exam that checks the plumbing. It outperforms the previous state-of-the-art, WebGen-Agent (56.0%, 39.6%, 62.0%), by +8.7%, +38.2%, and +15.9% respectively. Appearance scores are also highest.
- FullStack-Dev + 30B model leads among 30B peers: 37.2% (frontend), 38.7% (backend), and 50.9% (database).
- Self-improvement (FullStack-Learn) on 30B: After round 1 (2K back-translated crawled repos), then round 2 (add 8K augmented repos), the final model improves over the original 30B by +9.7% frontend, +9.5% backend, +2.8% database, and a higher appearance score, without using a bigger teacher model.
Surprising Findings:
- Backend accuracy can exceed frontend accuracy in the authors' system. Many baselines still skew frontend-heavy, sometimes using mock data. FullStack-Dev's debug tools and planning likely make backend work more systematic, pushing those scores up.
- Removing the Backend Debugging Tool increases the average iterations needed by the backend agent (from ~74.9 to ~115.5), highlighting how much it speeds up development.
- Data quality matters: training on back-translated trajectories from real repos yields big gains; training on directly-generated trajectories from generic instructions barely helps.
Ablations:
- Remove multi-agent mechanism: all metrics drop; coordination matters.
- Remove Backend Debugging Tool: backend scores drop more (as expected); remove Frontend Debugging Tool: frontend scores drop more.
- Remove both debugging tools: large decreases in both frontend and backend.
- Data generation method: back-translation vs. directly-generated 2K shows that real-repo-based data substantially outperforms synthetic instruction-only data.
- Reliability: Human alignment checks on 200 samples per test type show >90% agreement (frontend ~90.5%, backend ~94.0%, database ~97.5%), suggesting the benchmark's judging is trustworthy.
Extra Notes:
- Templates: Main runs use Next.js (frontend) and NestJS (backend) for stability. Adding Vue.js and Django (allowing the system to pick) slightly boosts scores, showing adaptability.
- Compute: For 30B self-improvement, they trained 2 epochs per round at lr=2e-5, batch size 32 on 32 H800 GPUs; trajectories were decontaminated against the benchmark via 5-gram Jaccard and sentence-embedding filters.
- Inference constraints: context length up to 131,072 and up to 400 tool calls.
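The 5-gram Jaccard decontamination mentioned in the compute notes can be sketched as follows. The 0.5 drop threshold and simple whitespace tokenization are assumptions, not the paper's exact settings.

```typescript
// Sketch of 5-gram Jaccard filtering: drop a trajectory if its token 5-gram
// overlap with a benchmark instruction is too high.

function ngrams(text: string, n = 5): Set<string> {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((g) => b.has(g)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

const traj = "build a recipe app that saves favorites to a database";
const bench = "build a recipe app that saves favorites with user accounts";
const sim = jaccard(ngrams(traj), ngrams(bench));
console.log(sim > 0.5 ? "drop (contaminated)" : "keep"); // prints "keep" (3/9 overlap)
```

In practice this n-gram check would be paired with the sentence-embedding filter the authors also mention, since paraphrases share few exact 5-grams.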
Takeaway: The combination of a clear plan, specialized debugging tools, learning from real repos at scale, and DB-validated evaluation gives this system a sizable, reliable edge, especially on the backend and database parts that most earlier systems neglected.
05 Discussion & Limitations
Limitations (Be Specific):
- Framework Bias: Most data and experiments center on popular templates (e.g., Next.js, NestJS), even though expansions to Vue.js/Django work. Niche stacks might need extra prompts/tools.
- Data Dependency: Back-translation quality depends on having solid, crawlable repositories. Low-quality or license-restricted code could limit usable training data.
- Debugging Coverage: The tools are strong, but deeply subtle bugs (race conditions, complex auth flows, third-party API quirks) may still escape.
- Compute Cost: Generating, filtering, and training on 10K multi-step trajectories with 131k context and many tool calls is resource-intensive.
- UI Judgment Complexity: Even with DB checks, some nuanced frontend expectations (pixel-perfect design, animation timing) can be hard to score automatically.
Required Resources:
- Powerful LLMs (e.g., 30B–480B) with long context windows.
- Reliable sandboxes that can run full-stack dev servers and databases (PostgreSQL in this study).
- Substantial GPU time for iterative self-improvement and sufficient storage for many trajectories and logs.
When NOT to Use:
- Ultra-tight latency settings where tool calls and server spins are too slow (e.g., on-device coding assistants with few resources).
- Strictly static sites that don't need a backend/database (the system can choose to skip the backend, but simpler tools may suffice).
- Highly regulated codebases where training data cannot be derived from repos and trace cleaning/back-translation are not allowed.
Open Questions:
- How well does this generalize to microservices, serverless architectures, or real-time websockets at scale?
- Can reinforcement learning or execution-aware losses further amplify gains when paired with back-translation?
- How to extend database validation to cover complex transactional integrity, migrations, and multi-tenant setups?
- What is the best way to model long-lived development cycles (refactors, version bumps, dependency churn) over months?
- Can smaller, cheaper models get similar gains with more clever data selection or tool designs?
06 Conclusion & Future Work
3-Sentence Summary: FullStack-Agent is a complete system for making real full-stack websites using an AI team that plans, codes, and debugs with purpose-built tools. It teaches itself from real repositories via back-translation and augmentation and is graded by a benchmark that only passes work when the database truly changes. This shifts web generation from pretty demos to production-like functionality.
Main Achievement: Proving that a planner+coders architecture, paired with specialized debugging, repository back-translation, and DB-validated evaluation, delivers large, reliable gains, especially on the historically weak backend/database parts, without needing a larger teacher model.
Future Directions: Broaden to more stacks (e.g., microservices/serverless), deepen database testing (transactions, migrations, policies), add longer-horizon maintenance skills (package updates, refactors), and explore RL or execution-aware learning on top of back-translation.
Why Remember This: It demonstrates a path from "looks right" to "works right" by aligning planning, training data, tools, and tests with how real software gets built and verified, raising the bar for trustworthy AI-generated software.
Practical Applications
- Rapidly prototype startup MVPs with true data flow (no mock data) and real database writes.
- Auto-generate internal dashboards where every chart and table pulls from verified APIs.
- Build CRUD admin panels that are validated by backend and database tests out of the box.
- Convert legacy single-page demos into production-like full-stack apps using back-translation learning.
- Create learning curricula for junior devs by turning real repos into step-by-step build lessons.
- Automate smoke tests for APIs and pages with the specialized debugging tools to catch regressions early.
- Spin up hackathon projects that pass database-verified benchmarks, not just UI checks.
- Scale training data for code models by augmenting existing repositories into diverse variants.
- Evaluate third-party website generators with FullStack-Bench to ensure real backend/database functionality.
- Support migration between frameworks (e.g., Next.js to Vue.js) by reproducing functionality via back-translation.