LLM-in-Sandbox Elicits General Agentic Intelligence
Key Summary
- •This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.
- •Without extra training, strong AIs naturally use the sandbox to browse for tools, store big documents as files, and run programs to check their answers.
- •A special training method (LLM-in-Sandbox-RL) teaches weaker AIs to explore the sandbox using only general, everyday data placed as files, so they get much better too.
- •Across math, physics, chemistry, biomedicine, long documents, and instruction-following, sandbox mode often beats regular chat mode by a lot (up to +24.2% on math).
- •For huge documents, putting the text into files in the sandbox instead of the prompt cut tokens by up to 8× (about 100,000 down to 13,000), saving cost.
- •Even though sandbox mode takes more steps, it can be just as fast or faster because environment outputs are processed quickly (MiniMax saw about 2.2× higher throughput).
- •Training inside the sandbox didn’t just help sandbox mode; it unexpectedly improved normal chat mode reasoning and self-checking too.
- •The sandbox is lightweight (about 1.1 GB image; ~50–200 MB RAM per container) and scales to lots of users with little overhead.
- •The system can even create real files like maps, posters, videos, and music by installing tools and running programs, going beyond plain text answers.
- •This approach could become the default way to serve AIs: safer, more verifiable, better with long context, and able to produce real, usable artifacts.
Why This Research Matters
Many real tasks need more than words: we look things up, keep organized files, and run apps to check results. This approach lets AIs do the same safely, making them far more helpful for homework, reports, research, and creative projects. It also lowers costs by putting giant texts into files instead of prompts, and it can be as fast or faster thanks to efficient processing of environment outputs. We can now get real artifacts (like a poster.png or a map.html) instead of just descriptions, which is much closer to what people actually need. By training with general data inside the sandbox, even smaller models learn to explore purposefully and become broadly useful. Over time, this could become the default way we use AI: not just chatting, but actually getting things done.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: Imagine you’re doing homework with a computer nearby. You don’t try to remember everything; you search, save notes in folders, and run apps to check your work. Computers make hard tasks easier.
🥬 The concept (Understanding Large Language Models): A large language model (LLM) is a computer program that reads and writes text very well. How it works: 1) It studies tons of text, 2) Learns patterns of words, 3) Predicts what to say next, 4) Uses prompts (your instructions) to focus. Why it matters: Without knowing what an LLM is good at—and not good at—we can’t design the right help for it. 🍞 Anchor: When you ask an LLM, “Explain rainbows to a 6th grader,” it picks kid-friendly words because it’s learned patterns from lots of kid-friendly explanations.
The World Before: LLMs got better through a few big ideas. First, in-context learning showed you could teach them a new task by example, right in the prompt, without retraining. Then chain-of-thought prompting encouraged step-by-step reasoning. More recently, agentic frameworks let models use tools over multiple turns. Still, many hard problems remained: huge documents exceeded context windows, strict formatting was tricky, and some tasks needed special software.
🍞 Hook: You know how sometimes you can’t copy a giant chapter into a tiny sticky note? That’s what long prompts feel like to AIs.
🥬 The concept (Prompt Engineering): Prompt engineering is carefully writing instructions so the AI knows exactly what to do. How it works: 1) State the goal, 2) Add examples, 3) Specify the format, 4) Set constraints. Why it matters: Vague prompts lead to vague answers. Without good prompts, the AI may miss steps. 🍞 Anchor: “Solve and put only the number in answer.txt” is clearer than “Please help.”
🍞 Hook: Think of a group project where you talk back and forth, checking work each step. That’s more effective than a single long speech.
🥬 The concept (Multi-turn Interaction): Multi-turn interaction is a back-and-forth between the AI and its tools or environment. How it works: 1) Plan a step, 2) Take an action, 3) Read results, 4) Decide next step. Why it matters: Without feedback at each step, the AI can’t correct mistakes or explore better paths. 🍞 Anchor: A model runs a script, sees an error, fixes it, and tries again—like debugging with training wheels.
The Problem: Even with better prompting and multi-turn chats, many tasks need a computer’s basic powers: searching the web, managing big files, and running programs. LLMs alone can’t install packages, scan folders, or execute code safely. Also, weaker models “wander” when given tools, wasting steps without making progress.
Failed Attempts: Fixed tool-use systems gave LLMs a few prebuilt buttons (like a calculator API) but couldn’t cover every task. Software-engineering sandboxes were powerful but heavy: they needed big, per-task environments and were hard to scale to thousands of problems. Text-only RL training improved writing but didn’t teach how to use computers.
The Gap: We needed a light, general, safe virtual computer with three meta-capabilities—external resource access, file management, and code execution—plus a way to teach models to explore it using only general, non-specialized data.
🍞 Hook: Training a puppy works best when it gets treats for doing the right thing.
🥬 The concept (Reinforcement Learning Basics): RL teaches a model by rewarding good outcomes. How it works: 1) Try something, 2) Get a score (reward), 3) Adjust behavior to get higher scores next time. Why it matters: Without rewards tied to success, the model can’t learn which actions actually help. 🍞 Anchor: If the model writes the right answer into answer.txt, it gets a point; wrong file or wrong answer gets no point.
Real Stakes: In everyday life, we read long instructions, use folders and apps, and sometimes look up missing knowledge. An AI that can do the same—safely—could help with school projects, science problems, research summaries, structured writing, and even making real files (maps, posters, videos, music) instead of just describing them. It can also save money: storing huge texts as files instead of prompts drastically cuts token use.
02Core Idea
🍞 Hook: Imagine giving a very smart student a simple, safe laptop during a test. Suddenly, they can search, save, and compute—within the rules.
🥬 The concept (LLM-in-Sandbox): LLM-in-Sandbox puts an AI inside a tiny, safe virtual computer so it can explore, fetch tools, manage files, and run code to solve many kinds of tasks. How it works: 1) Start a clean sandbox, 2) The AI takes tool actions (like running bash or editing files), 3) It reads results and adapts, 4) It writes the final answer to a file and submits. Why it matters: Without a sandbox, the AI is stuck in text-only land; it can’t handle huge contexts, install the right tools, or verify answers by running programs. 🍞 Anchor: The AI installs a chemistry library, converts names to molecules, runs a script, and outputs the predicted property in answer.txt.
The “Aha!” Moment in one sentence: If you let an AI safely use a minimal computer with three meta-skills—external resource access, file management, and code execution—it can generalize to many non-code tasks, and you can teach weaker AIs to do this using only regular, general data.
Three analogies:
- Playground with a toolbox: The sandbox is a safe playground where the AI can try swings (internet), slides (files), and monkey bars (programs) to solve different challenges.
- Chef in a kitchen: The AI (chef) can fetch new ingredients (install packages), file recipes neatly (organize files), and cook meals (run code) to meet picky orders (strict formats).
- Detective with a lab: The AI can search archives (web), label evidence (file ops), and test clues (code) to crack tough cases (math, physics, biomed).
Before vs. After:
- Before: The AI tried to solve everything in one prompt, struggled with long texts, and guessed formats.
- After: The AI splits work: store big data as files, compute answers with scripts, verify results, and only then write the clean final output.
Why it works (intuition, no equations):
- Computers give reusable, verifiable steps. Running a script is like following the same reliable recipe every time.
- Files act like long-term memory, so the AI isn’t limited by short prompts.
- Internet/tool access lets the AI extend itself on demand, not rely only on what it already knows.
- Feedback every step (multi-turn) helps it refine plans and avoid dead ends.
Building Blocks (with mini “Sandwich” explanations):
🍞 Hook: You know how sometimes you need to look something up or download an app to finish a project? 🥬 The concept (External Resource Access): The sandbox lets the AI fetch packages or data from outside (like the internet). How it works: 1) Make a web request or install a package, 2) Receive files or tools, 3) Use them in scripts, 4) Keep results locally. Why it matters: Without it, the AI can’t add new skills or knowledge on the fly. 🍞 Anchor: For chemistry, the AI installs OPSIN and turns a chemical name into a molecular structure.
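A minimal sketch of what run-time tool acquisition can look like, assuming network access inside the sandbox and using RDKit as the stand-in toolkit (the paper's chemistry walkthrough also uses OPSIN for name-to-structure conversion, which is not shown here):

```python
# Sketch: install a domain library on the fly, then use it in the same run.
import subprocess, sys

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "rdkit"], check=True)

from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin, as a SMILES string
print(round(Descriptors.MolWt(mol), 2))                # molecular weight, ~180.16
```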
🍞 Hook: Big binders don’t fit on sticky notes; you need folders. 🥬 The concept (File Management): The AI reads, writes, and organizes files in the sandbox. How it works: 1) List files, 2) Search text (grep/sed), 3) Save scripts and notes, 4) Output final answers to a known path. Why it matters: Without files, long documents would overflow the prompt and get expensive. 🍞 Anchor: The AI searches a 100k-token report with grep, extracts just the needed lines, and saves a summary.
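A minimal sketch of that file-based search, assuming the report has been split into plain-text chapter files under /testbed/documents (the path and keyword follow the running example in this article):

```python
# Sketch: search chapter files for a keyword and keep a small context window,
# rather than pasting the whole ~100k-token report into the prompt.
from pathlib import Path

DOCS = Path("/testbed/documents")
KEYWORD = "infringement"

for doc in sorted(DOCS.glob("*.txt")):
    lines = doc.read_text(errors="ignore").splitlines()
    for i, line in enumerate(lines):
        if KEYWORD in line.lower():
            window = "\n".join(lines[max(0, i - 2): i + 3])  # two lines either side
            print(f"--- {doc.name}, line {i + 1} ---\n{window}\n")
```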
🍞 Hook: Following a recipe beats guessing ingredients. 🥬 The concept (Code Execution): The AI runs code to compute, simulate, and verify answers. How it works: 1) Write a script, 2) Run it, 3) Read results, 4) Adjust and rerun if needed. Why it matters: Without code, tricky constraints (like equal-length sentences with no shared words) are nearly impossible to satisfy reliably. 🍞 Anchor: The AI writes a Python script to check lengths and word overlap, then auto-searches for valid sentence sets.
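A minimal sketch of the kind of checker the model might write for itself here; the constraint (equal character counts, no shared words) comes straight from the example above:

```python
# Sketch: every sentence must have the same character count, and no word
# may appear in more than one sentence.
def satisfies(sentences):
    same_length = len({len(s) for s in sentences}) == 1
    word_sets = [set(s.lower().replace(".", "").split()) for s in sentences]
    disjoint = all(a.isdisjoint(b)
                   for i, a in enumerate(word_sets)
                   for b in word_sets[i + 1:])
    return same_length and disjoint

trio = [
    "Knights guarded the castle walls.",   # 33 characters each,
    "Monks copied ancient texts daily.",   # no word shared between
    "Farmers tilled muddy fields here.",   # any two sentences
]
print(satisfies(trio))  # True
```

The model can loop this check over generated candidates and only write a trio to answer.txt once it passes.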
Teaching Weaker Models (LLM-in-Sandbox-RL):
🍞 Hook: Practice in the real environment beats reading about it. 🥬 The concept (LLM-in-Sandbox Reinforcement Learning): Train the AI inside the sandbox using regular, general tasks whose materials are stored as files; reward only the correct final outcome. How it works: 1) Put context docs in /testbed/documents, 2) The AI must explore (list, open, search), 3) It assembles the answer and writes to answer.txt, 4) It gets a reward if correct. Why it matters: Without sandbox-based practice, weaker models wander; with it, they learn purposeful exploration that generalizes widely. 🍞 Anchor: A small model learns to open the right document chunk, compute the number, and place just the final answer in the output file.
03Methodology
At a high level: Input → Configure sandbox (place files, minimal tools) → Multi-turn Reason-Act loop (choose a tool, run it, read feedback) → Write final answer to /testbed/output/answer.txt → Submit.
Step 1: Configure the sandbox
- What happens: Start a lightweight Docker-based Ubuntu environment with Python and basic scientific libs. Put any big input materials (e.g., long PDFs converted to text) into /testbed/input or /testbed/documents. No heavy, task-specific images.
- Why it exists: Keeps the system fast, safe, and general. Without it, you’d either bloat storage or lack the tools to explore.
- Example: A 100,000-token industry report is split into chapter files under /testbed/documents.
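A minimal sketch of spinning up one such container per task; the image name, host paths, and memory cap below are illustrative assumptions rather than the paper's exact configuration, and it presumes Docker is available on the host:

```python
# Sketch: one small, isolated container per task, with the task's documents
# mounted read-only and an output directory for answer.txt / artifacts.
import subprocess

cmd = [
    "docker", "run", "--rm", "-d",
    "--memory", "256m",                               # small per-container footprint
    "-v", "/data/task_042/docs:/testbed/documents:ro",
    "-v", "/data/task_042/out:/testbed/output",
    "python:3.11-slim",                               # any small image with Python will do
    "sleep", "infinity",                              # keep it alive for the agent loop
]
container_id = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
print("sandbox ready:", container_id[:12])
```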
Step 2: Prompt for exploration
- What happens: The system prompt steers the model toward computation: “Write code, run it, don’t hardcode answers, store outputs in /testbed/output.” It also reminds the model to explore: install packages if needed, try scripts, and verify results.
- Why it exists: Without a clear push toward computation and file output, the model may handwave answers or forget to produce the final file.
- Example: “Please compute and place only the final integer into /testbed/output/answer.txt.”
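An illustrative paraphrase of such a system prompt, shown here as it might be stored in the harness; this is not the paper's exact wording:

```python
# Illustrative paraphrase of the exploration-forcing system prompt
# (not the paper's exact text).
SYSTEM_PROMPT = """You are working inside a Linux sandbox.
- Solve the task by writing and running code; do not hardcode answers.
- Install packages or fetch data if you need them.
- Task materials are under /testbed/documents; explore with ls, grep, and sed.
- Verify your result before finishing.
- Write ONLY the final answer to /testbed/output/answer.txt, then call submit."""
```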
🍞 Hook: Like texting back and forth while solving a puzzle. 🥬 The concept (Multi-turn Interaction): The model acts, sees results, and plans the next move. How it works: 1) Think: choose a tool call, 2) Act: run it, 3) Observe: read outputs, 4) Repeat until done. Why it matters: Without this loop, the model can’t refine its plan based on real feedback. 🍞 Anchor: It runs a script, sees an import error, installs the missing package, and tries again.
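A minimal sketch of that loop; call_model and run_tool are hypothetical placeholders for the inference backend and the sandbox's tool dispatcher:

```python
# Sketch of the Reason-Act loop: think (pick a tool call), act (run it),
# observe (feed the output back), repeat until the model calls submit.
def solve(task_prompt, call_model, run_tool, max_turns=30):
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        action = call_model(messages)               # e.g. {"tool": "execute_bash", ...}
        messages.append({"role": "assistant", "content": repr(action)})
        if action["tool"] == "submit":              # done: harvest /testbed/output
            break
        observation = run_tool(action)              # run bash, edit a file, ...
        messages.append({"role": "user", "content": observation})
    return messages
```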
Step 3: Choose and use tools (the “recipe”):
- Tool A: execute_bash (run any terminal command)
- What happens: Install packages (pip, apt), run Python scripts, use grep/sed, download files.
- Why it exists: This is the universal controller of the computer. Without it, you can’t expand abilities or execute logic.
- Example data: “pip install rdkit-pypi -q” to add a chemistry toolkit.
- Tool B: str_replace_editor (view/create/edit files)
- What happens: Create helper.py, insert code, view numbered lines, replace exact line ranges.
- Why it exists: Lets the model build real programs and iterate. Without it, no scripts = no reliable computation.
- Example data: Create extract_industries.py with a regex to find “infringement” sentences.
- Tool C: submit (finish)
- What happens: Signals the run is complete; the system collects /testbed/output/answer.txt or all files in /testbed/output.
- Why it exists: Cleanly ends exploration and extracts the final result. Without it, the system wouldn’t know when to stop.
- Example data: After printing the final number to answer.txt, the model calls submit.
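A minimal sketch of how the three tools above could be dispatched inside the container; the tool names follow the paper, but this particular mapping onto subprocess and plain file I/O is an illustrative assumption, not the authors' implementation:

```python
# Sketch of a dispatcher for execute_bash, str_replace_editor, and submit.
import subprocess
from pathlib import Path

OUTPUT_DIR = Path("/testbed/output")

def run_tool(action):
    if action["tool"] == "execute_bash":
        result = subprocess.run(action["command"], shell=True,
                                capture_output=True, text=True, timeout=120)
        return (result.stdout + result.stderr)[-4000:]       # keep logs short

    if action["tool"] == "str_replace_editor":                # simplified: create/overwrite only
        path = Path(action["path"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(action["file_text"])
        return f"wrote {path}"

    if action["tool"] == "submit":                            # collect final artifacts
        return {p.name: p.read_text(errors="ignore")
                for p in OUTPUT_DIR.glob("*") if p.is_file()}

    return f"unknown tool: {action['tool']}"
```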
Step 4: Observe feedback and iterate
- What happens: The sandbox returns command output and errors (e.g., install logs, script output, grep matches). The model updates its plan.
- Why it exists: Feedback teaches the model what worked. Without it, the model can’t self-correct.
- Example: If “ModuleNotFoundError: rdkit” appears, the model installs the right version or fallback tooling (e.g., OPSIN with Java runtime).
Step 5: Output handling
- What happens: The model must place the final, clean answer in /testbed/output/answer.txt (no explanations) or save generated artifacts (e.g., map.html, poster.png, video.mp4) in /testbed/output.
- Why it exists: Separates messy exploration from the neat final product. Without this, graders and users would get noisy text.
- Example: For instruction-following, the model writes exactly three sentences that pass its checker script, then saves them.
Concrete end-to-end examples:
- Long-context Q: “From the 2023 competition report, how many infringement notices were issued?”
- 1) ls /testbed/documents; 2) grep -n -i 'infringement' ...; 3) sed -n '240,280p' ...; 4) Parse with a short Python script; 5) Write the number to answer.txt; 6) submit.
- Tricky formatting: “Write three medieval-history sentences with identical character counts and no shared words.”
- 1) Create helper.py to count chars and compute word sets; 2) Generate candidates; 3) Filter with code; 4) Save the valid trio to answer.txt; 5) submit.
- Chemistry conversion: “Predict a property from a compound name.”
- 1) apt-get install default-jre; 2) install/locate OPSIN; 3) Convert name → SMILES; 4) Run property estimator; 5) Save the value; 6) submit.
🍞 Hook: Practicing in the actual kitchen makes you a real chef. 🥬 The concept (Reinforcement Learning in this paper): LLM-in-Sandbox-RL trains the model by storing task contexts as files and rewarding only correct final outputs. How it works: 1) Context split into files (and distractors), 2) Prior related tasks become in-prompt examples, 3) Model must explore files, compute, and write answer.txt, 4) Reward = correct final outcome. Why it matters: Without training inside the sandbox, weaker models waste turns; with it, they learn efficient, purposeful exploration that transfers to many tasks. 🍞 Anchor: After training, a small model’s average turns drop from ~24 to ~7 while its use of file, compute, and external tools meaningfully rises.
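A minimal sketch of the outcome-only reward described above, assuming each training task comes with a known reference answer and that the model's final answer is whatever it left in answer.txt; graded domains may need richer checkers than exact string match:

```python
# Sketch: reward 1.0 only if the sandbox run ends with a correct answer.txt.
from pathlib import Path

def outcome_reward(output_dir, reference_answer):
    answer_file = Path(output_dir) / "answer.txt"
    if not answer_file.exists():
        return 0.0                                   # no final file, no reward
    prediction = answer_file.read_text().strip()
    return 1.0 if prediction == reference_answer.strip() else 0.0

# e.g. outcome_reward("/testbed/output", "42") is 1.0 only when answer.txt holds "42"
```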
Secret sauce:
- Minimal, general sandbox: one small image (~1.1 GB), no per-task bloat.
- Meta-capabilities instead of fixed tools: the AI can acquire any tool it needs at run-time.
- File-based long context: slash tokens and cost by putting big text in files, not prompts.
- Outcome-only rewards: simple, scalable training that still teaches real exploration.
- Skills transfer back: sandbox training improved even normal chat-mode organization and self-checking.
04Experiments & Results
The Test: The authors compared regular chat mode (no sandbox) vs. sandbox mode across six non-code domains: Mathematics (AIME25), Physics (UGPhysics), Chemistry (ChemBench), Biomedicine (MedXpertQA), Long-Context (AA-LCR), and Instruction Following (IFBench). They also introduced sandbox-based RL training and measured generalization. They checked both accuracy and practical efficiency (tokens and speed).
The Competition: Models ranged from frontier agentic AIs (Claude-Sonnet-4.5-Think, GPT-5, DeepSeek-V3.2-Thinking, Kimi-K2-Thinking, MiniMax-M2) to specialized/open ones (Qwen3-Coder-30B-A3B) and a smaller general model (Qwen3-4B-Instruct).
The Scoreboard (with context):
- Big gains in sandbox mode for strong models across domains. Example: Qwen3-Coder on Math improved by +24.2%, like jumping from a B to a solid A.
- Instruction-following: Claude improved by +12.7 points; DeepSeek by +14.4—like going from “mostly right” to “consistently nailing picky rules.”
- Chemistry and Biomedicine saw steady bumps too (e.g., GPT-5 Chem +0.5; Physics +5.2), showing tool installs and computations pay off.
- Some weaker models got worse in sandbox mode without training (e.g., Qwen3-4B lost points), confirming they “wander” unless taught to explore.
Long-context efficiency and accuracy:
- Putting documents into sandbox files instead of the prompt boosted accuracy for many models and slashed tokens by up to 8× (e.g., ~100k → ~13k). That’s like carrying a backpack instead of stuffing everything in your pockets.
- Average token use across all tasks dropped to roughly 0.5–0.8× compared to normal chat mode, despite multi-turn exploration, thanks to offloading big text to files.
Speed and system efficiency:
- Environment outputs are processed fast (via prefill), so even though there are more steps, total time stays competitive. MiniMax achieved ~2.2× throughput, with others around 0.6×–1.1×.
- Sandbox overhead is tiny: ~1.1 GB image; ~50–200 MB RAM per container; thousands of runs don’t require terabytes of per-task images.
Training weaker models (LLM-in-Sandbox-RL):
- Using only general, non-specialized data placed as files, sandbox-RL made weaker models strong in sandbox mode and even improved their normal chat mode. For Qwen3-4B, sandbox-RL bumped scores widely (e.g., Physics +14.8, Instruction-Following +9.0 in sandbox mode) and reduced average turns (~23.7 → ~7.0) while increasing purposeful tool use.
- Surprisingly, sandbox-RL improved plain chat reasoning structure and self-checking too, beating text-only LLM-RL on many tasks. It’s like learning to cook in a real kitchen makes your recipe-writing better.
Cross-domain generalization:
- Training on math or SWE data helped other areas somewhat, but training on general context-as-files worked best overall, showing the exploration skill itself is what transfers.
Surprising findings:
- Training inside the sandbox improved both sandbox and non-sandbox modes. Agentic skills (explore, verify, structure) transfer back to plain text generation.
- File-based context placement mattered a lot; putting context in files forced true exploration and led to broader gains than stuffing context into the prompt.
05Discussion & Limitations
Limitations:
- Weaker models need training; without sandbox-RL, they may take many turns without using tools effectively, hurting performance.
- Quality ceilings: Creative outputs (videos, music, posters) are functional but not yet at pro-level artistry.
- Package and network variability: External installs can fail or change over time; sandbox runs may need caching or pinning versions.
- Evaluation noise: Some domains used LLM-as-judge; though careful, such judging can be imperfect.
- Security/policy: While containerized and isolated, real-world deployments must restrict network access and unsafe commands.
Required resources:
- A lightweight Docker image (~1.1 GB) and modest per-container RAM (~50–200 MB).
- An inference backend (vLLM or SGLang) and optional GPUs for speed.
- Controlled network access for safe, reproducible package/data fetches.
- For sandbox-RL: a rollout system with outcome-based rewards and file I/O.
When not to use it:
- Tiny, quick Q&A where a single reply is enough and latency must be minimal.
- Strictly offline or no-network environments where external tools/resources are required.
- Highly regulated tasks needing licensed or non-redistributable software that can’t be installed in the sandbox.
- If the answer must never leave a specific secure enclave without audited tooling.
Open questions:
- What’s the best curriculum to teach tool literacy—start with files, then code, then web?
- How to robustly attribute rewards to the right actions in long multi-turn traces?
- Can we standardize safe, reproducible tool acquisition (hash-pinned packages, mirrors)?
- How to benchmark agentic skill directly (exploration, verification) beyond task accuracy?
- Could we pretrain with synthetic sandbox traces so models are “sandbox-native” from day one?
06Conclusion & Future Work
Three-sentence summary: This paper gives AIs a safe, tiny virtual computer (a sandbox) with three core powers—external access, file management, and code execution—so they can solve many tasks better than plain chat. Strong models immediately benefit across math, science, long documents, and picky instructions; weaker models learn to benefit via a simple outcome-only RL method that uses general data in files. It’s efficient, scalable, and even improves normal chat reasoning, not just sandbox runs.
Main achievement: Showing that a minimal, general-purpose code sandbox plus outcome-only sandbox training elicits broadly general agentic intelligence—improving accuracy, cutting tokens for long texts, and enabling real artifact creation (HTML, PNG, MP4, WAV).
Future directions: Make sandbox-native pretraining standard; build richer but still light environments; add safety and reproducibility layers for tool installs; refine agentic benchmarks that measure exploration and verification; and teach models to autonomously plan tool acquisition. Expect growing cross-modal creation directly from text instructions.
Why remember this: It reframes AI from “text in, text out” to “text in, real computer out,” turning models into dependable digital workers who can search, store, compute, verify, and produce real files—reliably, efficiently, and at scale.
Practical Applications
- •Summarize huge documents by searching and extracting key parts from files, then saving a clean summary to answer.txt.
- •Create validated math and science solutions by running verification scripts and simulations before submitting results.
- •Generate strict-format content (e.g., same-length sentences, no shared words) using code to enforce constraints.
- •Produce real artifacts like interactive maps, posters, videos, or music by installing libraries and running build scripts.
- •Automate data cleaning and analysis: load CSVs, process them with Python, and save charts or reports to /testbed/output.
- •Extend knowledge on demand by installing domain-specific packages for chemistry, biomedicine, or physics tasks.
- •Handle long-context projects cheaply by storing materials as files and using grep/sed/Python to locate relevant info.
- •Improve instruction-following by using programmatic checkers to ensure outputs meet every rule before submission.
- •Teach smaller models to explore effectively via sandbox-RL on general data, upgrading their usefulness without niche datasets.
- •Prototype complex workflows safely in a container with minimal setup, then export the final outputs for real use.