MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics
Key Summary
- MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.
- The team built three high-quality datasets for the MD world: one for general knowledge, one for Q&A, and one for text-to-code examples.
- They trained models in three steps (continued pretraining, supervised fine-tuning, and reinforcement learning with real execution feedback) so the model learns the science, follows instructions, and improves from its own results.
- A new method called MD-GRPO lets the model run the code it writes, score the results, and use those scores to get better next time, even reusing its failures to learn more.
- They also created a multi-agent runtime that automatically generates code, runs it in a safe sandbox, evaluates the outputs, and self-corrects.
- To measure progress fairly, they built MD-EvalBench, the first benchmark focused on LAMMPS knowledge, syntax, and code generation.
- On code generation, the runtime loop boosted MD-Code-8B Exec-Success@3 from 14.23% to 37.95%, a big jump in producing executable scripts.
- MD-Instruct-8B, a small 8B model, outperformed larger open models on some Q&A tasks after domain training, showing that smart specialization can beat size.
- MDAgent2 outperforms earlier agent systems by adding LAMMPS-specific tools and execution-based feedback.
- This work shows how LLMs can safely and efficiently help scientists run real simulations, saving time and reducing errors in labs and industry.
Why This Research Matters
This work helps scientists run real simulations faster and with fewer mistakes, which means discovering better materials sooner. Safer batteries, stronger metals for cars and planes, and cooler-running electronics all start with reliable simulations. By making a small, deployable model that learns from actual runs, labs can keep their data local and still get strong results. The system also lowers the barrier for newcomers, turning complex scripts into guided, automated workflows. In industry, this reduces costly trial-and-error and accelerates product design. In education, it becomes a tireless tutor and lab assistant for students. Overall, MDAgent2 turns AI from a talker into a doer in scientific computing.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're trying to bake a cake, but the recipe is written in a secret chef language. You know what you want (chocolate cake), but you can't read the steps. That's how many scientists feel when they want to run Molecular Dynamics (MD) simulations but must write tricky LAMMPS scripts.
The Concept (Molecular Dynamics):
- What it is: Molecular Dynamics is a way to watch atoms move and interact over time using the rules of physics.
- How it works:
- Put atoms in a virtual box.
- Tell the computer the forces between them (the potential).
- March time forward in tiny steps and update positions and velocities.
- Record things like temperature and energy.
- Why it matters: Without MD, testing materials is slow and expensive; with MD, scientists can try ideas quickly on a computer. Anchor: Like a slow-motion video of marbles bumping on a table, except the marbles are atoms and the rules come from physics.
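To make the time-stepping idea concrete, here is a tiny Python sketch of the loop described above: two atoms joined by a spring, advanced with the velocity Verlet integrator. The spring constant, mass, and timestep are arbitrary illustrative values; real MD engines like LAMMPS use physical unit systems and far richer potentials.

```python
import numpy as np

# Toy illustration of the MD loop: two atoms on a line joined by a harmonic
# "spring" potential, integrated with velocity Verlet. Arbitrary units.
k, r0 = 1.0, 1.0            # spring constant and equilibrium distance (assumed)
mass, dt = 1.0, 0.01        # particle mass and timestep (assumed)

pos = np.array([0.0, 1.2])  # two atoms, slightly stretched apart
vel = np.zeros(2)

def forces(x):
    """Forces from the harmonic bond U = 0.5*k*(r - r0)**2."""
    r = x[1] - x[0]
    f = -k * (r - r0)       # force on atom 1 along +x
    return np.array([-f, f])

f = forces(pos)
for step in range(1000):
    # velocity Verlet: move atoms, recompute forces, then update velocities
    pos += vel * dt + 0.5 * (f / mass) * dt**2
    f_new = forces(pos)
    vel += 0.5 * (f + f_new) / mass * dt
    f = f_new
    if step % 200 == 0:     # "record things like energy" at intervals
        kinetic = 0.5 * mass * np.sum(vel**2)
        print(f"step {step}: separation={pos[1]-pos[0]:.3f}, KE={kinetic:.4f}")
```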
Hook: You know how recipes need exact measurements and steps? MD uses a program called LAMMPS, and it's very picky about the recipe format.
The Concept (LAMMPS scripts):
- What it is: LAMMPS scripts are step-by-step instructions for running MD simulations.
- How it works:
- Define units, atom types, and the box.
- Set the interatomic potential (the force rules).
- Choose how to control temperature/pressure.
- Run the simulation and save outputs.
- Why it matters: A tiny mistake (like a misspelled command or wrong units) can crash the run or make nonsense results. Anchor: It's like programming a robot chef: if you say "beke" instead of "bake," dinner never happens.
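For a feel of what such a "recipe" looks like, here is a minimal illustrative LAMMPS input, wrapped in a Python string so it can be written to a file. The copper lattice, the Cu_u3.eam potential file, and the run length are assumptions for illustration, not taken from the paper.

```python
# A minimal LAMMPS input of the kind described above, kept as a Python string.
# Material, lattice constant, potential file, and run length are illustrative.
lammps_input = """\
units metal                      # define the unit system
atom_style atomic
lattice fcc 3.615                # define the box and atoms
region box block 0 5 0 5 0 5
create_box 1 box
create_atoms 1 box
mass 1 63.55                     # copper

pair_style eam                   # set the interatomic potential (force rules)
pair_coeff 1 1 Cu_u3.eam

velocity all create 300.0 12345  # initial velocities at 300 K
fix 1 all nvt temp 300.0 300.0 0.1   # control temperature

thermo 100                       # record temperature/energy every 100 steps
timestep 0.001
run 1000                         # march time forward
"""

with open("in.copper_nvt", "w") as fh:
    fh.write(lammps_input)
# The script could then be run with, e.g.: lmp -in in.copper_nvt
```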
Hook: Think of a super-avid reader who remembers loads of books and can write helpful answers.
The Concept (Large Language Models):
- What it is: LLMs are computer programs that learn patterns in language to understand and generate text.
- How it works:
- Read tons of text to learn word patterns.
- Given a prompt, predict the next token over and over.
- With fine-tuning, learn to follow instructions or write code.
- Why it matters: LLMs can act like smart assistants, but they need the right knowledge and training for specialized jobs like MD. Anchor: Ask, "What's the capital of France?" and it answers "Paris." Ask, "Write a LAMMPS script," and, if trained well, it writes one.
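A minimal sketch of "predict the next token over and over," using the Hugging Face transformers library. The small gpt2 model here is just a stand-in for whichever base model is actually used.

```python
# Greedy next-token generation, one token at a time. The model name is a
# placeholder; any causal language model would illustrate the same loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "In molecular dynamics, the velocity Verlet integrator"
ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                          # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(ids).logits           # scores for every possible next token
    next_id = logits[0, -1].argmax()         # greedy choice: most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```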
Hook: If you've ever asked a friend to write a program for your science fair project, you know that clear instructions and practice make a big difference.
The Concept (Code Generation):
- What it is: Code generation means an AI writes computer instructions for you.
- How it works:
- Read the task description.
- Plan the solution.
- Produce code in the target language (here, LAMMPS).
- Optionally run and fix it.
- Why it matters: It speeds up work and reduces human typos, but only if the code actually runs and matches the science. Anchor: Tell an AI "simulate heating a copper nanoparticle," and it outputs a ready-to-run LAMMPS script.
The world before: Scientists relied on experts to handcraft LAMMPS inputs. It could take hours to days per case, and small mistakes (wrong units or missing potential files) would waste entire runs. LLMs began helping with generic coding and Q&A, but they stumbled in MD because domain data was scarce, tools weren't integrated, and there was no way for the AI to "learn from trying": it only generated code once (one-shot) without running it.
The problem: How do we make an AI that both understands MD concepts and writes LAMMPS code that executes correctly and respects physics, while being small enough to run locally?
Failed attempts: Direct prompting (just ask a general LLM) produced code that often didn't run. Agent systems without execution feedback couldn't steadily improve. Big closed models were powerful but too costly or not deployable. And there was no good MD-specific benchmark to measure progress fairly.
The gap: The field needed (1) a clean, MD-focused data pipeline, (2) a training recipe that first absorbs MD knowledge and then practices following instructions and coding, (3) a closed loop where the model runs its own code and learns from the outcomes, and (4) a benchmark to track real improvement.
Real stakes: Faster, safer, and cheaper materials discovery affects everyday life: better batteries, tougher alloys for cars, heat-resistant chips in phones, and cleaner energy tech. An assistant that writes reliable simulation code can save scientists countless hours and reduce costly trial-and-error.
02 Core Idea
Hook: You know how you get better at a video game by playing, losing, seeing what went wrong, and trying again? Imagine a coding helper that does the same: it writes a simulation, runs it, checks the result, and learns from it.
The Concept (The Aha!):
- What it is: Teach an LLM the language of molecular dynamics, then let it test and correct its own LAMMPS code in a loop.
- How it works:
- Fill the modelās brain with MD knowledge (continued pretraining).
- Coach it with examples (supervised fine-tuning).
- Let it write code, run the simulation, score the outcome, and learn (reinforcement learning with execution feedback, MD-GRPO).
- Why it matters: Without this loop, the AI keeps making the same mistakes. With it, the AI steadily improves executability and physical correctness. Anchor: Like practicing piano: learn notes, follow a teacher, then record yourself and listen to improve.
Hook: Imagine building a LEGO city. First you sort the bricks (data), then follow instructions (fine-tuning), then stress-test the bridges you built (RL) and fix what breaks.
The Concept (Three-stage training: CPT → SFT → RL):
- What it is: A training pipeline where the model learns domain facts (CPT), learns to follow tasks (SFT), and then learns from outcomes (RL).
- How it works:
- CPT: Read curated MD texts to master terms, syntax, and workflows.
- SFT: Practice Q&A and code examples to align with how users ask.
- RL (MD-GRPO): Generate code, run LAMMPS, score results, and update the model toward better scripts.
- Why it matters: Skip CPT, and the model speaks broken "MD"; skip SFT, and it ignores instructions; skip RL, and it won't learn from real execution. Anchor: It's like learning a sport: study the rules, drill with a coach, then play matches and adjust from the scoreboard.
Hook: You know how a school test checks both facts and problem-solving? MD needs both knowledge and working code.
The Concept (Two models: MD-Instruct and MD-Code):
- What it is: MD-Instruct answers MD questions; MD-Code writes LAMMPS scripts.
- How it works:
- MD-Instruct: Trained to understand theory and syntax questions.
- MD-Code: Trained to translate tasks into runnable code and improve via RL.
- Why it matters: Splitting roles lets each model specialize and perform better. Anchor: One teammate studies the playbook (MD-Instruct); the other runs the plays on the field (MD-Code).
Hook: Picture a science fair judge who can run your experiment instantly and score it fairly.
The Concept (MD-GRPO, closed-loop RL):
- What it is: A learning loop where generated code is executed and its outcomes provide rewards to improve future generations.
- How it works:
- The model writes code with a structured answer format.
- A sandbox runs LAMMPS and collects logs/outputs.
- A scoring rubric checks syntax, stability, physics, and completeness.
- The reward nudges the model to produce better code next time; failed tries are reused and rephrased so the model learns what went wrong.
- Why it matters: Without real execution, the AI can look right yet still fail when run; with execution feedback, correctness becomes a habit. Anchor: Baking cookies, tasting them, adjusting sugar/time, and getting tastier cookies each batch.
Hook: Races need a finish line. To know who's improving, you need a scoreboard.
The Concept (MD-EvalBench, benchmarking):
- What it is: The first MD-focused test suite for knowledge Q&A, LAMMPS syntax, and code generation quality.
- How it works:
- MD-KnowledgeEval: tests theory.
- LAMMPS-SyntaxEval: tests command understanding.
- LAMMPS-CodeGenEval: tests end-to-end code generation and execution.
- Why it matters: Without a fair test, no one knows what actually got better. Anchor: Like grading math with answer keys and also checking if your calculator program runs without crashing.
Before vs After:
- Before: One-shot prompts, frequent run-time errors, missing potentials, no standard yardstick.
- After: A trained specialist that writes, runs, checks, and fixes code, measured by a focused benchmark.
Why it works (intuition):
- Domain reading (CPT) builds the vocabulary; coaching (SFT) molds behavior; practice with real outcomes (RL) locks in reliability. The model learns patterns that correlate with successful, stable simulations, like matching correct units, potentials, and thermostats to the material and task.
Building blocks:
- Data construction pipeline (sorted, deduped, high-quality MD text; curated Q&A; task-to-code pairs).
- Training stack (CPT → SFT → MD-GRPO).
- Specialized tools (syntax checkers, potential managers, evaluators).
- Multi-agent runtime (generator → runner → evaluator → fixer).
03 Methodology
At a high level: Natural-language task → Data-trained LLMs → Code generation → Sandbox execution → Multi-dimensional evaluation → Self-correction → Final runnable LAMMPS script and report.
Hook: Think of a factory line that turns a simple idea ("bake a cookie") into a boxed, labeled treat.
The Concept (Data construction pipeline):
- What it is: A careful, step-by-step process to build three datasets the model needs to learn MD well.
- How it works:
- MD-Knowledge: Collect papers, manuals, and textbooks; clean and deduplicate with regex, MinHash/LSH, and embeddings; auto-rate quality with an LLM.
- MD-InstructQA: Convert PDFs to Markdown, chunk by structure, build a topic tree, and auto-generate Q&A pairs (with faithful answers) in a consistent schema.
- MD-CodeGen: Create task templates mixing material, goal, and conditions; generate candidate scripts; have experts review and refine; keep only strong examples.
- Why it matters: Without clean, diverse, accurate data, the model memorizes messes and produces unreliable scripts. Anchor: Like washing, sorting, and labeling ingredients before cooking; clean inputs make tasty meals.
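Here is a small sketch of the near-duplicate filtering step mentioned above, using MinHash plus locality-sensitive hashing via the datasketch library. The word-level shingles and the 0.8 similarity threshold are illustrative choices, not the paper's exact settings.

```python
# Near-duplicate filtering with MinHash + LSH (illustrative settings).
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):   # crude word-level shingles
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc1": "LAMMPS is a classical molecular dynamics code ...",
    "doc2": "LAMMPS is a classical molecular dynamics code ...",  # near duplicate
    "doc3": "Embedded-atom method potentials describe metallic bonding ...",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for name, text in docs.items():
    m = minhash_of(text)
    if lsh.query(m):                 # a similar document is already kept
        continue
    lsh.insert(name, m)
    kept.append(name)

print("kept after deduplication:", kept)   # doc2 is dropped as a near duplicate
```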
Training recipe: CPT → SFT → RL (MD-GRPO)
Hook: Learning a language, then practicing with a tutor, then playing in real conversations.
The Concept (Continued Pretraining, CPT):
- What it is: Keep training the base LLM on MD-specific text so it "speaks MD" fluently.
- How it works:
- Mix MD-Knowledge with a bit of general text to avoid forgetting.
- Emphasize MD terms, LAMMPS commands, and typical workflows.
- Build internal representations of units, ensembles, potentials.
- Why it matters: Without CPT, the model stumbles on domain terms and formats. Anchor: After CPT, "units metal" and "eam/alloy" feel as normal to the model as "hello."
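A tiny sketch of the corpus-mixing idea behind CPT: sample mostly MD text with a small slice of general text so the model keeps its everyday language skills. The 90/10 ratio and the toy corpora are assumptions for illustration, not the paper's exact mix.

```python
# Sampling a CPT batch that is mostly MD-domain text (illustrative ratio).
import random

md_corpus = ["pair_style eam/alloy selects an embedded-atom potential ...",
             "fix npt controls both temperature and pressure ..."]
general_corpus = ["The quick brown fox jumps over the lazy dog.",
                  "Paris is the capital of France."]

def sample_cpt_batch(batch_size=8, md_fraction=0.9, seed=0):
    """Draw a pretraining batch that is mostly MD text plus some general text."""
    rng = random.Random(seed)
    return [rng.choice(md_corpus if rng.random() < md_fraction else general_corpus)
            for _ in range(batch_size)]

print(sample_cpt_batch())
```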
Hook: A coach guides you to answer questions and follow instructions clearly.
The Concept (Supervised Fine-Tuning, SFT):
- What it is: Teach the model to follow MD-style instructions and answer precisely.
- How it works:
- Use MD-InstructQA (and some general instruction data) to train concise, faithful answers.
- Seed with a subset of MD-CodeGen so it sees how tasks map to code.
- Shape responses to be structured and executable-friendly.
- Why it matters: Without SFT, the model may ramble, ignore constraints, or skip key steps. Anchor: Given "simulate NPT equilibration at 300 K," the model outlines the right thermostat, barostat, and outputs.
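For intuition, here is roughly what one SFT training record for that NPT example could look like in a common chat format. The field names and the answer text are illustrative assumptions about the MD-InstructQA schema, not the actual dataset.

```python
# One hypothetical instruction-tuning record in a common "messages" format.
import json

sft_example = {
    "messages": [
        {"role": "system",
         "content": "You are an assistant for molecular dynamics and LAMMPS."},
        {"role": "user",
         "content": "Simulate NPT equilibration of bulk copper at 300 K and 1 bar."},
        {"role": "assistant",
         "content": ("Use `units metal`, build an fcc Cu box, set an EAM potential, "
                     "then `fix 1 all npt temp 300 300 0.1 iso 1.0 1.0 1.0` and "
                     "output thermo data every 100 steps.")},
    ]
}

print(json.dumps(sft_example, indent=2))
```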
Hook: Practice locks in skill, especially when you see what worked and what didn't.
The Concept (MD-GRPO, reinforcement learning with execution feedback):
- What it is: A loop where the model writes code, runs it, gets a score, and updates its policy to favor better scripts.
- How it works:
- Format reward: The model must output thinking and a strict JSON answer; this keeps responses structured for tools.
- Correctness reward: A rubric scores eight dimensions (syntax, logic, parameters, completeness, stability, physics, etc.).
- Trajectory recycling: Low-scoring attempts are recorded, the error cause is captured, and the task is rewritten to teach avoidance next time.
- Why it matters: Without rewards tied to real runs, the model can't connect text to working physics. Anchor: Like shooting hoops, checking your stats, and adjusting your aim.
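A compact sketch of the reward shaping described above: several candidate scripts for one task each receive a format reward plus a rubric-based correctness reward, and a GRPO-style advantage is computed relative to the group average. The reward weights, the answer delimiter, and the rubric dimensions are assumptions for illustration, not the paper's exact formulation.

```python
# Group-relative reward shaping in the spirit of GRPO (illustrative weights).
import json
import statistics

def format_reward(response: str) -> float:
    """1.0 if the answer part parses as the required strict JSON, else 0.0."""
    try:
        json.loads(response.split("</think>")[-1])   # assumed answer delimiter
        return 1.0
    except (json.JSONDecodeError, IndexError):
        return 0.0

def correctness_reward(rubric_scores: dict) -> float:
    """Average of per-dimension scores in [0, 1] (syntax, stability, physics, ...)."""
    return sum(rubric_scores.values()) / len(rubric_scores)

def group_relative_advantages(rewards):
    """Advantage of each sample = (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0          # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four candidate scripts sampled for the same task
rewards = [0.3 * format_reward(r) + 0.7 * correctness_reward(s)
           for r, s in [
               ('</think>{"code": "..."}', {"syntax": 1.0, "stability": 0.8, "physics": 0.9}),
               ('no json here',            {"syntax": 0.2, "stability": 0.0, "physics": 0.1}),
               ('</think>{"code": "..."}', {"syntax": 1.0, "stability": 1.0, "physics": 0.7}),
               ('</think>{"code": "..."}', {"syntax": 0.6, "stability": 0.4, "physics": 0.5}),
           ]]
print(group_relative_advantages(rewards))   # positive = better than the group average
```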
Hook: A pit crew gets a race car ready: generator (build), runner (test), evaluator (analyze), fixer (tune).
The Concept (Multi-agent runtime system):
- What it is: MDAgent2-RUNTIME, a deployable system that automates generate → execute → evaluate → correct.
- How it works:
- Code Generator: The Writer LLM drafts code, uses a syntax checker and a potential-file manager (lists, fetches, or recommends EAM files), then revises.
- Code Runner: Executes LAMMPS in a sandboxed Docker for safety and reproducibility; stores logs and dumps.
- Result Evaluator: Scores outputs across stability, temperature/pressure control, and physical soundness; feeds back scores to trigger fixes.
- Why it matters: Without tool help (syntax checks, potentials) and sandbox runs, execution would be fragile and unsafe. Anchor: The system sees "Cu-Ni melting," ensures the right EAM file exists, runs safely, and iterates until results look right.
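A minimal sketch of that generate → execute → evaluate → correct cycle. Here generate_script and evaluate_outputs are placeholders for the Writer LLM and the Result Evaluator, and the Docker image name, retry count, and score threshold are assumptions rather than the system's actual settings.

```python
# Closed-loop runner: write the script, run it in a sandbox, score it, retry.
import subprocess, tempfile, pathlib

def run_in_sandbox(script_text: str, timeout_s: int = 300):
    """Run a LAMMPS input inside a throwaway Docker container; return (code, log)."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "in.lammps").write_text(script_text)
    proc = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{workdir}:/work", "-w", "/work",
         "lammps/lammps", "lmp", "-in", "in.lammps"],        # image name assumed
        capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, proc.stdout + proc.stderr

def closed_loop(task: str, generate_script, evaluate_outputs,
                max_iters: int = 3, threshold: float = 0.8):
    feedback = ""
    for _ in range(max_iters):
        script = generate_script(task, feedback)              # Writer LLM (placeholder)
        code, log = run_in_sandbox(script)
        score = 0.0 if code != 0 else evaluate_outputs(log)   # Evaluator (placeholder)
        if score >= threshold:
            return script, score
        feedback = f"Previous attempt scored {score:.2f}. Log tail:\n{log[-2000:]}"
    return script, score
```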
Evaluation and metrics
Hook: You can't improve what you don't measure.
The Concept (MD-EvalBench and key metrics):
- What it is: A benchmark with three parts and two main metrics for code generation.
- How it works:
- MD-KnowledgeEval (theory) and LAMMPS-SyntaxEval (commands) use multiple question types.
- LAMMPS-CodeGenEval measures executable code quality.
- Metrics: Exec-Success@k = whether at least one of k code variants runs; Code Human Score = expert rating 0ā10.
- Why it matters: Without standard tests, claims are just guesses. Anchor: Getting 37.95% Exec-Success@3 is like finding that one of three recipe tries bakes perfectly.
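The Exec-Success@k metric can be written in a few lines; run_ok below is a placeholder for "this script ran in the sandbox and exited cleanly."

```python
# Exec-Success@k: a task counts as a success if at least one of its k
# generated scripts executes without error. run_ok() is a placeholder.
def exec_success_at_k(attempts_per_task, run_ok, k=3):
    """attempts_per_task: list of lists of candidate scripts (one list per task)."""
    successes = 0
    for candidates in attempts_per_task:
        if any(run_ok(script) for script in candidates[:k]):
            successes += 1
    return 100.0 * successes / len(attempts_per_task)

# e.g., reusing the sandbox helper sketched earlier:
# exec_success_at_k(tasks, run_ok=lambda s: run_in_sandbox(s)[0] == 0, k=3)
```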
Concrete example with data:
- Task: "Simulate melting of a Cu-Ni nanoparticle."
- Steps:
- Generator drafts code with pair_style eam/alloy.
- Potential tool flags missing CuNi.eam; recommends CuNi.eam.alloy; code is fixed.
- Syntax tool verifies validity; Runner executes; Evaluator scores stability and temperature rise.
- If score < threshold, regenerate and iterate until acceptable.
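The potential-file step in that example could look something like this sketch: check a local potentials folder and, if the requested file is missing, suggest the closest available name. The directory layout and the helper name are illustrative assumptions.

```python
# Hypothetical potential-file check: find the requested EAM file locally or
# suggest close matches (e.g., "CuNi.eam" -> "CuNi.eam.alloy" if present).
import difflib
import pathlib

def find_or_suggest_potential(requested: str, potentials_dir: str = "potentials"):
    available = [p.name for p in pathlib.Path(potentials_dir).glob("*.eam*")]
    if requested in available:
        return requested, []
    suggestions = difflib.get_close_matches(requested, available, n=3, cutoff=0.6)
    return None, suggestions

found, alternatives = find_or_suggest_potential("CuNi.eam")
if not found:
    print("Missing potential file; closest available:", alternatives)
```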
Secret sauce:
- Tight coupling between structured outputs, real execution, and a physics-aware scoring rubric, plus tools that automatically catch common LAMMPS pitfalls (syntax, potentials, timeouts).
04 Experiments & Results
The test: The authors assessed two abilities: (1) knowledge and syntax understanding, and (2) end-to-end code generation that actually runs. They used MD-EvalBench, which includes MD-KnowledgeEval (theory), LAMMPS-SyntaxEval (commands), and LAMMPS-CodeGenEval (text-to-code with execution).
The competition: MDAgent2 was compared with Direct Prompting (no tools, no loop), the earlier MDAgent multi-agent system, and strong Qwen3 baselines (open and closed variants). The goal was to see if specialized training plus runtime tools beat general models or simple prompting.
Hook: Think of a spelling bee (knowledge) and a cooking contest (can you make it taste good?).
The Concept (What they measured and why):
- What it is: Exec-Success@k and Code Human Score for code; total and per-type scores for Q&A and syntax.
- How it works:
- Exec-Success@k: Generates up to k candidates; success if at least one runs.
- Code Human Score: Experts rate readability, robustness, and physical correctness from 0 to 10.
- Q&A totals: Aggregated over single/multiple choice, fill-in, and short answer.
- Why it matters: Execution success proves practical usefulness; human scores capture quality beyond "does it run." Anchor: It's like checking if a robot both cooks edible food (executes) and follows a healthy recipe (quality).
Scoreboard with context:
- QA ability: MD-Instruct-8B (domain-trained) reached an overall average of 74.67, beating Qwen-Flash and Qwen3-14B, and approaching Qwen3-32B. Qwen3-Max still topped the charts at 82.49, showing that size helps, but smart specialization closes the gap notably for an 8B model.
- Syntax vs knowledge: MD-Instruct-8B showed especially solid gains in LAMMPS-SyntaxEval (72.45 vs 65.84 for Qwen3-8B), meaning it learned practical command use.
- Code generation: For MD-Code-8B, turning on the MDAgent2-RUNTIME loop improved Exec-Success@3 from 14.23% to 37.95%. The Code Human Score also nudged up from 9.29 to 9.32 (already very high), showing scripts were not only executable but also judged high-quality by experts.
Hook: You know how a coach with the right tools helps an athlete beat last year's record?
The Concept (Why the runtime loop matters):
- What it is: A generate → run → score → fix cycle that uses LAMMPS-specific tools.
- How it works:
- Syntax checks stop early crashes.
- Potential-file tools ensure the correct EAM files are present or recommend alternatives.
- Evaluators measure stability and physics to guide fixes.
- Why it matters: Compared to older agents, this system's toolset catches more real-world problems, lifting execution success. Anchor: It's like having a bike mechanic (tools) at the race: fewer breakdowns, better finish.
Surprising findings:
- Small, domain-tuned models can rival much larger general models in specialized tasks; MD-Instruct-8B beat bigger open baselines on average QA.
- Execution-based RL mainly boosted executability (large jump in Exec-Success@3), while human-perceived code quality was already strong post-SFT and improved slightly.
- LLMs often struggle with choosing potentials; adding potential-specific tools yielded notable reliability gains.
Takeaway numbers in plain words:
- 37.95% Exec-Success@3 is like getting one solid, runnable script out of three tries, over 2.5× better than the 14.23% baseline.
- Code Human Score ~9.3/10 means experts consider the scripts readable and scientifically sensible.
- QA averages in the mid-70s for an 8B model indicate robust understanding of both MD theory and LAMMPS syntax after domain training.
Overall, the method's closed loop and domain-specific tools clearly convert knowledge into working simulations, moving from "pretty text" to "pretty runs."
05 Discussion & Limitations
Limitations:
- Task coverage: The current datasets focus on thermodynamics, fluids, and mechanical properties; more domains (e.g., reactive force fields, complex multiscale workflows) remain to be added.
- Potential function choice: Even with tools, automatic selection of the most physically appropriate potential is challenging and sometimes needs expert review.
- Long runs and rare failures: Short execution checks can't always catch long-horizon instabilities; rare edge cases may slip through.
- Closed-source comparison: The strongest closed models (e.g., Qwen3-Max) still hold an edge; some comparisons depend on access and cost.
Required resources:
- A machine capable of running an 8B LLM and LAMMPS in Docker (GPU helpful but not strictly required for inference).
- Local potential libraries and internet access (optional) for fetching or verifying potentials.
- Disk space for logs/dumps; time budget for multi-iteration runs when self-correction triggers.
When not to use:
- If you cannot run any execution (no sandbox or LAMMPS), you'll miss the biggest reliability gains.
- If tasks require niche, proprietary potentials or workflows the system hasn't seen and can't fetch, manual expert input is safer.
- If strict, audited validation is required (e.g., regulatory submissions), human verification remains essential.
Open questions:
- How best to teach nuanced physical judgment (e.g., potential selection under subtle material conditions) without overfitting?
- Can multimodal feedback (plots, trajectory GIFs) further improve learning and explainability in RL loops?
- What's the right balance between small, deployable models and larger models' raw capability for this domain?
- How transferable is this recipe to other simulators (e.g., GROMACS, OpenFOAM) and to cross-domain pipelines?
In short, MDAgent2 significantly advances robustness and practicality but still benefits from expert oversight for sensitive choices and frontier tasks. Its design (small, specialized models plus an execution-aware loop) offers a strong foundation for future growth.
06 Conclusion & Future Work
Three-sentence summary: MDAgent2 teaches an LLM the language of molecular dynamics and then lets it learn from running its own LAMMPS code, closing the loop from question to execution. A carefully built data pipeline (knowledge, Q&A, code) and a three-stage training recipe (CPT, SFT, MD-GRPO RL) produce two specialized models, MD-Instruct and MD-Code, plus a runtime that auto-generates, runs, evaluates, and fixes scripts. A new benchmark, MD-EvalBench, shows big gains, especially in execution success.
Main achievement: Turning code generation into a closed-loop, execution-informed skill in a small, deployable model, boosting real-world reliability (Exec-Success@3 from 14.23% to 37.95%) while maintaining expert-level code quality.
Future directions: Expand task coverage (e.g., reactive systems), integrate multimodal signals (plots, trajectories) into training and evaluation, and port the recipe to other simulators and scientific domains. Explore smarter potential selection and longer-horizon stability checks.
Why remember this: MDAgent2 shows how to move from "AI that talks science" to "AI that runs science," teaching models not just to write code but to test, score, and improve it, making simulations faster, safer, and more accessible to labs and industry.
Practical Applications
- Auto-generate LAMMPS inputs for common tasks (e.g., NPT equilibration, MSD, melting curves) with built-in syntax and potential checks.
- Use MD-Instruct as a domain Q&A assistant for quick lookups on ensembles, thermostats, or units.
- Run safe, sandboxed trial simulations that detect and fix errors before long production runs.
- Automate parameter sweeps (e.g., temperature ramps) and aggregate results with consistent logging (see the sketch after this list).
- Recommend and manage interatomic potential files, including detecting missing files and suggesting close matches.
- Serve as a classroom lab partner, turning natural-language lab prompts into runnable scripts and explainers.
- Provide a lab helpdesk to diagnose failed jobs by parsing logs and proposing targeted fixes.
- Deploy locally on HPC clusters for private data and repeatable, tool-integrated workflows.
- Extend the closed-loop method to other simulators (e.g., GROMACS for biomolecules, OpenFOAM for fluids).
- Benchmark internal workflows against MD-EvalBench to track improvements over time.
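As an example of the parameter-sweep item above, here is a sketch that stamps one LAMMPS template out at several temperatures and keeps each run in its own folder. The template contents, temperature range, and folder naming are assumptions for illustration.

```python
# Temperature sweep: write one LAMMPS input per temperature (illustrative template).
import pathlib

template = """\
units metal
read_data system.data
pair_style eam/alloy
pair_coeff * * CuNi.eam.alloy Cu Ni
velocity all create {temp} 4928459
fix 1 all nvt temp {temp} {temp} 0.1
thermo 100
timestep 0.001
run 5000
"""

for temp in range(300, 1501, 300):                      # 300 K .. 1500 K ramp
    run_dir = pathlib.Path(f"sweep/T{temp}")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "in.lammps").write_text(template.format(temp=temp))
    # each folder can then be executed and its log aggregated with consistent naming
```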