
M^4olGen: Multi-Agent, Multi-Stage Molecular Generation under Precise Multi-Property Constraints

Intermediate
Yizhan Li, Florence Cloutier, Sifan Wu et al. · 1/15/2026
arXiv · PDF

Key Summary

  • The paper introduces M^4olGen, a two-stage system that designs new molecules to match exact numbers for several properties (like QED, LogP, MW, HOMO, LUMO) at the same time.
  • Stage I builds a smart “prototype” molecule using a team of helper agents that look up similar examples and suggest small fragment edits with feedback from chemistry tools.
  • Stage II fine-tunes that prototype using reinforcement learning (GRPO) that edits fragments one hop at a time to shrink the numeric error to the targets.
  • They created a huge training resource: ~2.95M molecules labeled with fragments and properties, plus ~1.17M single-edit neighbor pairs for controllable reasoning.
  • On QED/LogP/MW, M^4olGen cuts normalized total error to 0.146, beating strong LLMs (like GPT-4.1) by 42.7% and outperforming top graph-based baselines on most metrics.
  • On tougher HOMO/LUMO targets, multi-hop refinement brings total error down to 0.155, more than 2× better than a strong genetic algorithm under similar budgets.
  • Retrieval-augmented prototyping and fragment-level, multi-hop GRPO optimization both matter; ablations show each piece adds clear improvements, and more hops reduce error monotonically.
  • Outputs stay chemically valid, diverse, and unique, while edits remain controlled to avoid drifting too far from the prototype.
  • Limitations include reliance on fast computed or predicted properties and a small property set; deeper hops also cost more compute with diminishing returns.
  • This approach promises faster, more precise discovery for drugs and materials where multiple numeric constraints must be hit exactly.

Why This Research Matters

When scientists need medicines or materials with very specific behaviors, “close enough” isn’t good enough; they must hit exact property numbers at the same time. M^4olGen turns that hard problem into a guided process: start near the goal using real examples, then make careful, small edits with instant feedback. This saves time and cost by avoiding trial-and-error searches that wander. It also keeps designs realistic and buildable by editing meaningful fragments and checking validity at every step. As a result, teams can move faster from idea to testable candidates across pharma, energy, and electronics. Over time, this precision-first approach could reduce development failures and speed breakthroughs that directly benefit health and technology.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine baking cookies for friends who each want something different: one wants exactly 5 chocolate chips, another wants exactly 2 nuts, and someone else wants exactly 10 sprinkles. You’re not just making tasty cookies—you’re hitting exact numbers, all at once.

🥬 The Concept (Molecular Generation): What it is: Molecular generation is the task of having a computer invent new molecules, like a chef creating brand-new recipes. How it works (simple):

  1. Describe the goal (the “recipe”): what properties the molecule should have.
  2. Let a model propose molecule structures (like trying ingredient combos).
  3. Check if those molecules are valid and good.

Why it matters: Without good molecule generation, scientists have to search huge chemical spaces by hand, which is slow and expensive.

🍞 Anchor: A model might design a molecule that’s not just drug-like but also has exactly the right size and oil/water balance.

🍞 Hook: You know how a bike has to be built so the wheels fit and the chain doesn’t fall off? Molecules also need to “fit” chemical rules.

🥬 The Concept (Chemical Validity Checks): What it is: Validity checks make sure a designed molecule could exist in real life (no broken rules like impossible bonds). How it works:

  1. Parse the molecule string (SMILES) to a structure.
  2. Run rule checks (valence, aromaticity, connectivity) with a chemistry toolkit.
  3. Reject or fix invalid molecules.

Why it matters: Without validity checks, you’d propose “square wheels”—molecules that can’t be made.

🍞 Anchor: If a design gives a nitrogen too many bonds, validity tools flag it so the model edits it.
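
To make this concrete, here is a minimal validity check using RDKit (the toolkit the paper uses for its oracles); the example molecules are illustrative:

```python
from rdkit import Chem

def is_valid(smiles: str) -> bool:
    """Parse a SMILES string and run RDKit's rule checks explicitly."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)  # step 1: parse
    if mol is None:
        return False
    try:
        Chem.SanitizeMol(mol)  # step 2: valence, aromaticity, connectivity
    except Exception:
        return False           # step 3: reject molecules that break the rules
    return True

print(is_valid("c1ccccc1O"))        # phenol -> True
print(is_valid("N(C)(C)(C)(C)C"))   # pentavalent nitrogen -> False
```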

🍞 Hook: Think about choosing a backpack. You care about weight, size, and how waterproof it is. Molecules also have important “traits.”

🥬 The Concept (Physicochemical Properties): What it is: These are measurable traits of a molecule (like QED for drug-likeness, LogP for oiliness, MW for weight, HOMO/LUMO for energy levels). How it works:

  1. Compute properties quickly with chemistry or ML tools.
  2. Compare each property to the target number.
  3. Adjust the molecule to move properties toward targets.

Why it matters: If you miss the numbers, the molecule may not work as needed in the body or a device.

🍞 Anchor: A pill that’s too oily (high LogP) may not dissolve well; too heavy (high MW) may be hard to absorb.
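
A minimal sketch of computing these three properties with RDKit and comparing them to targets (the target values echo the paper's running example; aspirin is an arbitrary test molecule):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def property_profile(smiles: str) -> dict:
    """Compute the three fast properties used in the main benchmark."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "QED": QED.qed(mol),               # drug-likeness, 0..1
        "LogP": Descriptors.MolLogP(mol),  # Crippen octanol/water partition
        "MW": Descriptors.MolWt(mol),      # molecular weight, g/mol
    }

targets = {"QED": 0.75, "LogP": 2.7, "MW": 330.0}
props = property_profile("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
gaps = {k: props[k] - targets[k] for k in targets} # signed distance per property
print(props, gaps)
```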

The world before: Many AI models could push a single property up or down (like “make LogP higher”) but struggled to match several exact numbers at once. They often optimized a score, not precise target values.

The problem: Scientists often need exact multi-property hits, not just “good enough.” For example, a drug candidate may need QED=0.75, LogP=2.7, and MW=330, all together.

Failed attempts: Single-agent RL or graph generators improved some properties but often missed one or more targets. Plain LLMs were expressive but weak at precise numeric reasoning, so they drifted or overfit one property.

The gap: We needed a method that can plan using examples, make small, meaningful structure edits, get fast numeric feedback, and keep tightening until all target numbers line up.

Real stakes: Better hits save years and millions in drug development. In materials, nailing HOMO/LUMO can make solar cells or LEDs far more efficient. Precise control turns guessing into guided making.

02 Core Idea

🍞 Hook: You know how building a Lego model is easier if you first pick a picture that’s close to what you want, then snap a few bricks on or off to perfect it?

🥬 The Concept (Retrieval-Augmented Prototyping): What it is: Start by looking up real molecules close to the targets, then build a “prototype” using tiny part swaps. How it works:

  1. Read the numeric goals (like QED, LogP, MW).
  2. Retrieve similar molecules from a big library.
  3. Propose small fragment edits guided by feedback.
  4. Stop when you’re near the target zone.

Why it matters: Starting close reduces wandering and speeds up getting a good base design.

🍞 Anchor: If the target MW is 330 and LogP is 2.7, retrieval finds examples near those values to copy a good backbone.
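
A toy sketch of property-based retrieval; the two-molecule library, the normalization scales, and the L1 distance are illustrative assumptions, not the paper's exact retrieval scheme:

```python
import numpy as np

# Normalization scales are illustrative assumptions, not the paper's values.
SCALES = np.array([1.0, 5.0, 500.0])  # rough ranges for QED, LogP, MW

def retrieve(targets, library, k=1):
    """Return the k library entries closest to the targets in scaled L1 distance."""
    t = np.asarray(targets) / SCALES
    props = np.array([p for _, p in library]) / SCALES
    dists = np.abs(props - t).sum(axis=1)
    return [library[i] for i in np.argsort(dists)[:k]]

library = [  # (SMILES, (QED, LogP, MW)); property values are approximate
    ("COc1ccc2cc(ccc2c1)C(C)C(=O)O", (0.76, 3.1, 230.3)),  # naproxen
    ("CC(=O)Oc1ccccc1C(=O)O",        (0.55, 1.3, 180.2)),  # aspirin
]
print(retrieve((0.75, 2.7, 330.0), library))  # naproxen is the nearer anchor
```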

🍞 Hook: Picture a school project where a planner collects ideas, a builder assembles parts, and a checker tests if it works. Teamwork beats one person doing everything.

🥬 The Concept (Multi-Agent Reasoning): What it is: Several helper agents each do a simple job—interpret the request, retrieve references, suggest edits, and check properties—then collaborate. How it works:

  1. One agent reads the targets from text.
  2. Another retrieves similar molecules.
  3. A reasoner proposes fragment edits.
  4. A checker computes property errors and validity.

Why it matters: Dividing jobs makes the plan clearer and the edits smarter.

🍞 Anchor: One agent says “MW too low by 20,” another suggests “add CF3 to raise MW and LogP,” and the checker confirms the change.
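
As a tiny illustration of the first role, here is a hypothetical interpreter agent that pulls numeric targets out of a text request; the request format and the regex are assumptions for this sketch, not the paper's parsing code:

```python
import re

def interpret(request: str) -> dict:
    """Toy interpreter agent: extract property=value pairs from text."""
    pattern = r"(QED|LogP|MW)\s*=\s*([-\d.]+)"
    return {name: float(value) for name, value in re.findall(pattern, request)}

print(interpret("Design a molecule with QED=0.75, LogP=2.7 and MW=330"))
# {'QED': 0.75, 'LogP': 2.7, 'MW': 330.0}
```

The retrieval, reasoner, and checker agents would then consume this dictionary, as in the retrieval sketch above and the optimization sketches below.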

🍞 Hook: Think of trimming a haircut one snip at a time. Small, careful changes get you exactly the style you want.

🥬 The Concept (Fragment-Level Optimization): What it is: Improve the molecule by editing small building blocks (fragments) step by step. How it works:

  1. Break the molecule into meaningful fragments.
  2. Choose actions: add, remove, or replace.
  3. After each edit, re-check properties.
  4. Keep changes that reduce total error.

Why it matters: Tiny, local edits give precise control and keep molecules valid.

🍞 Anchor: If LogP is a bit high, swap a phenyl for a pyridine ring to lower oiliness without breaking the whole molecule.
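
A greedy sketch of the anchor's phenyl-for-pyridine swap; the swap table and the string-level replacement are crude stand-ins for the paper's graph-level fragment actions:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

SWAPS = [("c1ccccc1", "c1ccncc1"),   # phenyl -> pyridyl (lowers LogP)
         ("c1ccncc1", "c1ccccc1")]   # pyridyl -> phenyl (raises LogP)

def fragment_hill_climb(smiles, target_logp, steps=3):
    best = smiles
    best_err = abs(Descriptors.MolLogP(Chem.MolFromSmiles(best)) - target_logp)
    for _ in range(steps):
        improved = False
        for old, new in SWAPS:
            cand = best.replace(old, new, 1)
            mol = Chem.MolFromSmiles(cand)
            if mol is None:
                continue                 # re-check validity after each edit
            err = abs(Descriptors.MolLogP(mol) - target_logp)
            if err < best_err:           # keep only error-reducing edits
                best, best_err, improved = cand, err, True
        if not improved:
            break
    return best, best_err

print(fragment_hill_climb("Cc1ccccc1", target_logp=1.0))  # toluene -> picoline
```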

🍞 Hook: Imagine a game where you try several moves, then keep the move that beats the others. You learn what works by comparing within the group.

🥬 The Concept (GRPO – Group Relative Policy Optimization): What it is: A learning method that samples multiple candidate edits, ranks them, and nudges the model to favor the better ones. How it works:

  1. Generate a small group of edited candidates.
  2. Score each using a reward tied to distance to targets and validity.
  3. Rank them; increase the chance of better edits next time.
  4. Repeat across many steps and molecules.

Why it matters: This stabilizes learning and directly optimizes numeric targets without needing perfect labels.

🍞 Anchor: If one edit cuts total error from 0.30 to 0.20 while others don’t, GRPO learns to propose that kind more often.
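
A minimal sketch of GRPO's group-relative signal. The reward here is just negative total error; the paper's full reward also covers validity, format, and non-repetition (see the Stage II details in the Methodology section):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group of candidate edits."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

total_errors = [0.24, 0.19, 0.30]           # three candidate edits
adv = group_relative_advantages([-e for e in total_errors])
print(adv)  # the 0.19 edit gets a positive advantage and is reinforced
```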

Before vs. After: Before, models either wandered or hit one target and missed others. After, retrieval gives a strong starting point and GRPO’s fragment edits tighten every property together.

Why it works: Properties shift in predictable ways when you add or swap chemically meaningful fragments. Fast property calculators give instant feedback, so the optimizer can climb toward exact target numbers.

Building blocks: Retrieval, multi-agent planning, fragment actions (add/remove/replace), fast oracles for properties/validity, and GRPO’s group-wise learning signal.

03 Methodology

High-level recipe: Input (target numbers) → Stage I (retrieve + prototype via multi-agent reasoning) → Stage II (GRPO-trained fragment optimizer, 1–3 hops) → Output (validated molecule with minimized errors).

🍞 Hook: You know how following a map and then fine-tuning your steps helps you reach the exact house number on a street?

🥬 The Concept (Prototype Generation): What it is: Stage I builds a near-miss molecule close to all target numbers. How it works:

  1. Parse the request and extract numbers.
  2. Retrieve similar molecules from a big database with tight tolerances.
  3. Propose small fragment edits guided by property feedback.
  4. Stop when the error to the target is small.

Why it matters: A good prototype makes final tuning faster and safer.

🍞 Anchor: For QED=0.75, LogP=2.7, MW=330, the prototype might land at QED=0.74, LogP=2.95, MW=315—a strong starting point.

🍞 Hook: Think of LEGO sets that break into useful chunks. Swapping a chunk is easier than rebuilding everything.

🥬 The Concept (BRICS Fragments): What it is: A rule-based way to cut molecules into realistic, synthesis-friendly pieces. How it works:

  1. Identify bonds that are commonly made/broken in chemistry.
  2. Split along those bonds to get fragments.
  3. Keep a map of how fragments connect.

Why it matters: Edits stay chemically sensible and valid.

🍞 Anchor: Cutting at amide or ether linkages yields fragments you can add, remove, or replace safely.
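
A short RDKit demo of BRICS decomposition; paracetamol is an arbitrary example molecule:

```python
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol
# BRICS cuts at bonds that common reactions make or break; dummy atoms
# such as [1*] in the output mark where fragments can be reattached.
for frag in sorted(BRICS.BRICSDecompose(mol)):
    print(frag)
```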

🍞 Hook: When playing a video game, tiny nudges on the joystick fine-tune your aim. Molecules can be fine-tuned too.

🥬 The Concept (Distance-to-Target Objective): What it is: A single score that measures how far the molecule is from all target numbers combined. How it works:

  1. Compute per-property errors (e.g., |QED−target|).
  2. Scale/weight errors so units are comparable.
  3. Sum them to a total error.

Why it matters: A single, clear score lets the optimizer know if it’s getting closer overall.

🍞 Anchor: If QED is perfect but MW is off by 25, the score tells you to focus on MW next.
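
A minimal sketch of the objective as a weighted L1 distance; the weights are illustrative scale choices, not the paper's exact normalization, and the property values reuse the prototype example above:

```python
# Weights put QED (0..1), LogP (~0..5), and MW (hundreds) on comparable scales.
WEIGHTS = {"QED": 1.0, "LogP": 1.0 / 5.0, "MW": 1.0 / 500.0}

def total_error(props, targets, weights=WEIGHTS):
    """Sum of per-property absolute errors, rescaled to comparable units."""
    return sum(weights[k] * abs(props[k] - targets[k]) for k in targets)

props   = {"QED": 0.74, "LogP": 2.95, "MW": 315.0}
targets = {"QED": 0.75, "LogP": 2.70, "MW": 330.0}
print(total_error(props, targets))  # 0.01 + 0.05 + 0.03 ≈ 0.09
```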

🍞 Hook: Like asking a calculator for instant answers, we ask chemistry tools for instant property checks.

🥬 The Concept (RDKit Oracles): What it is: Fast computer tools that compute properties and check validity. How it works:

  1. Parse SMILES, ensure valence/aromaticity are okay.
  2. Compute QED, LogP, MW (and use ML for HOMO/LUMO when needed).
  3. Return numbers immediately for feedback.

Why it matters: Fast, reliable feedback makes learning and editing efficient.

🍞 Anchor: After adding CF3 to raise LogP and MW, RDKit confirms by how much.
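
A quick check of the anchor's CF3 intuition with RDKit descriptors; benzene and (trifluoromethyl)benzene are stand-in structures for the sketch:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

for smi in ("c1ccccc1", "FC(F)(F)c1ccccc1"):  # before and after adding CF3
    mol = Chem.MolFromSmiles(smi)
    print(smi, round(Descriptors.MolLogP(mol), 2), round(Descriptors.MolWt(mol), 1))
# Both LogP and MW go up after adding CF3, confirming the edit's direction.
```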

Stage I details (what/why/examples):

  • What happens: Agents parse the request, retrieve close examples, suggest and test fragment edits, and keep the best candidate so far.
  • Why this step exists: Without a good start, later optimization wastes time fixing big gaps.
  • Example: Retrieval finds molecules near LogP 2.7 and MW 330, the reasoner swaps a ring and adds a small polar group, ending near all targets.

🍞 Hook: You know how you might take one step, check your map, then take another? That’s safer than sprinting in a random direction.

🥬 The Concept (Multi-Hop Refinement): What it is: Stage II makes 1–3 small fragment edits, checking after each, to steadily shrink the error. How it works:

  1. At each hop, propose one fragment action (add/remove/replace).
  2. Keep the change only if it lowers total error and remains valid.
  3. Stop after the hop budget is used or targets are met.

Why it matters: Small steps keep control, reduce risk, and allow interpretable progress.

🍞 Anchor: Hop 1 trims MW by swapping a ring; Hop 2 adjusts LogP with a heteroatom; Hop 3 fine-tunes QED with a subtle side chain.
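
A hop-budget sketch of the accept-if-it-helps loop; the random methyl add/trim proposal is a toy stand-in for the learned fragment policy:

```python
import random
from rdkit import Chem
from rdkit.Chem import Descriptors

def propose_edit(smiles):
    """Toy proposal: randomly append or trim an atom to nudge MW."""
    return smiles + "C" if random.random() < 0.5 else smiles[:-1]

def refine(smiles, target_mw, hops=3):
    err = abs(Descriptors.MolWt(Chem.MolFromSmiles(smiles)) - target_mw)
    for _ in range(hops):                       # fixed hop budget
        cand = propose_edit(smiles)
        mol = Chem.MolFromSmiles(cand)
        if mol is None:
            continue                            # invalid edit: skip this hop
        cand_err = abs(Descriptors.MolWt(mol) - target_mw)
        if cand_err < err:                      # accept only if error shrinks
            smiles, err = cand, cand_err
    return smiles, err

random.seed(0)
print(refine("CCCCCC", target_mw=120.0))  # hexane drifts toward MW 120
```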

Stage II details (GRPO training):

  • What happens: The optimizer model generates several candidate edits per hop; a reward ranks them by property closeness, validity, format, and non-repetition; the model learns to favor higher-ranked edits.
  • Why this step exists: Ranking within a small group is stable and sample-efficient, directly pushing toward numeric targets.
  • Example with data: If three edits yield total errors 0.24, 0.19, and 0.30, the 0.19 edit is ranked best; the policy updates to make similar proposals more likely.

The secret sauce:

  • Retrieval anchors edits in-distribution (realistic chemistry).
  • Fragment granularity keeps validity and enables precise adjustments.
  • GRPO’s group-relative signal stabilizes learning toward exact numeric targets.
  • A large neighbor-pair dataset provides deterministic single-step supervision for controllable multi-hop reasoning.

04 Experiments & Results

🍞 Hook: Imagine a spelling bee where not only must you spell words right, but you must also say them at a certain speed and volume—hitting multiple targets at once.

🥬 The Concept (Normalized Total Error): What it is: A single score that fairly combines errors across properties with different scales (like QED in 0–1 vs. MW in hundreds). How it works:

  1. Compute MAE for each property.
  2. Normalize by their ranges/scales so each contributes fairly.
  3. Sum to get one overall number—lower is better.

Why it matters: It lets us quickly compare methods on multi-target accuracy.

🍞 Anchor: Scoring 0.146 is like getting an A+, compared to 0.255 (a solid B) from a strong baseline.
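
A sketch of the metric over a set of trials; the per-property ranges are assumptions to make scales comparable, not the paper's published normalizers:

```python
import numpy as np

RANGES = {"QED": 1.0, "LogP": 10.0, "MW": 500.0}  # assumed normalization ranges

def normalized_total_error(preds, targets):
    """Range-normalized MAE per property, summed into one score."""
    total = 0.0
    for prop, rng in RANGES.items():
        mae = np.mean([abs(p[prop] - t[prop]) for p, t in zip(preds, targets)])
        total += mae / rng
    return total

preds   = [{"QED": 0.74, "LogP": 2.95, "MW": 315.0}]
targets = [{"QED": 0.75, "LogP": 2.70, "MW": 330.0}]
print(normalized_total_error(preds, targets))  # lower is better
```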

The test: The team sampled 100 random target tuples and ran 10 trials per tuple for each method under the same budget, reporting the best-of-10. They measured per-property MAE, normalized total error (NTE), diversity (how varied the set is), and uniqueness (no duplicates).

The competition: Commercial and open LLMs (e.g., GPT-4.1, Claude, Gemini), specialized chemical LLMs, and strong graph-based models (STGG+, Graph GA) competed against M^4olGen.

The scoreboard (QED/LogP/MW):

  • M^4olGen (3-hop, GPT-4o) achieved NTE=0.146, beating GPT-4.1’s 0.255 by 42.7% and outperforming graph baselines on overall score.
  • It reached excellent LogP error (0.284) and very low MW error (9.799), while staying competitive on QED (0.103).
  • Diversity ≈ 0.884 and uniqueness = 1.0 show broad exploration without duplicates.
  • Context: Compared to a strong graph model (STGG+), M^4olGen nearly halves the LogP error and slashes MW error by ~85% while keeping QED solid.

The scoreboard (HOMO/LUMO):

  • Even 1 hop cut total error to 0.540; 2 hops reached 0.227; 3 hops set a new low at 0.155 (HOMO 0.060, LUMO 0.095), more than 2× better than Graph GA-1000.
  • This shows balanced control: both HOMO and LUMO errors shrink together, not just one.

Surprising findings:

  • Error decreases monotonically with hop count (1→2→3), confirming the value of small, controlled edits.
  • Retrieval alone gives meaningful gains; adding the fragment optimizer delivers the biggest jump, especially for MW.
  • After training once, inference is fast—far fewer oracle calls than re-running search-heavy algorithms each time.

Ablations clarify contributions:

  • No retrieval: NTE ≈ 0.307.
  • Add retrieval: NTE drops to 0.265 (better LogP and MW).
  • Add 1/2/3-hop optimizer: NTE 0.187/0.160/0.146—steady improvements, with MW error plunging from ~63 to ~10.

Takeaway: Retrieval-anchored prototypes plus GRPO-driven, fragment-level multi-hop refinement is a winning combo for hitting exact, multi-property numeric targets while keeping outputs valid, diverse, and unique.

05 Discussion & Limitations

Limitations:

  • Property reliance: Many evaluations use fast computed properties (RDKit) or ML-predicted values (for HOMO/LUMO). These are practical and reproducible, but not full physics or wet-lab results. Real-world performance may differ and needs confirmation.
  • Property scope: The study focuses on QED, LogP, MW, and HOMO/LUMO. Important real-world constraints like solubility at specific pH, permeability, metabolism, or synthetic accessibility could be added but were not central here.
  • Compute vs. hops: More hops reduce error but increase runtime with diminishing returns. Choosing hop budgets is a trade-off between precision and speed.
  • Data coverage: Even a 2.95M-molecule corpus can’t cover all chemistries. Very exotic targets or scaffolds may still require exploration beyond the training distribution.

Required resources:

  • Access to a large, annotated molecule set with fragment information (the provided dataset helps).
  • A property/validity toolkit (e.g., RDKit) and, for advanced targets, a reliable ML or physics-based oracle.
  • A mid-sized LLM policy (e.g., ~8B parameters) and a single high-memory GPU (e.g., A100 40GB) for GRPO training.

When not to use:

  • If properties are expensive to evaluate (e.g., long DFT runs per step) and can’t be approximated quickly, training/refinement may be too slow.
  • If constraints are symbolic/synthetic-only (e.g., “must be synthesizable in two steps with a specific route”) without numeric targets or fast proxies, other planners may be better.
  • If exact target numbers are not needed (only broad ranges), simpler conditional generators may suffice.

Open questions:

  • How well does performance transfer to richer, slower oracles (full ADMET panels, quantum-accurate energies) with uncertainty?
  • Can we incorporate synthetic feasibility, patentability, and novelty directly into the reward without harming target accuracy?
  • What are the best curricula for hop scheduling and property weighting to balance convergence speed and precision?
  • How can we extend beyond fragments to incorporate 3D conformations and stereochemistry while keeping edits controllable?

06 Conclusion & Future Work

Three-sentence summary: M^4olGen is a two-stage system that first builds a retrieval-anchored prototype with multi-agent reasoning, then fine-tunes it using GRPO to make precise fragment edits. This tightly controls multiple numeric properties at once (QED, LogP, MW, HOMO, LUMO), achieving lower errors than strong LLM and graph baselines. The method stays valid, diverse, and efficient thanks to fragment-level actions, fast property feedback, and a large neighbor-pair dataset.

Main achievement: Demonstrating that retrieval-anchored prototyping plus GRPO-driven fragment optimization can reliably hit exact multi-property numeric targets, with monotonic improvements from controlled multi-hop edits.

Future directions:

  • Add richer properties and constraints (full ADMET, synthesis routes, patent filters) and couple to slower but more accurate oracles via smart approximations.
  • Integrate 3D structure awareness and stereochemistry into fragment edits for finer control.
  • Explore adaptive hop budgets and uncertainty-aware rewards to handle noisy or conflicting targets.

Why remember this: It turns molecule design from “push a score up” into “hit these exact numbers,” using teamwork (multi-agent retrieval) and careful edits (fragment-level GRPO). That shift—from vague optimization to precise target matching—can accelerate real drug and materials discovery in a measurable, repeatable way.

Practical Applications

  • Design drug-like molecules that match exact QED/LogP/MW targets for better absorption and exposure.
  • Tune organic semiconductor candidates to precise HOMO/LUMO values for improved solar cells or OLEDs.
  • Rapidly generate libraries around a given scaffold with controlled multi-property variations.
  • Hit multiple formulation constraints (e.g., MW and LogP windows) for brain-penetrant CNS drugs.
  • Create starting points that satisfy medicinal chemistry “rules” while precisely adjusting one property at a time.
  • Pre-screen molecules to match device specs (e.g., bandgap proxies from HOMO/LUMO) before expensive simulations.
  • Support human-in-the-loop workflows where chemists request numeric tweaks and get fragment-level edits with rationales.
  • Build property-balanced analogs that stay close to a lead (small hop budgets) for SAR exploration.
  • Automate prototype-to-candidate refinement using GRPO to minimize total error under strict budgets.
  • Benchmark new property oracles by plugging them into the same two-stage optimization loop.
#molecular generation#multi-property optimization#fragment-level editing#retrieval-augmented generation#multi-agent reasoning#GRPO#reinforcement learning for chemistry#BRICS fragments#RDKit#HOMO LUMO control#QED LogP MW#multi-hop refinement#property-conditioned generation#chemistry LLM#neighbor-pair dataset