
QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

Intermediate
Jun Han, Shuo Zhang, Wei Li et al. | 2/6/2026
arXiv

Key Summary

  • QuantaAlpha is a smart, evolving system that helps find trading signals (called alpha factors) even when markets are noisy and keep changing.
  • Instead of fixing one idea and hoping it works, it treats each full research run like a story (a trajectory) and then improves the story by editing weak parts (mutation) or mixing the best parts of different stories (crossover).
  • It uses a team of AI agents and large language models (LLMs) to go from a market idea, to a clear formula, to working code, to backtests you can trust.
  • A special middle language (symbolic operators + AST) keeps the idea, formula, and code in sync so meanings don’t drift and bugs are caught early.
  • It also keeps formulas simple and different from each other to avoid crowded, overly complex signals that fail in live trading.
  • On China’s CSI 300, QuantaAlpha reached an IC of 0.1501 and an annual return of 27.75% with only 7.98% max drawdown, beating strong machine learning and agent baselines.
  • Signals discovered on CSI 300 transferred well to CSI 500 and the S&P 500, earning about 160% and 137% cumulative excess return over four years, showing robustness across market shifts.
  • Diversified planning, targeted mutation, and smart crossover each add value; removing them weakens results, especially mutation.
  • Controls for semantic consistency, simplicity, and non-redundancy all matter; turning them off hurts both prediction and risk.
  • This framework is more controllable, traceable, and trustworthy than earlier agent systems because it reuses validated pieces and records decision lineages.

Why This Research Matters

Financial markets change their moods, so a one-and-done signal can fade fast. QuantaAlpha evolves the entire research process, letting teams keep what works, fix what doesn’t, and reuse proven parts to stay adaptive. Its symbolic middle layer keeps ideas and code in sync, which means fewer silent errors and more trust in results. Simplicity and diversity rules reduce overfitting and crowding, improving real-world robustness. Strong cross-market transfer suggests these signals aren’t just lucky fits to one dataset. Better stability and lower drawdowns help strategies weather regime shifts like those seen in 2023. In short, this framework turns factor discovery into a repeatable, auditable, and more resilient practice.

Detailed Explanation


01 Background & Problem Definition

You know how weather can change quickly, and yesterday’s umbrella advice might not help today? Financial markets are like that—full of surprises, noise, and sudden shifts. People who do quantitative investing try to find tiny, useful patterns (alpha factors) in that chaos. Before this research, many teams used hand-crafted rules, traditional machine learning, or newer AI agents to discover factors. These methods helped, but markets kept changing, and the signals were fragile.

šŸž Top Bread (Hook): Imagine you’re trying to hear a whisper in a loud cafeteria. If you listen the same way every day, you’ll often miss it when the crowd changes. 🄬 The Concept: Non-stationary markets are markets whose behavior changes over time (trend, volatility, who trades, and how). How it works: 1) A pattern works for a while. 2) The market’s style shifts (like big stocks leading one year, small themes the next). 3) The old pattern weakens or flips. Why it matters: If you assume the world stands still, your model breaks when it moves. šŸž Bottom Bread (Anchor): A simple ā€œbuy recent winnersā€ rule can work in steady trends, but may flop when choppy, gap-filled trading takes over.

The problem researchers faced had three big parts. First, fragile controllability: when AI agents adjusted factors based on noisy backtests, the meaning of the original idea could drift, pushing the system toward accidental patterns. Second, limited trustworthiness: many systems “re-generated” factors each round without clearly inheriting and reusing proven reasoning, so it was hard to audit why something worked. Third, constrained exploration: searches often got stuck near initial guesses, producing crowded, lookalike signals.

šŸž Top Bread (Hook): You know how a group project goes off track if each version ignores the last? And how copying your friend’s idea doesn’t help the whole class learn more? 🄬 The Concept: Agent-based frameworks are teams of specialized AI helpers that share work (idea, build, test). How it works: 1) One agent proposes a market hypothesis. 2) Another turns it into a factor formula and code. 3) Another backtests and reports. Why it matters: If the team can’t control changes or remember what truly worked, it drifts and repeats itself. šŸž Bottom Bread (Anchor): One agent suggests ā€œovernight gaps matter,ā€ another writes code wrongly measuring intraday moves, and the tester gives noisy feedback. Now the idea and the code don’t match.

People tried several fixes. They added rules during generation (to reduce complex, copycat factors), or set up multi-agent pipelines that co-optimize factors and models. These helped, but weren’t enough in real, shifting markets. What was missing? A way to improve not just a single factor, but the whole chain of decisions—keeping good parts, repairing bad steps, and building on validated experience in a traceable way.

šŸž Top Bread (Hook): Think of writing an essay. If you only tweak sentences at random, you might lose your main idea. But if you keep your outline, fix weak paragraphs, and borrow strong lines from past essays, your drafts get better fast. 🄬 The Concept: An evolutionary alpha mining framework treats each full research run as a trajectory (a step-by-step story) to evolve. How it works: 1) Start with diverse idea plans. 2) Build factors through a controlled, symbolic middle language. 3) Test them. 4) Mutate only weak steps. 5) Crossover strong segments from different runs. Why it matters: Without evolving trajectories, you keep re-rolling the dice; with it, you steadily inherit and improve what’s proven. šŸž Bottom Bread (Anchor): If one run nailed a great hypothesis and another wrote robust code, crossover can combine them, producing a better child run you can trace and trust.

Real stakes? Better factor discovery isn’t just academic. It can mean steadier retirement funds, fairer risk control in turbulent times, and fewer false signals that cause big losses.

šŸž Top Bread (Hook): You know how a smoke alarm needs to be reliable even if breakfast gets smokier sometimes? 🄬 The Concept: Backtesting evaluation checks if a factor would have worked in the past and under costs. How it works: 1) Measure predictive correlation (IC, Rank IC). 2) Simulate a realistic portfolio (ARR, IR, MDD). 3) Track stability across years and regimes. Why it matters: Without careful testing, you might trust a fluke and face large drawdowns. šŸž Bottom Bread (Anchor): A factor with 28% annual excess return and small drawdowns in multiple markets is more likely to survive live trading than one with a single spike in one year.

02 Core Idea

The ā€œaha!ā€ moment in one sentence: Treat the entire alpha-mining run as a living story (trajectory) that you can edit at the exact weak sentence (mutation) and remix with the best sentences from other good stories (crossover), all while keeping the meaning consistent and the writing clean.

Analogy 1 — Essay drafting: You keep your outline (market idea), fix only the paragraph that’s unclear (mutation), and borrow a great transition from another essay (crossover) to make a stronger draft. Analogy 2 — LEGO building: Instead of rebuilding from scratch, you replace a wobbly piece (mutation) and snap in a sturdy module from another set (crossover), keeping instructions that match the final build. Analogy 3 — Cooking: You tweak the too-salty sauce (mutation) and blend the perfect crust from another recipe (crossover), following a written recipe so the dish matches the original flavor idea.

šŸž Top Bread (Hook): You know how mixing great parts from different projects makes a super project? 🄬 The Concept: Evolutionary alpha mining evolves full trajectories, not just single factors. How it works: 1) Plan many diverse hypotheses. 2) Build factors through a symbolic operator library + AST so the idea, formula, and code stay aligned. 3) Test with standard metrics. 4) Mutate localized weak steps using self-reflection. 5) Crossover high-reward segments from different parents to reuse proven patterns. Why it matters: Without trajectory-level edits, changes wander; with them, improvements are targeted, traceable, and reusable. šŸž Bottom Bread (Anchor): If Run A’s hypothesis template is solid and Run B’s implementation passes all checks, the child run inherits both strengths and tends to score higher.

Before vs After:

  • Before: Agents regenerated factors stochastically from noisy feedback. Meanings drifted, code and ideas misaligned, and searches clustered around the same crowded neighborhoods.
  • After: The system repeatedly edits only broken steps, reuses validated pieces, keeps ideas-code semantics aligned, and diversifies the starting frontier—leading to higher IC, better ARR, and lower MDD.

Why it works (intuition without equations):

  • Localizing fixes reduces overreacting to noise. You don’t throw away the good; you patch the bad.
  • Recombining validated segments increases the chance that multiple good choices co-exist in one trajectory.
  • A symbolic middle layer acts like guardrails, forcing the implementation to match the hypothesis and preventing silent drift.
  • Simplicity and non-redundancy reduce overfitting and crowding, which helps generalize to new markets and new years.

Building blocks (each with a purpose):

  • Diversified Planning Initialization: start with different idea families (price vs volume, short vs long horizon) to cover more ground.
  • Symbolic Factor Construction (operator library + AST): turn hypotheses into clear, checkable formulas before code.
  • Semantic Consistency Verification: ensure hypothesis → formula → code keep the same meaning.
  • Complexity Control: keep formulas simple and parameter counts reasonable.
  • Redundancy Filtering: reduce near-duplicates so the final set isn’t crowded copies.
  • Backtesting & History: test fairly and remember which ideas and repairs worked.
  • Trajectory Mutation: fix the weakest link specifically, not the whole chain.
  • Trajectory Crossover: merge best segments from different winners to compound gains.

šŸž Top Bread (Hook): You know how a good team wins by practicing plays, fixing weak moves, and combining star players’ strengths? 🄬 The Concept: Self-evolution via mutation and crossover turns scattered tries into a learned playbook. How it works: 1) Keep track of what worked. 2) Patch weak steps. 3) Blend strong sub-plays. Why it matters: This steadily raises the floor and the ceiling of performance. šŸž Bottom Bread (Anchor): That’s why QuantaAlpha beat strong baselines and still held up when the market regime flipped in 2023.

03 Methodology

At a high level: Market data + optional seed ideas → Diversified Planning Initialization → Controlled Factor Realization (symbolic operators + AST → verified code) → Backtesting & history logging → Evolution (mutation and crossover) → Final factor pool and strategy.
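The pipeline above can be sketched as a simple driver loop. Everything below is a toy stand-in: the function bodies are stubs and the backtest score is random noise in place of a real IC, so only the control flow mirrors the paper:

```python
import random

random.seed(0)  # deterministic toy run

def plan_hypotheses(n):
    """Diversified Planning Initialization (stub): n distinct idea seeds."""
    return [f"hypothesis_{i}" for i in range(n)]

def realize_factor(hypothesis):
    """Controlled Factor Realization (stub): wrap the idea in a formula."""
    return {"hypothesis": hypothesis, "formula": f"RANK({hypothesis})"}

def backtest(factor):
    """Backtesting (stub): a random score standing in for IC."""
    return random.uniform(-0.05, 0.15)

# One trajectory per hypothesis, each scored once.
trajectories = [{"factor": realize_factor(h)} for h in plan_hypotheses(4)]
for t in trajectories:
    t["score"] = backtest(t["factor"])

# Evolution rounds: "mutate" only the weakest trajectory and re-score it,
# leaving the stronger trajectories untouched.
for _ in range(3):
    weakest = min(trajectories, key=lambda t: t["score"])
    weakest["factor"] = realize_factor(weakest["factor"]["hypothesis"] + "_v2")
    weakest["score"] = backtest(weakest["factor"])

best = max(trajectories, key=lambda t: t["score"])
```

The real system replaces each stub with an LLM agent plus verification gates, and adds crossover between top trajectories; this skeleton only shows where those pieces plug in.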

Step 1: Diversified Planning Initialization

  • What happens: The planner agent proposes many complementary hypotheses: different sources (price vs volume), horizons (1-day vs 20-day), and mechanisms (momentum vs mean reversion vs regime-gated).
  • Why it exists: If all starts look alike, you get crowded search and miss better regions.
  • Example: One hypothesis says “overnight gaps signal new information,” another says “mean-revert only when volatility is unusually high compared to volume volatility.”

šŸž Top Bread (Hook): You know how starting with many puzzle corners helps you finish faster? 🄬 The Concept: Hypothesis Generation creates actionable, diverse market ideas. How it works: 1) Read market context and prior results. 2) Map theory into concrete operators and parameters. 3) Output multiple distinct plans. Why it matters: Without diversity, you get stuck near one idea and overfit. šŸž Bottom Bread (Anchor): Plan A looks at auction gaps; Plan B looks at trend quality; Plan C mixes liquidity absorption with price impact.

Step 2: Controllable Factor Realization (symbolic middle layer)

  • What happens: The factor agent maps the hypothesis into a symbolic expression using a standard operator library (like TS_MEAN, RANK, SMA) and parses it into an AST. Then it compiles to code. If compilation fails, it repairs the code while preserving the symbolic meaning.
  • Why it exists: Direct code-from-text can drift or break. A symbolic checkpoint makes intent explicit and testable.
  • Example data: Suppose close=10 → 11 (intraday +10%), volume spikes 30%, and recent 5-day volatility is low; the factor might score higher for “orderly continuation.”

šŸž Top Bread (Hook): Think of building from a blueprint so the house matches the design. 🄬 The Concept: Factor Construction turns ideas into symbolic formulas and then into code. How it works: 1) Translate idea to discrete operators and parameters. 2) Build AST from these blocks. 3) Compile to code and auto-repair if needed. Why it matters: Without the blueprint, code can mismatch the idea and silently fail. šŸž Bottom Bread (Anchor): The hypothesis says ā€œrank 20-day price-volume correlation times 5-day average intraday return,ā€ and the AST ensures the code does exactly that.

Step 3: Semantic Consistency, Complexity, and Redundancy Gates

  • What happens: A verifier checks that hypothesis, symbolic formula, and code mean the same thing. Then complexity limits keep formulas short and parameters few. Redundancy filters remove near-duplicate structures using AST similarity.
  • Why it exists: These gates prevent drift, overfitting, and crowding.
  • Example: A too-long chain of nested conditions is rejected; a near-copy of an existing factor is rewritten.

šŸž Top Bread (Hook): You know how telling the same story in your own words, drawings, and a performance should keep the same plot? 🄬 The Concept: Semantic Consistency keeps idea → formula → code aligned. How it works: 1) LLM verifier checks alignment. 2) If mismatch, regenerate only the inconsistent part. Why it matters: If meaning drifts, backtest feedback trains the wrong thing. šŸž Bottom Bread (Anchor): If the idea says ā€œovernight gap,ā€ but code uses ā€œintraday high-low,ā€ the verifier flags and fixes it.

šŸž Top Bread (Hook): Packing for a trip, you keep only what you’ll use. 🄬 The Concept: Complexity Control keeps factors simple. How it works: 1) Limit symbolic length. 2) Limit free parameters. 3) Penalize too many base features. Why it matters: Overly complex formulas overfit and break in new regimes. šŸž Bottom Bread (Anchor): A trimmed linear combination outperforms a tangled nest of conditions in later years.

šŸž Top Bread (Hook): If your closet has five gray hoodies, do you really need another? 🄬 The Concept: Redundancy Filtering removes near-duplicates. How it works: 1) Compare ASTs. 2) If a new factor is too similar to what you have, rewrite it. Why it matters: Crowding makes signals weak and unstable. šŸž Bottom Bread (Anchor): Dropping a lookalike momentum factor frees room for a distinct overnight-gap factor.

Step 4: Backtesting and History

  • What happens: Each factor gets predictive metrics (IC, Rank IC), strategy metrics (ARR, IR, MDD), and notes on success/failure patterns.
  • Why it exists: You need fair, stable evaluation and a memory of what worked where.
  • Example: A factor that does well in 2016–2022 but collapses in 2023 may need regime conditioning.

šŸž Top Bread (Hook): You practice a sport and keep score to learn what drills worked. 🄬 The Concept: Backtesting Evaluation simulates past performance. How it works: 1) Predict next-day returns. 2) Build a simple, cost-aware portfolio. 3) Track returns and drawdowns over time. Why it matters: Without it, a fancy formula could be a fluke. šŸž Bottom Bread (Anchor): The best QuantaAlpha run shows ICā‰ˆ0.150 and ARRā‰ˆ27.8% with MDDā‰ˆ8%—numbers that beat other strong systems.

Step 5: Evolution — Mutation and Crossover

  • Mutation: Diagnose the weak step (e.g., the gating condition) and rewrite only that piece; keep the rest of the trajectory intact.
  • Crossover: Combine strong segments (like a robust hypothesis template from one parent and a reliable construction pattern from another) into a new child trajectory.
  • Why it exists: Targeted edits avoid noise-driven overhauls; recombination reuses proven building blocks for faster, steadier gains.
  • Example: Replace a brittle drawdown gate with a volatility-ratio gate (mutation), then adopt a verified factor-composition pattern from another run (crossover).

šŸž Top Bread (Hook): Fix the squeaky wheel; don’t junk the whole bike. 🄬 The Concept: Trajectory-level Mutation focuses edits on the weakest link. How it works: 1) Self-reflect to localize failure. 2) Rewrite that step. 3) Regenerate only the needed suffix to keep coherence. Why it matters: Global rewrites amplify noise; local fixes preserve signal. šŸž Bottom Bread (Anchor): Swapping a noisy path-length proxy for a volatility-conditioned range measure improved 2023 stability.

šŸž Top Bread (Hook): Mix the best moves from two game plans into one playbook. 🄬 The Concept: Trajectory-level Crossover merges high-performing segments. How it works: 1) Select top parents. 2) Identify consistently good segments (hypothesis, construction, repairs). 3) Compose a coherent child. Why it matters: It compounds strengths with a verifiable lineage you can audit. šŸž Bottom Bread (Anchor): A child combining ā€œinstitutional momentumā€ from Parent A and ā€œclean trend filterā€ from Parent B outperforms both.

Step 6: Final Factor Pool

  • What happens: Keep top factors by Rank IC, drop highly correlated ones, and cap pool size to maintain diversity.
  • Why it exists: A compact, diverse pool is more robust than a large, crowded library.
  • Example: A 0.7 correlation cutoff and size cap maintain balance between power and stability.
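A greedy version of this pool rule, with the 0.7 correlation cutoff mentioned above, might look like this (a sketch; the paper's selection procedure may differ in detail):

```python
import numpy as np

def build_pool(factor_matrix, rank_ic, max_corr=0.7, max_size=10):
    """Greedy pool: take factors in descending Rank IC order, skipping any
    candidate whose |correlation| with an already-picked factor is too high."""
    order = np.argsort(rank_ic)[::-1]
    pool = []
    for i in order:
        if len(pool) >= max_size:
            break
        if all(abs(np.corrcoef(factor_matrix[i], factor_matrix[j])[0, 1]) < max_corr
               for j in pool):
            pool.append(int(i))
    return pool

# Synthetic factor time series: factor 1 is a near-copy of factor 0,
# factor 2 is independent of both.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
factors = np.stack([
    base,                                 # factor 0: best Rank IC
    base + 0.05 * rng.normal(size=200),   # factor 1: near-duplicate of 0
    rng.normal(size=200),                 # factor 2: distinct signal
])
rank_ic = np.array([0.05, 0.04, 0.03])
pool = build_pool(factors, rank_ic)  # keeps 0 and 2, drops the near-copy 1
```

The near-duplicate has the second-best Rank IC but is excluded anyway, which is the point: a slightly weaker but distinct factor adds more to the pool than a strong clone.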

04 Experiments & Results

The Test: Researchers used CSI 300 for training/validation/testing and then deployed the found factors directly on CSI 500 and S&P 500 with no retuning. They measured predictive power (IC, Rank IC, and their consistency ratios) and strategy performance (ARR, IR, MDD, CR) after costs. This mix checks both skill at forecasting and ability to turn that skill into returns with controlled risk.

The Competition: QuantaAlpha was compared against classical factor libraries (Alpha158/360), strong machine learning and deep models (LightGBM, XGBoost, GRU/LSTM/Transformer/TRA), and the latest LLM agent frameworks (RD-Agent, AlphaAgent) across multiple backbone LLMs, including GPT-5.2.

The Scoreboard (with context):

  • On CSI 300, QuantaAlpha with GPT-5.2 hit IC ≈ 0.1501, ARR ≈ 27.75%, MDD ≈ 7.98%. Think of IC as a measure of how well the signal’s “whispers” line up with future returns: 0.150 is a strong, steady whisper. ARR near 28% with under 8% drawdown is like getting an A+ in returns while keeping risk low—when many others got B’s or C’s.
  • Versus RD-Agent (another strong agent system), QuantaAlpha improved IC by about 0.097 and ARR by about 17.8 percentage points, while cutting drawdown by nearly 6.8 percentage points. That’s like winning by multiple goals and taking fewer hits.
  • Versus AlphaAgent (which already uses generation-time regularization), QuantaAlpha still added roughly 0.054 IC and 12.2 percentage points ARR, and reduced drawdown by about 4.9 points. The edge comes from trajectory-level mutation/crossover and strict semantic/complexity controls.

Surprising and important findings:

  • Cross-market transfer: Factors mined on CSI 300 were deployed zero-shot on CSI 500 and S&P 500, delivering around 160% and 137% cumulative excess return over four years. That’s like training for one playing field and still winning big on two other fields with different turf.
  • 2023 regime shift: Many baselines slumped in 2023 when the market rotated from large-cap “core assets” to small-cap, theme-driven moves with more gaps and noise. QuantaAlpha held up by emphasizing signals tied to overnight information, volatility structure, and trend quality with liquidity support—less sensitive to style shifts than plain trend-following.
  • Ablations show each piece matters: Removing diversified planning mainly hurt strategy stability; dropping mutation caused the biggest fall in predictive power and ARR; removing crossover also hurt, but less than mutation. Turning off semantic, complexity, or redundancy controls each degraded results, and turning off all three hurt the most. In other words, the guardrails and the evolution engine both pull their weight.
  • Efficiency of evolution: IC rose quickly in early rounds and then stayed high. The distribution kept some spread—evidence that the system stayed diverse instead of collapsing to one crowded idea. This suggests better sample efficiency and resilience to noise.

Concrete example from the case study: Starting with short-term reversal ideas, the system tried a volatility-weighted momentum variant that got too complex and didn’t generalize. Mutation then simplified it to a linear additive form with better drawdown control. Crossover later blended in participant-sensitive features (like price-volume correlations tied to institutional activity), further lifting predictability. This evolution mirrors how human quants refine hypotheses while keeping a clean, auditable trail.

05 Discussion & Limitations

Limitations:

  • Dependency on input diversity and feedback quality: If initial hypotheses lack variety or the backtest environment is biased, evolution can explore too narrowly or follow noise.
  • Compute and orchestration: Multi-agent evolution with verification and backtesting needs solid infrastructure and time, especially with large LLMs.
  • Finite novelty: After many rounds, new factors may add redundancy and hurt robustness unless pool rules stay strict.
  • Market shocks: Extreme structural breaks can still overwhelm learned safeguards; regime-aware weighting helps but isn’t magic.

Required resources:

  • Reliable market data with careful preprocessing and realistic cost models.
  • LLM access (e.g., GPT-5.2 or strong open models) plus a verifier setup, compilation toolchain, and backtesting engine (e.g., Qlib).
  • Monitoring and storage for trajectory archives, evaluation histories, and factor pools.

When not to use:

  • Very short-horizon, ultra-high-frequency settings where latency dominates and symbolic factors may lag.
  • Illiquid assets with unreliable data where backtest slippage/cost assumptions can dwarf any signal.
  • Environments demanding guaranteed interpretability without any LLM involvement.

Open questions:

  • Can we make regime-awareness fully adaptive inside the evolution operators (e.g., volatility- or style-conditional mutation/crossover)?
  • How can portfolio construction and risk models co-evolve with factors to close the loop end-to-end?
  • Can we enrich the operator library with microstructure and fundamental features while keeping simplicity and transferability?
  • What’s the optimal stopping rule for evolution to avoid late-round redundancy while capturing most gains?
  • How to quantify and manage crowding risk across multiple firms using similar evolutionary approaches?

06 Conclusion & Future Work

In three sentences: QuantaAlpha treats each end-to-end alpha-mining run as a trajectory and improves it by precisely fixing weak steps (mutation) and intelligently mixing the best validated segments (crossover), all guarded by semantic, simplicity, and non-redundancy checks. This makes exploration wider, refinement steadier, and results more trustworthy, beating strong baselines on CSI 300 and transferring robustly to CSI 500 and S&P 500. The approach shows that evolving the research process itself—rather than only regenerating factors—unlocks sturdier signals in noisy, changing markets.

Main achievement: Establishing trajectory-level self-evolution for alpha discovery, with a symbolic middle layer and strict gates that keep hypothesis, formula, and code meaningfully aligned while avoiding complexity and crowding.

Future directions: Add regime-aware evolution steps, extend to multi-asset/cross-market portfolios, and integrate co-evolution of factors with portfolio/risk engines to optimize end-to-end performance. Richer operator libraries (including fundamentals or microstructure) and better stopping rules could further improve generalization.

Why remember this: It reframes factor mining from random retries to guided evolution—reusing proven ideas, fixing only what’s broken, and keeping everything traceable—so signals survive when the market changes its mind.

Practical Applications

  • Use diversified planning to seed multiple, distinct hypothesis families before any backtesting.
  • Adopt a symbolic operator library and AST pipeline so hypotheses, formulas, and code stay aligned and auditable.
  • Enable a semantic verifier that flags mismatches between intent, formula, and implementation for targeted repair.
  • Set strict complexity limits (symbol length, parameter count, feature count) to reduce overfitting and improve transfer.
  • Deploy AST-based redundancy checks to prevent near-duplicate factors and reduce crowding risk.
  • Run trajectory-level mutation to localize and fix weak steps (e.g., gating or parameter scales) without rewriting everything.
  • Use crossover to recombine validated hypothesis templates and reliable construction patterns from top-performing runs.
  • Maintain a rolling factor pool with correlation caps (e.g., 0.7) and size limits to preserve diversity and robustness.
  • Track regime-aware diagnostics (e.g., volatility ratios, overnight gap behavior) and integrate them into mutation logic.
  • Periodically evaluate zero-shot transfer on out-of-sample markets to monitor generalization and detect alpha decay early.
#Alpha mining #Evolutionary agents #Trajectory optimization #Semantic consistency #Abstract Syntax Tree (AST) #Factor crowding #Information Coefficient (IC) #Backtesting #Non-stationary markets #Mutation and crossover #Agent-based framework #Symbolic factor construction #Complexity control #Redundancy filtering #Cross-market transfer
