Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Key Summary
- Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.
- You can turn any hidden state inside a Transformer into a real, samplable probability over words, so each layer effectively has its own policy.
- Across layers, uncertainty (entropy) usually starts high for exploration and ends low for final answers, but Qwen reduces uncertainty gradually while Llama drops it suddenly at the end.
- Measuring how entropy changes inside attention and FFN modules reveals a three-stage pattern in Qwen: Explore → Integrate knowledge → Converge to an answer.
- If you lightly train an internal lower-layer policy first, lower layers learn high-level reasoning features early (feature refinement), making the whole model reason better later.
- The new method, Bottom-up Policy Optimization (BuPO), first optimizes a chosen internal layer policy for a short time, then switches to normal RL on the full model.
- BuPO beats strong RL baselines (like PPO, GRPO, RLOO) on tough math benchmarks such as MATH500, AMC23, AIME24, and AIME25.
- Too much internal-layer training can hurt; a small number of steps works best (moderate bottom-up alignment).
- This approach suggests we can guide reasoning foundations instead of only polishing the final output layer.
- It opens a path to more interpretable and controllable AI reasoning by aligning where and how reasoning grows inside the model.
Why This Research Matters
This work shows we can improve AI reasoning not just by tuning the final output, but by coaching earlier layers where reasoning truly grows. That makes AI more reliable at multi-step tasks like math, coding, and scientific analysis. It provides a transparent window into how different model families (like Qwen vs. Llama) think inside, enabling smarter training choices. By measuring and shaping internal uncertainty, we can get better results with fewer fixes at the end. This also opens doors to safer, more controllable AI because we can guide how and where the model narrows its options. In practice, it means stronger tutoring systems, better problem solvers, and AI that learns in a more human-like, staged way.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine a big sports team where everyone only watches the final scorer. Coaches keep yelling at the last player to fix the score, ignoring the passes, strategies, and teamwork that got the ball there.
🥬 The Concept (The World Before): A few years ago, large language models (LLMs) became great at talking, answering questions, and even solving math problems. To get better, people used reinforcement learning (RL) to reward good answers and discourage bad ones. Almost all RL methods treated the whole LLM like one single decision-maker: they only trained the final output distribution (the last step that picks the next word). How it worked:
- Give a question.
- The LLM produces an answer.
- A reward function judges it.
- RL nudges the model so next time it picks better final tokens. Why it mattered: This improved results but ignored everything happening inside the model’s layers, where the real reasoning grows.
🍞 Anchor: It’s like telling the striker to shoot better without ever training the midfield to pass smarter.
🍞 Hook: You know how building a house needs a strong foundation, sturdy walls, and finally a nice roof? You can’t just polish the roof and expect the house to stand.
🥬 The Concept (The Problem): Researchers realized that optimizing only the final output misses the deep, layered process where the LLM forms and refines ideas. The tricky part was: how do we look inside the model to see what each layer is doing, and how do we train those inner parts directly? How it worked before (failed attempts):
- People peeked at internal states with tools like the logit lens to guess likely tokens mid-layer, but mainly for interpretation, not training.
- Some methods nudged attention patterns, but still treated the policy as one unit.
- Most RL pipelines adjusted the final policy only, assuming inner layers would somehow follow along. Why it failed: Without understanding or training the inner decision-makers, you get a shiny finish on a shaky structure. Models can overfit to superficial patterns, struggle with long chains of reasoning, or collapse when pushed too hard.
🍞 Anchor: It’s like practicing only the last move of a dance routine while ignoring the steps that lead into it—you stumble when the music changes.
🍞 Hook: Imagine a relay race. Each runner (layer) hands the baton (information) to the next. If early runners wander, the last runner can’t save the race.
🥬 The Concept (The Gap): We were missing a way to (1) turn internal hidden states into real, samplable policies and (2) measure how uncertainty evolves as information flows through attention and feed-forward networks (FFNs). We also didn’t know if training these internal policies directly would help the whole model. How it works now:
- Treat every layer’s hidden state as a mini-policy over the vocabulary (a real probability over next tokens).
- Track entropy (uncertainty) layer by layer and inside modules (attention vs. FFN).
- Discover patterns in how models explore and then converge.
- Try training a chosen internal layer policy first, then switch to normal whole-model RL. Why it matters: If internal reasoning emerges bottom-up, we should guide it bottom-up. This can build a sturdier reasoning foundation.
🍞 Anchor: It’s like coaching the first runners to pass cleanly and steadily, so the whole relay team runs smoother and faster.
🍞 Hook: Think of two students solving a puzzle. One narrows down choices step by step (gradual). The other guesses wildly until the very end, then suddenly picks an answer (abrupt).
🥬 The Concept (Real Stakes): The paper finds that Qwen models reduce uncertainty steadily across layers (like thoughtful narrowing), while Llama models keep things loose and collapse uncertainty only in the final few layers (a last-minute jump). This difference matters for training: gradual structures may learn better from bottom-up guidance, while abrupt ones may need different choices of where to train. How it works:
- Measure internal entropy across layers.
- See consistent explore → converge trends overall.
- Spot model-family differences: Qwen shows Explore–Integrate–Converge in FFN; Llama often converges late. Why it matters: Training that respects these inner patterns yields bigger gains on hard reasoning tasks—like real math exams.
🍞 Anchor: On math benchmarks (MATH500, AMC23, AIME24/25), the proposed method that coaches the early steps first leads to better scores than standard RL that only polishes the final answer.
02 Core Idea
🍞 Hook: You know how a choir sounds best when each section (sopranos, altos, tenors, basses) practices their part before singing together? If you only rehearse the final chord, the song falls apart.
🥬 The Concept (The Aha! Moment): Inside an LLM, every layer and module quietly acts like its own mini policy over the next token—and if we train the right internal policy first (bottom-up), the whole model learns to reason better. How it works (3 analogies):
- Orchestra analogy: Each instrument section (layer) can produce music (a token distribution). Practice key sections first, then the full orchestra plays beautifully.
- House analogy: Pour and level the concrete foundation (lower layers), then the walls and roof (higher layers) naturally align.
- Hiking guide analogy: Early trail markers (lower layers) set the path. If they’re clear, the final turn (last layer) becomes obvious. Why it matters: Instead of forcing the last layer to fix everything, we help the earlier layers carry their share of reasoning, making the whole system more stable and accurate.
🍞 Anchor: In experiments, briefly training a lower-layer policy first improved scores across multiple math benchmarks compared to top-down-only RL.
🍞 Hook: Imagine reading a mystery. At first, you consider many suspects (high uncertainty). As you gather clues, you narrow it down (lower uncertainty). Different readers may narrow gradually or all-at-once near the end.
🥬 The Concept (Before vs. After): Before, RL focused on the final output policy only. After, we treat internal hidden states as real policies, track their uncertainty (entropy), and train selected ones first. This reveals that Qwen’s FFNs naturally flow through three stages—Explore → Integrate → Converge—while Llama models often converge abruptly near the end. How it works:
- Turn any layer’s hidden state into a token probability distribution (a policy).
- Measure entropy and entropy change to see exploration vs. convergence.
- Select a layer that still explores (positive entropy change), optimize it briefly, then finish with whole-model RL. Why it matters: This matches how reasoning grows inside the model and avoids overloading the final layer.
🍞 Anchor: Qwen models trained this way gained several points on AIME and AMC, like jumping from a B to an A.
🍞 Hook: Picture a garden: if you water the roots first (lower layers), the whole plant grows healthier, and flowers (answers) bloom more reliably.
🥬 The Concept (Why It Works—intuition, not equations): Transformers pass information through a residual stream, adding each layer’s contribution on top of what came before. If you directly train an internal layer policy, gradients flow only to that layer and below—perfect for strengthening the foundation without scrambling the roof. This causes feature refinement: lower layers start representing high-level reasoning earlier, so higher layers don’t have to scramble at the end. How it works:
- Residual connections let us separate a layer’s part from later contributions.
- We decode that layer’s hidden state into a policy, so it’s trainable.
- A short bout of bottom-up training nudges early features toward the final goal. Why it matters: Done moderately, this creates a stable base for later reasoning; done too long, it overshoots and hurts performance.
🍞 Anchor: In training curves, a little bottom-up alignment raises rewards and exploration early; too much causes perplexity spikes and answer repetition—so moderation wins.
🍞 Hook: Think of learning to bake. If you get the dough right (early layers), the baking step (final layer) is easy.
🥬 The Concept (Building Blocks):
- Internal Layer Policy: Treat each layer’s hidden state as a real, samplable policy.
- Internal Modular Policy: Do the same for attention and FFN outputs to see their unique roles.
- Entropy and Entropy Change: Track how uncertain each internal policy is, and how a module expands or contracts uncertainty.
- Bottom-up Policy Optimization (BuPO): First optimize a chosen internal layer policy for a few steps, then switch to standard whole-model RL. Why it matters: These parts work together to reveal hidden reasoning structure and to train it in the natural order it develops.
🍞 Anchor: Pick the FFN layer that still explores, train it a little, then finish with GRPO on the whole model—simple schedule, big gains.
03 Methodology
At a high level: Input question → Turn internal hidden states into mini-policies → Measure uncertainty patterns → Briefly train a chosen internal layer policy → Switch to standard RL on the full model → Output better answers.
Step 1: Treat hidden states as internal policies. 🍞 Hook: Imagine pausing a video game at any frame and deciding your next move from that freeze-frame. 🥬 The Concept: Any hidden state in a Transformer can be turned into a real probability over the next token (a mini-policy). How it works:
- Run the model and capture a layer’s hidden state.
- Map it into vocabulary space with the model’s output head.
- Normalize to get a proper probability distribution. Why it matters: Now each layer can be sampled, scored, and trained like a policy—not just the final one. 🍞 Anchor: Midway through solving “What is 13×7?”, a middle layer might still consider multiple partial results; we can read that uncertainty directly.
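To make Step 1 concrete, here is a minimal, logit-lens-style sketch in PyTorch. It is not the paper's code: the model name is just a placeholder, and it assumes a Hugging Face Llama/Qwen-style causal LM that exposes `model.model.norm` and `model.lm_head`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # placeholder; any Llama/Qwen-style checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
model.eval()

@torch.no_grad()
def internal_policy(prompt: str, layer: int) -> torch.Tensor:
    """Next-token distribution implied by the residual stream after `layer` blocks."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    h = out.hidden_states[layer][:, -1, :]   # index 0 = embeddings; last position only
    h = model.model.norm(h)                  # final RMSNorm, logit-lens style
    logits = model.lm_head(h)                # map into vocabulary space
    return torch.softmax(logits, dim=-1)     # a real, samplable policy

probs = internal_policy("What is 13*7? Answer:", layer=12)
top = torch.topk(probs[0], k=5)
print([(tok.decode(int(i)), round(float(p), 4)) for i, p in zip(top.indices, top.values)])
```

Run on a mid-depth layer, a probe like this usually shows several plausible continuations rather than one confident pick, which is exactly the uncertainty the later steps measure.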
Step 2: Separate by layer and by module (attention vs. FFN). 🍞 Hook: Think of a kitchen with two stations: one that gathers ingredients (attention) and one that mixes and cooks (FFN). 🥬 The Concept: Internal Layer Policies use a whole layer’s hidden state; Internal Modular Policies use the submodule outputs (attention-only or FFN-only). How it works:
- For a layer, compute its full hidden state policy.
- For attention and FFN, compute their individual policies to see their unique influence.
- Compare them across depths to understand roles at different stages. Why it matters: This shows whether exploration comes mainly from attention, FFN, or both—and at which depths. 🍞 Anchor: In Qwen, lower FFN layers expand possibilities (explore), mid FFN layers stabilize using stored knowledge (integrate), and upper FFN layers narrow to a decision (converge).
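Continuing the previous sketch (reusing its `model` and `tok`), this hedged example reads the attention and FFN (MLP) sub-module outputs of one block with forward hooks and decodes each into its own modular policy. The module names (`self_attn`, `mlp`) follow standard HF Llama/Qwen layouts and are an assumption, not the paper's exact implementation.

```python
import torch

def modular_policies(prompt: str, layer: int) -> dict:
    """Decode the attention and FFN (MLP) outputs of one block into policies."""
    captured = {}

    def grab(name):
        def hook(_module, _inputs, output):
            # Attention modules may return a tuple; the hidden-state tensor is first.
            captured[name] = output[0] if isinstance(output, tuple) else output
        return hook

    block = model.model.layers[layer]
    handles = [block.self_attn.register_forward_hook(grab("attn")),
               block.mlp.register_forward_hook(grab("ffn"))]
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            model(ids)
    finally:
        for h in handles:
            h.remove()

    # Decode each sub-module's contribution (last position) into a policy.
    return {name: torch.softmax(model.lm_head(model.model.norm(t[:, -1, :])), dim=-1)
            for name, t in captured.items()}

pols = modular_policies("What is 13*7? Answer:", layer=12)
print({name: float(p.max()) for name, p in pols.items()})  # peak probability per module
```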
Step 3: Measure entropy and entropy change. 🍞 Hook: You know how guessing a riddle feels uncertain at first (many options), then certain at the end (one answer)? 🥬 The Concept: Entropy measures how spread-out (uncertain) a policy is; entropy change shows if a module expands or shrinks that uncertainty. How it works:
- Compute entropy for a policy (higher = more exploration).
- Compare a module’s input vs. output entropy to get entropy change.
- Map patterns across layers to see how reasoning evolves. Why it matters: This reveals model-family habits: e.g., Qwen’s gradual contraction vs. Llama’s last-minute convergence. 🍞 Anchor: If an FFN’s entropy change is positive at a certain layer, that layer is still exploring; a negative change signals it’s starting to lock in on an answer.
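A minimal sketch of the two quantities, building on `internal_policy` from the Step 1 example: Shannon entropy of a policy, and the entropy change between the policy entering a block and the one it produces (positive means the block is still exploring, negative means it is converging).

```python
import torch

def entropy(p: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Shannon entropy H(p) = -sum_v p(v) log p(v) over the vocabulary axis."""
    return -(p * (p + eps).log()).sum(dim=-1)

def entropy_change(p_in: torch.Tensor, p_out: torch.Tensor) -> torch.Tensor:
    """How much a module expands (+) or contracts (-) uncertainty."""
    return entropy(p_out) - entropy(p_in)

# Example: the change induced by block 12 (policy entering it vs. leaving it).
p_before = internal_policy("What is 13*7? Answer:", layer=11)
p_after = internal_policy("What is 13*7? Answer:", layer=12)
print("H before:", float(entropy(p_before)), "H after:", float(entropy(p_after)))
print("entropy change:", float(entropy_change(p_before, p_after)))
```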
Step 4: Choose a target internal layer for bottom-up training. 🍞 Hook: Like picking the best domino to tip so the whole line falls smoothly. 🥬 The Concept: Select a lower or middle layer that still explores (positive entropy change), often a late-lower FFN layer. How it works:
- Scan layers for where exploration is healthy but not chaotic.
- Prefer FFN layers near the boundary from exploration to integration.
- Fix a small number of internal-training steps (for example, 20–30) to avoid overfitting. Why it matters: This is the sweet spot for shaping foundational reasoning without destabilizing the top layers. 🍞 Anchor: In Qwen3-4B, a layer around the first region boundary (like layer 6) worked well for short, effective alignment.
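The selection rule below is an illustrative heuristic, not the paper's exact recipe: reusing `model`, `tok`, and `entropy` from the earlier sketches, it scans layer-wise entropies on a few prompts and picks the layer where the first run of positive entropy changes ends, roughly the exploration-to-integration boundary.

```python
import torch

@torch.no_grad()
def layerwise_entropies(prompt: str) -> torch.Tensor:
    """Entropy of the internal policy after each block (index 0 = embeddings)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    states = model(ids, output_hidden_states=True).hidden_states
    ents = [float(entropy(torch.softmax(
        model.lm_head(model.model.norm(h[:, -1, :])), dim=-1))) for h in states]
    return torch.tensor(ents)

def pick_target_layer(prompts: list[str]):
    ents = torch.stack([layerwise_entropies(p) for p in prompts]).mean(dim=0)
    deltas = ents[1:] - ents[:-1]   # deltas[l-1] = entropy change produced by block l
    # Illustrative rule: end of the first contiguous run of positive changes,
    # i.e. roughly where exploration starts giving way to integration.
    for l in range(1, len(deltas)):
        if deltas[l - 1] > 0 and deltas[l] <= 0:
            return l, deltas
    return 1, deltas                # fallback if no clear boundary shows up

target_layer, deltas = pick_target_layer(["Compute 48*25.", "What is 13*7?"])
print("candidate internal layer:", target_layer)
```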
Step 5: Phase 1—InterGRPO trains only the chosen internal policy (short and sweet). 🍞 Hook: Imagine a coach who trains the midfield passing drills first, before scrimmaging with the whole team. 🥬 The Concept: InterGRPO is a variant that computes advantages from normal rollouts but updates only the chosen internal layer policy; due to residual connections, gradients flow to that layer and below. How it works:
- Generate responses with the old full policy and compute rewards.
- Form an importance ratio for the chosen internal policy and apply a clipped update (PPO-style stability).
- Update parameters only up to the target layer (and the output head), freezing higher layers. Why it matters: This concentrates learning on foundational reasoning without disturbing the top-level decision-maker yet. 🍞 Anchor: After ~20–30 steps, lower layers start encoding higher-level reasoning features earlier—feature refinement.
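Here is a hedged sketch of the Phase-1 idea, continuing the earlier sketches: freeze the blocks above a chosen target layer, decode that layer's hidden state into a policy, and apply a GRPO-style clipped surrogate with group-normalized advantages. Hyperparameters, prompt/padding masks, and reward handling are simplified placeholders, not the paper's InterGRPO implementation.

```python
import torch

TARGET_LAYER = 6   # e.g. the layer suggested by the entropy-change scan above
CLIP_EPS = 0.2

# Freeze the blocks above the target; embeddings, the blocks feeding
# hidden_states[TARGET_LAYER], the final norm, and lm_head stay trainable.
for i, block in enumerate(model.model.layers):
    block.requires_grad_(i < TARGET_LAYER)

def internal_logprobs(input_ids: torch.Tensor, layer: int) -> torch.Tensor:
    """Per-token log-probs of the internal layer policy over a whole sequence."""
    hidden = model(input_ids, output_hidden_states=True).hidden_states[layer]
    logits = model.lm_head(model.model.norm(hidden))[:, :-1, :]  # predict token t+1
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def intergrpo_loss(input_ids, old_logp, rewards):
    """Clipped surrogate on the internal policy with group-normalized advantages
    (prompt/padding masks and any KL terms omitted for brevity)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # one rollout group
    ratio = (internal_logprobs(input_ids, TARGET_LAYER) - old_logp).exp()
    adv = adv[:, None].expand_as(ratio)
    clipped = ratio.clamp(1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * adv
    return -torch.min(ratio * adv, clipped).mean()
```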
Step 6: Phase 2—Switch to standard RL on the full model (GRPO). 🍞 Hook: After passing practice, you run full-team scrimmage so everything works together. 🥬 The Concept: GRPO optimizes the entire model policy using grouped rollouts and stable clipping, now building on the refined foundation. How it works:
- Resume normal on-policy updates on the full model.
- Keep hyperparameters consistent to isolate the benefit of the bottom-up phase.
- Monitor reward, entropy, and response length to avoid degenerate behaviors. Why it matters: The head start from Phase 1 makes Phase 2 more effective and stable on complex reasoning. 🍞 Anchor: Curves show early entropy boosts (healthy exploration) and higher rewards than training with GRPO alone.
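And a sketch of the full two-phase schedule, again with loudly-labeled placeholders: `sample_group()` stands in for the rollout-plus-reward machinery of an RL stack such as veRL, `num_grpo_steps` for the normal training budget, and the loss reuses `intergrpo_loss` and `CLIP_EPS` from the previous sketch.

```python
import torch

model.train()
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-6)

def full_policy_logprobs(input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token log-probs of the ordinary final-layer policy."""
    logits = model(input_ids).logits[:, :-1, :]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def grpo_loss(input_ids, old_logp, rewards):
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = (full_policy_logprobs(input_ids) - old_logp).exp()
    adv = adv[:, None].expand_as(ratio)
    clipped = ratio.clamp(1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * adv
    return -torch.min(ratio * adv, clipped).mean()

# Phase 1: short InterGRPO phase (~20-30 steps) on the chosen internal policy.
for step in range(25):
    input_ids, old_logp, rewards = sample_group()         # placeholder rollout fn;
    loss = intergrpo_loss(input_ids, old_logp, rewards)   # old_logp from the internal policy
    loss.backward(); optimizer.step(); optimizer.zero_grad()

# Phase 2: unfreeze everything and continue with standard whole-model GRPO.
model.requires_grad_(True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
for step in range(num_grpo_steps):                        # placeholder step budget
    input_ids, old_logp, rewards = sample_group()         # old_logp now from the full policy
    loss = grpo_loss(input_ids, old_logp, rewards)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```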
The Secret Sauce: Feature refinement via the residual stream. 🍞 Hook: Think of sharpening a pencil at the base so every word you write afterward is clearer. 🥬 The Concept: Bottom-up updates push lower layers to carry more of the reasoning load earlier, making upper layers’ job simpler and steadier. How it works:
- The residual stream adds each layer’s contribution, so aligning a lower contribution improves everything above it.
- A short alignment avoids overfitting and preserves diversity.
- The result is better Pass@K on tough math tasks. Why it matters: It’s targeted, interpretable, and practically effective. 🍞 Anchor: With Qwen3-8B, BuPO led to higher average Pass@K scores than GRPO across K from 1 to 256.
Tiny walkthrough with data:
- Question: “Compute 48×25.”
- Early attention expands options (try breaking numbers, recall mental math tricks). Entropy is high.
- Lower FFN explores candidate decompositions (like 50×24, or 25×40 + 25×8). Entropy expands or holds.
- Middle FFN integrates known facts (25×4 = 100) and narrows options. Entropy stabilizes.
- Upper FFN converges on 48×25 = 1200. Entropy drops.
- BuPO briefly trains the layer that’s still exploring so it picks the right decomposition earlier, helping the final answer lock in more reliably.
04 Experiments & Results
🍞 Hook: Imagine four tough math tournaments. You don’t just want one lucky win—you want consistent top finishes across all of them.
🥬 The Concept (The Test): The authors measured how well models solved hard math problems where one wrong step can ruin the final answer. How it works:
- Benchmarks: MATH500, AMC23, AIME24, AIME25—known for challenging multi-step reasoning.
- Metrics: Avg@K (average Pass@1 across multiple samples per problem) and Pass@K (chance at least one of K samples is correct). Larger is better.
- Competitors: PPO, GRPO, Reinforce++, RLOO—strong RL baselines. Why it matters: If BuPO really strengthens internal reasoning, it should shine on problems that demand careful chains of thought.
🍞 Anchor: Think of Avg@K as your average tournament grade and Pass@K as the chance you get at least one perfect performance in K tries.
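To pin down the scoreboard, here is a small sketch of both metrics as they are commonly computed (the paper's exact conventions may differ slightly): Avg@K as the mean correctness over K samples, and Pass@K via the standard unbiased estimator from n ≥ K samples per problem.

```python
from math import comb

def pass_at_k(correct: list[bool], k: int) -> float:
    """Unbiased Pass@K from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    n, c = len(correct), sum(correct)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(correct: list[bool], k: int) -> float:
    """Avg@K: mean correctness (Pass@1) over the first K sampled answers."""
    return sum(correct[:k]) / k

verdicts = [True, False, False, True, False, False, False, False]  # toy per-sample checks
print("Avg@8 :", avg_at_k(verdicts, 8))   # 0.25
print("Pass@4:", pass_at_k(verdicts, 4))  # ~0.79: good odds one of 4 tries is right
```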
🍞 Hook: Picture two teams: one practices only final shots (baselines), the other first drills passing (BuPO) and then scrimmages.
🥬 The Concept (The Competition and Scoreboard):
- On Qwen3-4B, BuPO improved AIME24 by about +4.69 points and AIME25 by +2.30 over GRPO, also lifting AMC and MATH500.
- On Qwen3-8B, BuPO added +4.58 points on AIME24 and +0.76 on AIME25 over GRPO.
- On Llama-OctoThinker 3B/8B, BuPO consistently beat baselines, with notable average gains (about +1.01 and +3.68 points respectively), indicating the method works across families.
- Across K from 1 to 256, BuPO generally had the best Pass@K trade-off, especially strong on Qwen3-8B and both Llama models. Why it matters: Gains on AIME and AMC are like moving from a B- to an A on national math contests—real improvements, not noise.
🍞 Anchor: For Qwen3-8B, BuPO dominated GRPO across nearly all K values, showing stronger reliability even when sampling many answers.
🍞 Hook: You know how warming up a bit helps you run better, but running sprints before the race can tire you out?
🥬 The Concept (Surprising Findings): A little bottom-up training goes a long way; too much hurts. How it works:
- With moderate internal steps (like 20–30), rewards rise and entropy exploration looks healthy.
- Training too long at the bottom leads to perplexity spikes and model collapse.
- Optimizing the penultimate layer alone caused overly long, repetitive outputs—confirming that final decision-making is concentrated at the very top. Why it matters: The schedule is key; short, targeted alignment is the sweet spot.
🍞 Anchor: In ablations, increasing the internal phase beyond ~30 steps tanked average scores dramatically—proof that moderation matters.
🍞 Hook: Imagine two student styles again: Qwen narrows steadily; Llama narrows suddenly.
🥬 The Concept (Interpretability-backed Insights): Internal entropy and entropy change reveal that Qwen’s FFN naturally follows Explore → Integrate → Converge. This human-like progression aligns with the success of bottom-up alignment. Llama’s abrupt final-layer convergence can still benefit, but requires careful layer selection. How it works:
- Track module-wise entropy change: attention often expands exploration; FFN follows the three-stage pattern in Qwen.
- Pick a late-lower FFN layer with positive exploration to optimize first.
- Observe feature refinement: lower layers start to mirror useful top-layer features earlier. Why it matters: Understanding these patterns helps choose the right layer and duration for BuPO.
🍞 Anchor: The method didn’t just score better; it also gave a window into how and where reasoning forms inside the model, guiding practical choices.
05 Discussion & Limitations
🍞 Hook: Think of tuning a guitar: a small twist brings sweet harmony; too much snaps the string.
🥬 The Concept (Limitations): BuPO needs moderation. If you train the internal layer for too long, the model can collapse—perplexity rises, answers get repetitive or too long, and rewards drop. Also, most analysis centers on Qwen and Llama families; patterns may differ in other architectures (like models with very different FFN/attention designs), so layer selection may need re-discovery. Finally, BuPO relies on access to internal states and the output head, which some closed systems don’t expose. How it works:
- Watch entropy and response length to avoid drift.
- Use short internal phases (e.g., 20–30 steps) as a safe default.
- Re-scan entropy change when switching to a new model family. Why it matters: The method is powerful but not plug-and-play for every setting; it needs minimal, informed setup.
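As a practical aid, a tiny guardrail sketch for the internal phase; the thresholds are made-up illustrations, not values from the paper: stop early once the short-phase budget is hit, entropy collapses, or responses balloon.

```python
def should_stop_internal_phase(step: int, mean_entropy: float, mean_resp_len: float,
                               max_steps: int = 30, min_entropy: float = 0.05,
                               max_resp_len: int = 4096):
    """Early-stopping guardrails for the bottom-up phase (illustrative thresholds)."""
    if step >= max_steps:
        return True, "short-phase budget reached"
    if mean_entropy < min_entropy:
        return True, "entropy collapsed: over-convergence / loss of exploration"
    if mean_resp_len > max_resp_len:
        return True, "responses ballooning: likely repetition or drift"
    return False, ""

print(should_stop_internal_phase(step=25, mean_entropy=0.4, mean_resp_len=900))
```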
🍞 Anchor: On some models, choosing the wrong layer or training it too long felt like overtightening a string—performance dropped fast.
🍞 Hook: Picture packing for a trip. You need a suitcase (compute), a checklist (metrics), and a map (which layer to pick).
🥬 The Concept (Required Resources):
- Access to internal hidden states and the unembedding (output) mapping.
- An RL stack (e.g., PPO/GRPO-style updates, reward calculators) and enough compute for multiple rollouts.
- Basic tools to compute entropy and track entropy change. Why it matters: This is within reach for most open-source LLM training setups, but not trivial on locked-down APIs.
🍞 Anchor: Frameworks like veRL or similar RL toolkits make it straightforward to add the brief internal phase before standard training.
🍞 Hook: Imagine a pop quiz where the answer is obvious; practicing the whole textbook first is overkill.
🥬 The Concept (When NOT to Use):
- Very short tasks where reasoning depth is tiny.
- Models or APIs that don’t expose internal states.
- Situations where you can’t monitor entropy/length signals (hard to detect over-training). Why it matters: BuPO shines on deeper reasoning; for simple tasks, standard methods may suffice.
🍞 Anchor: For a one-step QA with obvious answers, BuPO’s careful inner coaching won’t add much.
🍞 Hook: Think of a treasure map with some missing pieces—you can see the path, but not the whole island yet.
🥬 The Concept (Open Questions):
- Can we automate which layer to pick, perhaps with a quick entropy scan or a learned selector?
- How does BuPO interact with chain-of-thought prompting or tool use?
- Can we extend internal training to multiple layers in sequence without instability?
- How do different architectures (Mixture-of-Experts, state-space models) express internal entropy dynamics? Why it matters: Answering these will turn BuPO from a powerful trick into a standard part of reasoning model training.
🍞 Anchor: A future “auto-BuPO” could pick layers and steps on the fly, making bottom-up alignment nearly hands-free.
06 Conclusion & Future Work
🍞 Hook: Imagine teaching a team by first strengthening the basics, then perfecting the final play. The game becomes smoother and the wins come more often.
🥬 The Concept (3-Sentence Summary): This paper shows that LLMs secretly contain internal policies at every layer and module, each producing a real next-token distribution. Measuring how uncertainty flows inside reveals that models like Qwen gradually move from exploration to convergence, while others like Llama often converge abruptly at the end. Leveraging this, Bottom-up Policy Optimization (BuPO) briefly aligns a chosen lower-layer policy before whole-model RL, improving hard reasoning performance. How it works:
- Decode internal hidden states into policies and track entropy dynamics.
- Train a carefully selected internal layer policy for a few steps (InterGRPO).
- Switch to standard RL (GRPO) to polish the full model. Why it matters: This bottom-up schedule strengthens foundational reasoning, leading to consistent gains on challenging math benchmarks.
🍞 Anchor: The main achievement is turning inner hidden states into trainable policies and proving that a short, smart bottom-up phase makes the whole model reason better. Next, researchers can automate layer selection, explore multi-layer schedules, and test new architectures. Remember this work because it changes how we think about training: don’t only polish the finish—shape the foundation where reasoning truly begins.
Practical Applications
- Train math-tutoring LLMs with a short BuPO phase to improve step-by-step reasoning robustness.
- Use entropy and entropy change scans to pick which internal layer to align in new model families.
- Deploy monitored training where response length and entropy guardrails stop over-training the internal phase.
- Apply BuPO to code-generation models to stabilize multi-step algorithm synthesis before full RL.
- Guide domain adaptation (e.g., physics word problems) by aligning an exploratory FFN layer first.
- Use internal modular policies to debug: check whether attention or FFN is failing to explore or converge.
- Design layer schedules that align multiple internal layers briefly, one after another, for very deep models.
- Enhance safety alignment by encouraging early layers to filter unsafe continuations before the final layer.
- Improve sample efficiency on reasoning tasks by creating stronger foundations with fewer RL updates.
- Automate layer selection with a quick internal entropy probe before long training runs.