Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Intermediate
Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel et al. Ā· 12/23/2025
arXiv Ā· PDF

Key Summary

  • The paper shows that big sequence models (like transformers) quietly learn longer goals inside their hidden activations, even though they are trained one step at a time.
  • A small helper network, called a metacontroller, learns to read and gently steer those hidden activations to run multi-step ā€˜mini-plans’ (temporal abstractions).
  • These mini-plans can be turned on for many time steps and then switched off with a learned stop signal, so the model acts in chunks instead of token-by-token.
  • Reinforcement learning is then done internally, directly over these mini-plans rather than raw actions, which shrinks the search space and makes credit assignment much easier.
  • On tough, sparse-reward tasks in grid worlds and MuJoCo Ant, standard RL fine-tuning fails, while internal RL reaches high success rates much faster.
  • Simple linear ā€˜nudges’ in the middle of the model’s residual stream are enough to execute meaningful subgoals like ā€œgo to the blue tile,ā€ and these combine compositionally.
  • A non-causal (future-looking) training phase helps the metacontroller discover clean, reusable action chunks without any subgoal labels.
  • A rate–distortion analysis shows that freezing the pretrained model is key: it preserves subgoal-aligned internal structure that co-training tends to wash out.
  • This approach suggests a general recipe for hierarchical RL in foundation models: discover latent actions by self-supervised learning, then do RL over those actions internally.
  • Beyond robotics, this might help language models plan over longer thoughts by compressing time into fewer, smarter decisions.

Why This Research Matters

Many real problems pay off only at the end—like finishing a robot assembly or solving a multi-step math proof. Exploring tiny steps one by one wastes time and rarely reaches the goal. This work shows that big pretrained models already hide useful long actions inside their activations, and we can discover and control those actions without labels. Doing RL over these longer actions makes learning much faster and more reliable in sparse-reward settings. That can speed up robotics, planning, and even language model reasoning. It’s a practical path toward agents that plan in meaningful chunks rather than jittering through small moves.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how you don’t think about every tiny muscle when you tie your shoes—you just run a short routine called ā€œtie shoes,ā€ then move on? That shortcut saves your brain tons of work.

🄬 Filling (The Actual Concept):

  • What it is: This paper is about getting AI to use longer ā€œroutinesā€ instead of thinking one tiny action at a time, so it can solve hard, long puzzles faster.
  • How it works: Start with a powerful sequence model trained to predict the next action. Peek inside its hidden activity to find patterns that already hint at longer goals. Then add a small helper (a metacontroller) that learns to turn those hidden patterns into reusable multi-step routines and decides when to switch between them. Finally, do reinforcement learning (RL) directly over those routines, not low-level actions.
  • Why it matters: If an AI only explores by changing tiny steps, it almost never stumbles onto the long chain needed for a distant reward (like spelling a 10-letter password right on the first try). Chunked routines make the search much smaller.

šŸž Bottom Bread (Anchor): Imagine a robot in a maze needing to touch colored pads in order. Instead of choosing every single footstep, it picks ā€œgo to blue,ā€ then ā€œgo to red,ā€ and so on—like choosing chapters instead of words.

šŸž Top Bread (Hook): You know how modern chatbots guess the next word one by one? That’s called next-token prediction.

🄬 Filling (The Concept: Autoregressive Model):

  • What it is: An autoregressive model predicts the next piece of a sequence from what came before.
  • How it works: It reads the past tokens, builds a hidden summary, and outputs a probability for the next token.
  • Why it matters: This training teaches the model lots of patterns—and, surprisingly, some longer-term structure shows up inside its hidden layers.

šŸž Bottom Bread (Anchor): Like finishing the sentence ā€œPeanut butter and ā€¦ā€ with ā€œjelly,ā€ because your hidden knowledge says those go together.

šŸž Top Bread (Hook): Think of cleaning your room. You don’t plan each finger move—you use chunks: ā€œpick up clothes,ā€ then ā€œmake bed,ā€ then ā€œtake out trash.ā€

🄬 Filling (The Concept: Hierarchical RL and Temporal Abstractions):

  • What it is: Hierarchical RL breaks big tasks into reusable, longer actions (temporal abstractions), sometimes called options.
  • How it works: A high-level policy chooses which long action to run; a low-level controller executes it for many steps until it stops.
  • Why it matters: Exploring over long actions is way faster than trying every tiny movement.

šŸž Bottom Bread (Anchor): In a game, ā€œget the keyā€ is a better plan than ā€œmove right, right, up, up, left, ā€¦ā€

šŸž Top Bread (Hook): Imagine a treasure that only appears after 100 perfect moves in a row. Random baby steps won’t find it.

🄬 Filling (The Problem: Sparse Rewards with Token-by-Token Exploration):

  • What it is: RL often gives rewards only at the very end, which makes learning from random tiny changes painfully slow.
  • How it works: If success needs a long exact sequence, the odds of guessing it by twitching one token at a time are tiny.
  • Why it matters: Many real tasks (robotics, reasoning) have distant rewards, so we need smarter exploration.

šŸž Bottom Bread (Anchor): Guessing a 10-digit code by changing one digit at a time is hopeless; trying whole code patterns is smarter.

šŸž Top Bread (Hook): What if the long actions we need are already hiding inside the model’s brain?

🄬 Filling (The Gap and Idea):

  • What it is: The paper claims next-token models quietly learn internal signals for longer goals, even without labels.
  • How it works: Use a metacontroller to read hidden activations, turn them into simple internal controllers that run for many steps, and learn when to switch.
  • Why it matters: This unlocks hierarchical RL directly inside foundation models, without hand-made subgoal labels.

šŸž Bottom Bread (Anchor): It’s like discovering your calculator has a hidden ā€œsolve quadraticā€ button; you just had to map the right keys.

šŸž Top Bread (Hook): Why should we care?

🄬 Filling (Real Stakes):

  • What it is: Faster learning with fewer tries on complex tasks.
  • How it works: Act in big, meaningful chunks learned from past behavior, then use RL over those chunks.
  • Why it matters: This can speed up robots learning new chores, agents solving multi-step puzzles, and even language models planning multi-sentence arguments.

šŸž Bottom Bread (Anchor): Think of assembling LEGO: having pre-built sub-assemblies makes building the castle much faster than starting from single studs every time.

02Core Idea

šŸž Top Bread (Hook): Imagine you have a remote that doesn’t just move your robot one step, but can run a whole mini-routine like ā€œwalk to the door,ā€ and you can chain these routines together.

🄬 Filling (The Aha! Moment):

  • What it is (one sentence): Use a small metacontroller to discover and trigger multi-step action chunks hidden inside a pretrained autoregressive model, then do RL over those chunks instead of raw actions.
  • How it works (recipe):
    1. Pretrain a sequence model on next actions from expert demonstrations (no labels for goals).
    2. Freeze it and add a metacontroller that reads/writes the model’s residual stream using a simple linear controller.
    3. Train this metacontroller with a non-causal, self-supervised objective to discover a compact code for ā€œabstract actionsā€ and a switching gate that decides when to change them.
    4. For new tasks, run RL internally over the abstract-action codes and their switches, not over raw tokens.
  • Why it matters: This shrinks the search space (fewer, smarter actions) and shortens the time horizon (each action runs for many steps), making sparse-reward RL actually learn.

šŸž Bottom Bread (Anchor): In a maze, picking ā€œgo to blue padā€ then ā€œgo to red padā€ beats picking 500 tiny footsteps.

šŸž Top Bread (Hook): Three ways to picture it.

🄬 Filling (Multiple Analogies):

  • Analogy 1: Chapters vs. letters. Writing with chapters (abstract actions) is faster than deciding letter by letter (raw actions).
  • Analogy 2: App shortcuts. Tapping a ā€œsend photoā€ shortcut (abstract action) beats opening three apps and six menus (token-by-token steps).
  • Analogy 3: Cookie cutters. Instead of shaping each cookie by hand (raw actions), use cutters (abstract actions) to quickly make perfect shapes.

šŸž Bottom Bread (Anchor): A robot with ā€œgo-to-room,ā€ ā€œpick-up,ā€ ā€œplaceā€ actions learns a cleaning chore much faster than choosing tiny motor torques every millisecond.

šŸž Top Bread (Hook): Where do these abstract actions live inside the model?

🄬 Filling (Residual Stream Control):

  • What it is: The residual stream is the model’s main hidden highway; a linear controller nudges it to aim for a specific subgoal over many steps.
  • How it works: Insert a small, learned linear map mid-layer that slightly shifts hidden activations; the rest of the model turns that into the right low-level moves.
  • Why it matters: A tiny, simple nudge in the middle is enough to steer long behavior—no need to rewrite the whole model.

šŸž Bottom Bread (Anchor): Like tapping a steering nudger in a self-driving system at the right layer so the car naturally takes the freeway exit you want.

šŸž Top Bread (Hook): How does the metacontroller know when to keep going or to switch routines?

🄬 Filling (Switching Gate β):

  • What it is: A learned switch that stays low to continue the current abstract action and spikes high to change to a new one.
  • How it works: At each step, propose a new abstract-action code; blend it with the previous one by a gate between 0 and 1. Low = keep going; high = switch.
  • Why it matters: Without a good stop/start signal, the model would jitter every step and lose the benefit of long chunks.

šŸž Bottom Bread (Anchor): Like holding a song on repeat (keep going) and pressing skip (switch) only when the chorus ends.

šŸž Top Bread (Hook): But how does it learn these codes without any goal labels?

🄬 Filling (Future-Conditioned, Self-Supervised Discovery):

  • What it is: During training, the metacontroller looks at the whole sequence (future too) to learn a compact set of codes that best predicts next actions, with a regularizer that encourages simple, reusable chunks.
  • How it works: It encodes the sequence, proposes a code, uses the linear controller to nudge the residual stream, and is rewarded (self-supervised) when next-action prediction improves; a bottleneck penalty keeps the code small and general.
  • Why it matters: This lets the system find the right ā€œverbsā€ (go-to-blue, go-to-red) on its own.

šŸž Bottom Bread (Anchor): Like studying a full chess game to spot typical opening patterns, then learning to reuse those patterns in new games.

šŸž Top Bread (Hook): What really changes before vs. after?

🄬 Filling (Before vs. After):

  • Before: RL explores by twitching low-level moves—rarely lucky in long, sparse tasks.
  • After: RL chooses between a handful of smart, long actions—success shows up quickly, and learning sticks.
  • Why it works (intuition): Long actions dramatically reduce the number of decisions (shorter horizon) and carry built-in structure (cleaner feedback), so it’s easier to tell which choice deserved the credit.

šŸž Bottom Bread (Anchor): It’s the difference between choosing 5 bus routes vs. choosing 5,000 walking steps to cross a city.

03Methodology

šŸž Top Bread (Hook): Recipe time—let’s bake a smarter agent using action chunks instead of crumbs!

🄬 Filling (High-Level Pipeline):

  • What it is: Input → Pretrain next-action model → Freeze it → Train metacontroller to discover abstract actions → Do internal RL over those actions → Output a hierarchical agent.
  • How it works, step by step:
    1. Collect expert trajectories (observations and actions, no rewards or subgoal labels).
    2. Train an autoregressive model (transformer for grid, SSM for Ant) to predict next action (and optionally next observation).
    3. Freeze that model; insert a simple linear controller at a middle layer (best controllability observed mid-depth).
    4. Train a metacontroller to read the model’s hidden states, propose a compact abstract-action code, and decide when to switch; it’s self-supervised and allowed to peek at the future during training so it can discover clean chunks.
    5. For new tasks with sparse rewards, run RL internally over those abstract-action codes and switch signals.
  • Why it matters: Each stage builds a scaffold so RL doesn’t get lost in tiny-step space.

šŸž Bottom Bread (Anchor): Like learning a language: first learn words (pretrain), then learn phrases (metacontroller), then write essays by picking the right phrases (internal RL).

Step 1. Pretrain next-action predictor

  • What happens: Train on expert demos to predict the next action (and optionally the next observation) from the history.
  • Why this step exists: It gives the model a rich action prior and hidden representations that implicitly encode goals.
  • Example with data: In grid world, inputs are board states and past moves; outputs are one of four directions. In Ant, inputs include joint sensors and positions; outputs are joint torques.

šŸž Sandwich: Autoregressive Model

  • You know how you finish your friend’s sentence because you’ve heard them talk before?
  • The model predicts the next action one step at a time, building a hidden summary of what’s likely. It reads the past, updates a hidden state, and outputs a probability over actions. Without it, the system wouldn’t have the patterns needed to discover longer goals hidden inside.
  • Anchor: Guessing the next chess move after seeing the board and recent moves.
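
A toy behavior-cloning loop for this pretraining step, assuming random stand-in trajectories and a small GRU in place of the paper's transformer/SSM backbones.

```python
# Minimal sketch of Step 1: behavior-clone a next-action predictor from
# (observation, action) trajectories. Data here is a random stand-in, not real demos.
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS, HIDDEN = 16, 4, 64

class NextActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(OBS_DIM, HIDDEN, batch_first=True)  # any causal sequence model
        self.head = nn.Linear(HIDDEN, N_ACTIONS)

    def forward(self, obs_seq):
        h, _ = self.rnn(obs_seq)          # hidden summary of the history so far
        return self.head(h)               # logits for the next action at each step

model = NextActionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                   # toy training loop
    obs = torch.randn(8, 20, OBS_DIM)     # batch of 8 demo trajectories, length 20
    actions = torch.randint(0, N_ACTIONS, (8, 20))  # expert actions (stand-in)
    logits = model(obs)
    loss = loss_fn(logits.reshape(-1, N_ACTIONS), actions.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```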

Step 2. Find evidence of temporal abstractions inside

  • What happens: Use linear probes (read-outs) to see if hidden states linearly predict the current subgoal; use causal interventions (linear controllers) to test if nudging the hidden state makes the agent pursue a chosen subgoal.
  • Why this step exists: To verify the model already encodes longer-term intent we can steer.
  • Example with data: A mid-layer probe increasingly predicts ā€œgo-to-blueā€ as the agent moves; a mid-layer linear nudge makes the model actually head to blue.

šŸž Sandwich: Linear Probing

  • Imagine asking a simple quiz: ā€œCan I guess your goal from your face?ā€
  • A linear probe is a simple classifier attached to a hidden layer that tries to read out a label (like subgoal). It trains quickly and reports how much info is present. Without it, we’d be guessing whether the model knows anything about subgoals.
  • Anchor: Reading a sports fan’s team from the colors they wear—simple, but often right.
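
A minimal probe sketch, assuming you have already collected frozen mid-layer hidden states and (for evaluation only) subgoal labels; the random tensors below are stand-ins for that data.

```python
# Sketch of a linear probe: can a simple linear classifier read the current
# subgoal from frozen mid-layer hidden states? States and labels are stand-ins.
import torch
import torch.nn as nn

HIDDEN, N_SUBGOALS = 64, 4
hidden_states = torch.randn(2000, HIDDEN)               # collected from the frozen model
subgoal_labels = torch.randint(0, N_SUBGOALS, (2000,))  # e.g. "go-to-blue", "go-to-red"

probe = nn.Linear(HIDDEN, N_SUBGOALS)     # the probe itself stays linear
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    loss = loss_fn(probe(hidden_states), subgoal_labels)
    opt.zero_grad(); loss.backward(); opt.step()

acc = (probe(hidden_states).argmax(-1) == subgoal_labels).float().mean().item()
print(f"probe accuracy: {acc:.2f}")  # well above chance would mean subgoal info is linearly decodable
```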

šŸž Sandwich: Causal Intervention via Linear Controller

  • Think of lightly tapping a steering wheel to see if the car changes lanes.
  • We insert a tiny linear map at a mid-layer to add a small bias to the hidden state. The rest of the model turns that into a long behavior like ā€œhead to red.ā€ Without this, we can’t test if internal signals are controllable.
  • Anchor: Turning a radio knob a bit and seeing the station change.

Step 3. Train the metacontroller (self-supervised, non-causal)

  • What happens: Freeze the base model. A metacontroller reads hidden states, proposes a compact abstract-action code z, decodes it into a linear controller, and uses a switch gate β to keep or change the code. During training, it can look at the whole sequence (future included) and is rewarded when its nudges improve next-action prediction; a bottleneck penalty encourages simple, reusable codes.
  • Why this step exists: To discover clean, time-extended action chunks and when to switch, without any labels.
  • Example with data: In Ant, the switch β spikes right when the agent leaves ā€œgo-to-blueā€ and starts ā€œgo-to-red,ā€ even though no subgoal labels were provided.

šŸž Sandwich: Metacontroller

  • You know how a movie director chooses which scene to run and when to cut to the next?
  • The metacontroller reads the model’s hidden state, picks an abstract-action code, turns it into a tiny linear nudge, and decides when to switch. Without a director, scenes (actions) would cut every second or never cut at all.
  • Anchor: A conductor holding a note (keep) and signaling a new phrase (switch) at the right time.

šŸž Sandwich: Future-Conditioned Discovery and Bottleneck

  • Like studying the whole song before deciding where the chorus starts and ends.
  • During training, the metacontroller can peek ahead to learn meaningful chunks, while a regularizer keeps the code space small so the chunks become reusable. Without future peeking and a bottleneck, chunks would be messy and not generalize.
  • Anchor: Marking paragraph breaks only after reading the whole essay makes better sections.
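
One way such a discovery objective could look, sketched with an illustrative Gaussian code posterior and prior; the paper's exact parameterization and weighting may differ.

```python
# Sketch of the discovery objective (Step 3): a non-causal encoder proposes a
# code distribution per step; the loss is next-action prediction error under the
# nudged frozen model plus a KL bottleneck toward a simple prior.
import torch
import torch.distributions as D

def discovery_loss(action_logits, actions, code_mu, code_logstd, kl_weight=0.1):
    """action_logits: (T, A) from the frozen model under the nudge.
    actions: (T,) expert actions.  code_mu / code_logstd: (T, code_dim) posterior."""
    nll = torch.nn.functional.cross_entropy(action_logits, actions)    # predict better
    posterior = D.Normal(code_mu, code_logstd.exp())
    prior = D.Normal(torch.zeros_like(code_mu), torch.ones_like(code_mu))
    kl = D.kl_divergence(posterior, prior).sum(-1).mean()              # keep codes simple
    return nll + kl_weight * kl

# toy shapes, just to show the call
T, A, code_dim = 20, 4, 8
loss = discovery_loss(torch.randn(T, A), torch.randint(0, A, (T,)),
                      torch.randn(T, code_dim), torch.zeros(T, code_dim))
```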

Step 4. Internal RL over abstract actions

  • What happens: For new, sparse-reward tasks, treat the frozen base model plus decoder as the environment. Learn a policy over the abstract code z and discrete switches; keep a code running until switch says change. Rewards backpropagate over far fewer decisions.
  • Why this step exists: It shortens horizons and reduces action-space size, making sparse-reward learning tractable.
  • Example with data: In grid world and Ant, standard RL fine-tuning of the base model makes essentially no progress over a million episodes, while internal RL quickly rises to high success.

šŸž Sandwich: Internal RL

  • Picture a game where, instead of choosing every step, you choose which mini-quest to run next.
  • Internal RL selects among discovered abstract-action codes and when to switch them, while the frozen model executes the details. Without it, RL gets lost in tiny choices and never reaches the treasure.
  • Anchor: Choosing ā€œtake the bus to downtownā€ instead of micromanaging every crosswalk.
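
A rough sketch of RL over discovered chunks using plain REINFORCE, with a placeholder environment and a placeholder library of K codes; the paper's actual RL algorithm and interfaces are not reproduced here.

```python
# Sketch of internal RL (Step 4): a high-level policy picks which discovered chunk
# to run and the frozen backbone executes the low-level steps until the gate fires.
import torch
import torch.nn as nn

OBS_DIM, K = 16, 6                                  # K discovered abstract actions
high_policy = nn.Linear(OBS_DIM, K)                 # chooses which chunk to run next
opt = torch.optim.Adam(high_policy.parameters(), lr=1e-3)

def run_chunk(obs, code_idx):
    """Placeholder: the frozen model executes chunk `code_idx` for many steps.
    Returns the next observation and whether the sparse goal was reached."""
    return torch.randn(OBS_DIM), torch.rand(()) < 0.05

for episode in range(200):
    obs, log_probs, reward = torch.randn(OBS_DIM), [], 0.0
    for decision in range(10):                      # ~10 chunk choices, not 1000 raw steps
        dist = torch.distributions.Categorical(logits=high_policy(obs))
        code_idx = dist.sample()
        log_probs.append(dist.log_prob(code_idx))
        obs, done = run_chunk(obs, code_idx)
        if done:
            reward = 1.0                            # sparse reward only at the end
            break
    loss = -reward * torch.stack(log_probs).sum()   # REINFORCE over chunk choices
    opt.zero_grad(); loss.backward(); opt.step()
```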

Secret Sauce (what’s clever):

  • Mid-layer linear control: surprisingly strong leverage point.
  • Switch gate β: clean, sparse segmentation without explicit penalties.
  • Future-conditioned, self-supervised discovery with a bottleneck: finds reusable, goal-aligned chunks.
  • Freeze-then-control: preserves subgoal-friendly structure that co-training can blur.

04Experiments & Results

šŸž Top Bread (Hook): Let’s see if action chunks really beat baby steps where it counts—on hard games with almost no hints.

🄬 Filling (The Test):

  • What it is: Two families of tasks with hidden hierarchy and sparse rewards.
    • Gridworld PinPad: Visit colored tiles in a required order while avoiding walls; success only if the full sequence is correct.
    • Ant PinPad (MuJoCo): Control a quadruped to step on colored tiles in order; continuous observations and torques make it harder.
  • Why these tests: They need long-horizon planning and compositionality (combine subgoals in new orders). Rewards come only at the end, so random token-by-token exploration is almost hopeless.

šŸž Bottom Bread (Anchor): It’s like a lock that only opens after pressing colors in the right order—one mistake and you get nothing.

šŸž Top Bread (Hook): Who’s competing?

🄬 Filling (The Competition):

  • Baselines:
    • Standard RL finetuning on the pretrained model’s raw actions (adapted GRPO/PPO-style): strong modern baseline for sparse rewards.
    • CompILE: a hierarchical RL method that discovers segments from demos via variational inference (no residual-stream control).
    • Co-training variant: train base model and metacontroller together, instead of freeze-then-control.
    • No-temporal-integration ablation: force switching every time step (β=1) to test if chunking matters.
  • Why they’re here: To check if our wins come from the internal-RL idea, the discovery method, or just any hierarchy trick.

šŸž Bottom Bread (Anchor): Think of it as a race: walkers (token steps), runners without pacing (switch every step), and marathoners who use smart mile-long strides (our method).

šŸž Top Bread (Hook): What’s on the scoreboard?

🄬 Filling (Results with Context):

  • Internal RL wins big:
    • Gridworld: Internal RL rapidly climbs to high success rates (like getting an A+), while standard RL and CompILE hover near zero (a failing grade), even after massive training.
    • Ant: Same story—internal RL succeeds, others basically flatline.
  • Context: Success here means completing long, unseen subgoal orders. Achieving high success in these sparse settings is like sinking a half-court shot repeatedly while others barely hit the rim.
  • Surprising Findings:
    • Mid-layer control is best: linear controllers inserted near the middle layer lead to stronger, more general control than at the top.
    • Discovery is truly unsupervised: The switch β aligns almost perfectly with subgoal changes, even in continuous Ant, without any subgoal labels.
    • Freeze is crucial: Rate–distortion curves show a clear region where subgoal-aligned switching appears only when the base model is frozen; co-training tends to collapse into degenerate switching (one early switch).
    • Temporal integration matters: Forcing β=1 (switch every step) can start okay (explores more) but can’t assign credit well; only real chunking learns steadily.

šŸž Bottom Bread (Anchor): It’s like trying a puzzle: random wiggling (raw RL) never solves it; snapping in the right big pieces (internal RL over chunks) finishes the picture fast.

šŸž Top Bread (Hook): Why does this work so well?

🄬 Filling (Intuition):

  • Shorter horizon: Choosing 10 chunks is far easier than 1,000 atomic steps.
  • Smaller action space: A compact z-code replaces high-dimensional raw actions.
  • Better credit: When a chunk finishes, you can tell whether it helped, so learning signals are clearer.
  • Frozen backbone: Keeps subgoal-friendly structure learned during next-action pretraining intact.

šŸž Bottom Bread (Anchor): Like grading a team by plays (chunks) instead of grading each footstep—much easier to see what worked.

05Discussion & Limitations

šŸž Top Bread (Hook): Even the best tools have a ā€˜do not use like this’ tag.

🄬 Filling (Honest Assessment):

  • Limitations:
    • Needs a good pretrained backbone: If the base model didn’t see related behaviors, the hidden subgoals may be weak or absent.
    • Domain shift: Far-out-of-distribution tasks could require re-discovery or additional SSL passes.
    • Design choices: Mid-layer location, code size, and KL weight need tuning; too strong or too weak bottlenecks hurt.
    • Non-causal training: The metacontroller peeks into the future only during discovery; switching must be causal at test, which could be tricky in ultra-ambiguous settings.
  • Required Resources:
    • Expert-like demos to pretrain the sequence model.
    • Enough compute for SSL discovery and then RL; less than full RL-from-scratch, but not tiny.
  • When NOT to Use:
    • If rewards are dense and short-horizon, plain RL or behavior cloning might be simpler.
    • If you can’t freeze or trust the backbone’s internal representations.
    • If no coherent temporal abstractions exist (fully chaotic tasks), chunking may not help.
  • Open Questions:
    • Scaling to large LLMs and long reasoning tasks: Can we discover ā€œthought-chunksā€ that boost planning?
    • Safety and control: How to constrain internal controllers to avoid harmful behaviors?
    • Interpretability: Can we name and audit discovered chunks at scale, like options with tooltips?
    • Continual learning: Can we alternate SSL and internal RL over many cycles to grow a library of ever-better chunks?

šŸž Bottom Bread (Anchor): Don’t bring a bulldozer to plant a flower—use this when you truly need long, reusable actions and sparse-reward learning.

06Conclusion & Future Work

šŸž Top Bread (Hook): Think of this as teaching an AI to use sentences, not letters, when it acts.

🄬 Filling (Takeaway):

  • 3-Sentence Summary: This paper shows that autoregressive models quietly learn signals for long, meaningful actions inside their hidden states. A metacontroller can discover and trigger these actions with a simple mid-layer nudge and a learned switch, without any subgoal labels. Doing RL over these discovered action chunks—internal RL—solves sparse, hierarchical tasks where standard RL fails.
  • Main Achievement: Turning hidden, unlabeled structure in a pretrained model into a practical library of reusable, temporally-extended actions that make RL efficient.
  • Future Directions: Scale to language reasoning (discover ā€˜thought-chunks’), add safety and interpretability tools for internal control, and iterate SSL↔RL cycles to grow richer hierarchies.
  • Why Remember This: It’s a blueprint for hierarchical RL inside foundation models—discover chunks by self-supervision, then learn by choosing among them—transforming long, frustrating searches into a few strategic moves.

šŸž Bottom Bread (Anchor): From tiptoeing across a map to hopping between waypoints, this work shows how to make AI explore like a planner, not a jittery guesser.

Practical Applications

  • Robot skill libraries: Discover reusable routines like ā€˜pick-and-place’ or ā€˜navigate-to-zone’ from demonstration logs, then learn new tasks by recombining them.
  • Warehouse navigation: Use abstract actions (go-to-aisle, go-to-dock) to plan efficient routes in changing layouts with fewer trial-and-error steps.
  • Household assistants: Learn routines such as ā€˜set table’ or ā€˜load dishwasher’ as internal chunks, enabling faster adaptation to new homes.
  • Industrial inspection: Encode patrol patterns and targeted checks as abstract actions to cover complex facilities with fewer mistakes.
  • Game agents: Build subgoal policies (get key, unlock door, reach exit) that generalize to new levels and layouts.
  • Autonomous driving: Represent maneuvers (merge, exit, pass) as internal actions to improve planning under sparse supervision.
  • LLM reasoning: Discover ā€˜thought-chunks’ (prove lemma, simplify expression) to help solve multi-step problems with fewer attempts.
  • Tutoring systems: Use abstract pedagogical moves (review concept, give hint, test recall) to adapt teaching over longer sessions.
  • Scientific workflows: Encode experimental subroutines (prepare sample, run assay, analyze) as chunks for lab automation.
  • Healthcare triage bots: Use abstract decision flows (assess symptom cluster, request test, recommend care) to handle long consultations.
Tags: hierarchical reinforcement learning Ā· temporal abstractions Ā· autoregressive models Ā· residual stream control Ā· metacontroller Ā· internal RL Ā· variational inference Ā· options framework Ā· linear probe Ā· causal intervention Ā· rate–distortion Ā· sparse rewards Ā· MuJoCo Ant Ā· grid world Ā· foundation models