Do Reasoning Models Enhance Embedding Models?
Key Summary
- The paper asks a simple question: if a language model becomes better at step-by-step reasoning (using RLVR), do its text embeddings also get better? The short answer is no.
- Across big benchmarks (MTEB Multilingual, MTEB Code, and BRIGHT), embeddings made from reasoning-tuned backbones perform about the same as those made from the original base models.
- To explain this surprise, the authors create HRSA, a three-level microscope (representation, geometry, function) that shows what really changes inside the model.
- HRSA finds that RLVR mostly keeps the global shape of knowledge the same but rearranges nearby relationships (local geometry) and can slightly rotate the coordinate system with long training.
- When both backbones are later trained as embedding models with the same contrastive recipe, they "snap back" to each other at the global level, a phenomenon the authors call Manifold Realignment.
- Supervised Fine-Tuning (SFT) is very different: it actually reshapes the global map, mixes features more, and can hurt transfer of simple linear readouts.
- This means better "thinking" (RLVR reasoning) does not automatically mean better "grouping" (embeddings), because RLVR optimizes how the model moves through an existing map instead of redrawing the map.
- Practically, if you want better embeddings, your contrastive training recipe matters far more than whether the backbone was RLVR-tuned for reasoning.
- The HRSA toolkit (CKA, k-NN overlap, Procrustes, cross-model probes) is useful beyond text and could diagnose representation changes in vision or audio models, too.
Why This Research Matters
This work tells practitioners they don't get better embeddings automatically by starting from a reasoning-tuned model. That can save large amounts of compute and time by focusing effort on the contrastive training recipe, data curation, and negative mining. Teams building search, recommendation, and retrieval systems can select cheaper base backbones without losing quality. HRSA offers a practical diagnostic kit to understand how training changes the hidden map inside models, guiding better choices across modalities. The idea of Manifold Realignment also hints that downstream objectives can override some pretraining differences, helping unify model families during deployment.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how you can be great at solving puzzles but not necessarily great at organizing your bookshelf? Solving and sorting are different skills.
The Concept (Embeddings):
- What it is: A text embedding is a vector that places a sentence like "Cats purr" somewhere in a big space so similar ideas land close together.
- How it works: 1) Read the sentence, 2) turn each word into numbers, 3) mix the signals across layers, 4) pool into one vector, 5) use distances to compare meaning.
- Why it matters: Without embeddings, search engines, chat memory, and recommendation systems wouldn't know that "doctor" is closer to "hospital" than to "volcano."
Anchor: When you search for "best way to fix a bike chain," good embeddings help fetch guides about bike repair, not recipes for donuts.
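To make "use distances to compare meaning" concrete, here is a tiny cosine-similarity sketch with made-up vectors (real encoders output hundreds or thousands of dimensions; the numbers below are purely illustrative, not from the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1 = similar, near 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings"; a real encoder would produce these from text.
doctor   = np.array([0.9, 0.1, 0.0, 0.2])
hospital = np.array([0.8, 0.2, 0.1, 0.3])
volcano  = np.array([0.0, 0.1, 0.9, 0.7])

print(cosine_similarity(doctor, hospital))  # high: related concepts land close together
print(cosine_similarity(doctor, volcano))   # low: unrelated concepts stay apart
```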
The world before this paper: People built strong embeddings by adapting large language models (LLMs) and then training them with contrastive learning so that matching pairs (like a question and its answer) pull together and non-matching pairs push apart. At the same time, new "reasoning" models were created using Reinforcement Learning with Verifiable Rewards (RLVR). These models got much better at step-by-step thinking, math chains-of-thought, and careful deliberation.
Hook: Imagine a student earns lots of gold stars for correct step-by-step solutions. You might expect they'll also file their notes better.
The Concept (RLVR):
- What it is: RLVR is a training method that rewards a model for verifiably correct outcomes (like correct math answers), not just for sounding nice.
- How it works: 1) Model attempts a solution, 2) a checker verifies correctness, 3) the model gets a reward or penalty, 4) weights are nudged to increase future correct attempts.
- Why it matters: Without RLVR, models may learn to "sound confident" rather than to "be correct."
Anchor: It's like practicing math with an answer key: you only get points when the final answer is truly right.
This led to a big question: If a model becomes a better thinker (thanks to RLVR), will that also give us better embeddings? It feels intuitive: better reasoning should mean better mapping of meaning, right?
Hook: Suppose you have a treasure map of an island (the "latent manifold"), and you learn to hike better. Do you also get a better map?
The Concept (Latent Manifold):
- What it is: The latent manifold is the "shape" formed by all embeddings in high-dimensional space.
- How it works: 1) Turn many texts into vectors, 2) these vectors form clouds and clusters, 3) the arrangement captures relationships among ideas, 4) the shape guides what's close or far.
- Why it matters: If the manifold is sloppy, similar meanings won't end up near each other, hurting retrieval and clustering.
Anchor: Think of a subway map: stations with similar functions (downtown hubs) cluster; the map's shape guides how you travel.
Before this paper, people tried two main routes to reasoning: SFT (Supervised Fine-Tuning) on labeled reasoning data and RLVR. SFT can strongly change what the model focuses on, sometimes bending the manifold in big ways. RLVR, anchored by rewards and stability constraints, tends to keep the original model's knowledge but polishes how it chooses steps.
Hook: Imagine two study strategies: one rewrites your whole notebook (SFT), the other teaches you better test-taking routes through what you already know (RLVR).
The Concept (SFT):
- What it is: SFT trains a model on examples with the "right" answers so it imitates them.
- How it works: 1) Show input+label pairs, 2) penalize wrong outputs, 3) repeat until outputs match labels.
- Why it matters: Without SFT, a model might never specialize or learn desired task styles.
Anchor: It's like practicing with a tutor who says, "Copy this exact method and answer."
The Concept (Contrastive Learning):
- What it is: Contrastive learning teaches the model to pull matching items together and push mismatched ones apart in embedding space.
- How it works: 1) Pick a "query" and its "positive" match, 2) add "negatives," 3) increase similarity of the positive pair, 4) decrease similarity to negatives.
- Why it matters: Without it, embeddings won't reliably reflect "who belongs with whom."
Anchor: It's like sorting socks: you pair matching socks tightly and keep different socks apart.
The problem: Despite stronger reasoning, do RLVR-tuned backbones actually improve embeddings? The authors tested this carefully by training embedding models from matched base vs. RLVR-tuned backbones using the exact same contrastive recipe and then comparing performances on MTEB (Multilingual and Code) and BRIGHT (reasoning-heavy retrieval).
Failed attempts and intuition clashes: Many expected RLVR to help embeddings. But the experiments showed a "null effect": no consistent gains over base backbones when both are trained equally as embedding encoders.
The gap: Benchmark scores alone can't explain why the results are the same. Are the internal representations truly identical, or do they differ in ways the scores can't see?
Real stakes: This affects how we build efficient systems. If RLVR doesn't boost embeddings, teams can save compute by starting from base models for encoders. It also nudges researchers to focus on the embedding training recipe (data, negatives, pooling) rather than expecting reasoning tweaks to carry over automatically.
02 Core Idea
Hook: You know how two kids can take different paths through a museum but end up seeing the same rooms? The route changes, but the building stays the same.
The Concept (Key Insight):
- What it is: The big idea is that RLVR improves how a model travels through its knowledge but doesn't redraw the overall map that embeddings rely on.
- How it works: 1) RLVR preserves the global shape of the latent manifold, 2) it reorganizes local neighborhoods (who's next to whom), 3) prolonged RLVR can rotate coordinates a bit, 4) when you later do contrastive learning for embeddings, both base and RLVR-initialized models "snap" to a similar global arrangement.
- Why it matters: If the global shape stays the same, embedding quality won't jump just because reasoning improved.
Anchor: Two hikers choose different trails but still tour the same island. Their trail skills got better, but the island's map didn't change.
Multiple analogies for the same idea:
- City Map vs. Driving Style: RLVR is like improving your driving (smoother turns and better merging) while the city map (global geometry) stays fixed. Embeddings care about the map, not your driving flair.
- Library Shelves vs. Reading Strategy: You learn a sharper reading strategy (RLVR), but the library shelves (manifold) mostly keep their order. Embedding tasks depend on shelf order.
- Puzzle Table vs. How You Reach Pieces: RLVR helps you reach the right puzzle pieces more efficiently, but the pieces' positions (global geometry) barely change.
Hook: Imagine a three-level microscope that compares two models: close-up details, overall shape, and what they can do.
The Concept (HRSA):
- What it is: HRSA (Hierarchical Representation Similarity Analysis) compares models at three levels: representation, geometry, and function.
- How it works: 1) Representation: do features line up along the same axes? 2) Geometry: is the overall shape of points the same (globally and locally)? 3) Function: do the same simple readouts still work?
- Why it matters: Benchmarks can hide differences. HRSA reveals what changed and what didn't.
Anchor: It's like checking if two songs match by 1) the instruments (representation), 2) the melody shape (geometry), and 3) whether people can dance the same steps to both (function).
Within HRSA:
Hook: Picture two coordinate grids drawn over the same map.
The Concept (Representation Level):
- What it is: Checks whether two models use similar axes (coordinate basis) for their features.
- How it works: 1) Compare features dimension-by-dimension, 2) allow a single rotation (Procrustes) to see if axes can be matched, 3) measure how "one-to-one" the mapping is.
- Why it matters: If axes drift a lot, features are mixed, which can complicate transfer.
Anchor: Two math notebooks might describe the same triangle but label sides differently; representation-level asks if labels still align.
Hook: Now zoom out to look at the whole shape of the data cloud.
The Concept (Geometry Level: Global vs. Local):
- What it is: Global geometry is the big-picture shape; local geometry is each pointās nearest neighbors.
- How it works: 1) Global shape checked by CKA (are pairwise relations preserved up to rotation/scale?), 2) local checked by k-NN overlap (do nearest neighbors stay the same?).
- Why it matters: Even if the big shape stays similar, small neighbor swaps can change path planning or local decisions.
Anchor: Your city might keep the same districts (global), but some side streets change direction (local).
Hook: Finally, ask what tasks you can do with a simple tool.
The Concept (Function Level):
- What it is: Tests whether a simple linear readout learned on one model still works on the other.
- How it works: 1) Train a linear probe on Model A, 2) freeze it, 3) apply it to Model B, 4) compare performance to A's own.
- Why it matters: If the same readout works, the models are functionally aligned for that task.
Anchor: It's like teaching one vending machine a coin detector and seeing if the same detector also works on the other machine.
What changes because of this idea:
- Before: Many believed better reasoning would bring better embeddings "for free."
- After: We learn that embeddings don't automatically improve because the global manifold (the part embeddings rely on) stays much the same under RLVR.
Why it works (intuition):
- RLVR's rewards guide the model to plan better trajectories without tearing up the map. The KL-like anchors and on-policy constraints tend to keep the model near a safe region of its original representation space.
- Contrastive learning is strong at "snapping" manifolds into a shared alignment when the training data and recipe are the same.
Hook: Imagine two slightly rotated puzzle boards that become aligned when you press them into the same frame.
The Concept (Manifold Realignment):
- What it is: When you train embeddings with the same contrastive recipe, base-initialized and RLVR-initialized models become globally aligned again.
- How it works: 1) Start with two backbones, 2) remove LM head, 3) pool activations, 4) train with identical contrastive data, 5) see the global shapes converge.
- Why it matters: It explains the "null effect" on benchmarks: after alignment, both encoders perform the same.
Anchor: Two choirs that practiced different warmups still sing in tune together once they follow the same conductor and sheet music.
03 Methodology
At a high level: Text inputs → Backbone LLM (Base or RLVR-tuned) → Remove LM head + pooling → Contrastive learning (InfoNCE) → Evaluate on benchmarks + Analyze with HRSA (representation, geometry, function).
Step-by-step recipe:
- Choose matched backbones
- What happens: Pick a base LLM and its RLVR-tuned version (same family, RLVR applied directly to the base). Include SFT pairs for contrast.
- Why this step exists: A fair A/B test requires only one difference, the training (base vs. RLVR/SFT), so we can isolate its effect on embeddings.
- Example: Qwen2.5-1.5B (base) vs. Qwen2.5-1.5B-SimpleRL-Zoo (RLVR), plus SFT controls.
- Turn LLMs into embedding encoders
- What happens: Remove the language modeling head, enable bidirectional attention, take final-layer hidden states, and mean-pool into a fixed-size vector.
- Why this step exists: We want a pure encoder behavior that outputs one vector per text for retrieval/clustering tasks.
- Example: A question and a passage get pooled into 1,536-d vectors you can compare with cosine similarity.
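A minimal mean-pooling sketch in PyTorch, assuming you already have the headless backbone's final-layer hidden states and an attention mask; this is the standard masked-average recipe, not code released with the paper:

```python
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average final-layer hidden states over real (non-padding) tokens.

    last_hidden:    [batch, seq_len, hidden_dim] states from the headless backbone
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # [batch, seq_len, 1]
    summed = (last_hidden * mask).sum(dim=1)      # sum only over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid divide-by-zero on empty rows
    return summed / counts                        # [batch, hidden_dim], e.g. 1,536-d
```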
Hook: Imagine training by pairing matching socks and splitting up mismatched ones.
The Concept (InfoNCE, the Contrastive Objective):
- What it is: A loss that pulls a query close to its true match and pushes it away from negatives.
- How it works: 1) For each query, mark a positive passage, 2) add hard negatives, 3) increase query-positive similarity, 4) decrease query-negative similarity, 5) repeat over many batches.
- Why it matters: Without InfoNCE, embeddings won't consistently reflect semantic closeness.
Anchor: It's like coaching: "these two go together; keep those apart," over and over until the model gets it.
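A minimal InfoNCE sketch in PyTorch using in-batch negatives (each query's positive sits on the diagonal; every other passage in the batch acts as a negative). The temperature default mirrors the 0.02 mentioned in the recipe below, but the function itself is an illustrative re-implementation, not the authors' code:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, passage_emb: torch.Tensor,
             temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb:   [batch, dim] pooled query embeddings
    passage_emb: [batch, dim] pooled passages; passage i is the positive for query i
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # [batch, batch] scaled cosine sims
    labels = torch.arange(q.size(0), device=q.device)   # the correct match is the diagonal
    return F.cross_entropy(logits, labels)              # pull positives up, push the rest down
```

Mined hard negatives would simply be appended as extra columns of `logits` before the cross-entropy step.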
- Train with a strong, identical recipe
- What happens: Use the same data, mining, optimizer, and hyperparameters for both initializations; no LoRA, full-parameter updates.
- Why this step exists: Keeps the playing field level so any performance change comes from the backbone, not the recipe.
- Example: 1.6M training pairs from 11 datasets; mined hard negatives with a 95% positive-aware margin; temperature 0.02; batch size 2048; cosine LR schedule.
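For orientation, the stated recipe can be summarized as a settings sketch like the one below (field names are ours; this paraphrases the numbers above and is not a config file from the paper):

```python
# Illustrative summary of the recipe described above; not the authors' actual config.
training_recipe = {
    "training_pairs": 1_600_000,        # drawn from 11 datasets
    "negatives": "mined hard negatives, 95% positive-aware margin",
    "temperature": 0.02,                # InfoNCE temperature
    "batch_size": 2048,
    "lr_schedule": "cosine",
    "parameter_updates": "full fine-tuning (no LoRA)",
}
```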
- Evaluate embedding performance
- What happens: Test on MTEB (Multilingual v2, Code v1) and BRIGHT for retrieval, clustering, and similarity tasks, including reasoning-heavy retrieval.
- Why this step exists: These broad benchmarks capture general embedding quality and domain-specific performance.
- Example: Report means over 3 seeds; compare deltas base-Emb vs. RLVR-Emb.
- Analyze internals with HRSA
- What happens: Collect token-level, per-layer representations for both models and compare them with representation-, geometry-, and function-level tools.
- Why this step exists: Identical scores can hide very different internal changes; HRSA reveals what's preserved and what's reorganized.
- Example: Compute correlation heatmaps across layers, CKA matrices, k-NN overlap across layers, and cross-model probe transfer.
Now the "tools" used inside HRSA:
Hook: Think of checking if two coordinate grids line up.
The Concept (Orthogonal Procrustes):
- What it is: A way to find the best rotation to align one feature space with another.
- How it works: 1) Solve for the single best rotation between spaces, 2) measure how close that rotation is to a simple permutation (axes match one-to-one) using inverse row entropy, 3) higher means more axis-aligned.
- Why it matters: If a single rotation nearly matches features, then the coordinate basis is preserved.
Anchor: It's like turning a transparency sheet until street names line up over a map.
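A sketch of the Procrustes step over matched activation matrices (the same tokens run through Model A and Model B). The permutation score below is one plausible inverse-row-entropy formulation of "how one-to-one is the rotation" and may differ in detail from the paper's exact definition:

```python
import numpy as np

def procrustes_rotation(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Best orthogonal map R minimizing ||A @ R - B||_F for A, B of shape [n_tokens, dim]."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def permutation_score(R: np.ndarray) -> float:
    """1.0 if R is a permutation (axes map one-to-one); near 0 if rows are dense mixtures."""
    P = np.abs(R) / (np.abs(R).sum(axis=1, keepdims=True) + 1e-12)  # rows -> distributions
    row_entropy = -(P * np.log(P + 1e-12)).sum(axis=1)              # 0 if one-hot, log(d) if uniform
    return float(1.0 - row_entropy.mean() / np.log(R.shape[1]))
```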
Hook: Peek at each feature column to see if it finds a twin.
The Concept (Dimension-wise Correlation):
- What it is: Compares each feature dimension in Model A to the same dimension in Model B.
- How it works: 1) For column j, compute correlation of A_j vs. B_j across tokens, 2) average over j.
- Why it matters: High scores mean many features already match one-to-one without mixing.
Anchor: It's like checking if notebook line #7 in both copies says the same thing.
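A short sketch of dimension-wise correlation: column j of Model A's activations is compared with column j of Model B's for the same tokens, and the per-dimension Pearson correlations are averaged (an illustrative implementation matching the description above):

```python
import numpy as np

def dimensionwise_correlation(A: np.ndarray, B: np.ndarray) -> float:
    """Mean Pearson correlation between matching feature dimensions of A and B.

    A, B: [n_tokens, dim] activations for the same tokens from two models.
    """
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    denom = A_c.std(axis=0) * B_c.std(axis=0) + 1e-12
    per_dim = (A_c * B_c).mean(axis=0) / denom   # correlation of A[:, j] with B[:, j]
    return float(per_dim.mean())
```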
Hook: Step back and check if two point clouds have the same overall shape.
The Concept (Linear CKA):
- What it is: A score that says whether two sets of representations share a similar global geometry (up to rotation/scale).
- How it works: 1) Build pairwise similarity matrices (Gram matrices), 2) center them, 3) measure how aligned they are, 4) higher means more similar global shape.
- Why it matters: If CKA stays high, the big map is stable even if axes rotate.
Anchor: Two constellations look the same even if you slightly rotate the night sky.
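Linear CKA has a compact closed form on centered activation matrices; the sketch below follows the standard formulation (Kornblith et al.) rather than anything specific to this paper:

```python
import numpy as np

def linear_cka(A: np.ndarray, B: np.ndarray) -> float:
    """Linear CKA between activation matrices A, B of shape [n_samples, dim].

    1.0 means the two point clouds share the same global shape up to rotation
    and isotropic scaling; values near 0 mean unrelated geometry.
    """
    A = A - A.mean(axis=0)                        # center each feature column
    B = B - B.mean(axis=0)
    cross = np.linalg.norm(B.T @ A, "fro") ** 2   # alignment of the two Gram structures
    norm_a = np.linalg.norm(A.T @ A, "fro")
    norm_b = np.linalg.norm(B.T @ B, "fro")
    return float(cross / (norm_a * norm_b))
```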
Hook: Now check the nearest friends around each point.
The Concept (k-NN Overlap):
- What it is: A way to see if each vector keeps the same closest neighbors across models.
- How it works: 1) Find top-k neighbors by cosine similarity in Model A and B, 2) compute the Jaccard overlap, 3) average across points.
- Why it matters: Neighbor changes = local geometry reorganization.
Anchor: You moved houses but kept the same city; do you still live next to the same neighbors?
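A sketch of k-NN overlap with cosine neighbors and Jaccard scoring; the choice of k and the cosine metric follow the description above, while the function itself is our own minimal implementation:

```python
import numpy as np

def knn_overlap(A: np.ndarray, B: np.ndarray, k: int = 10) -> float:
    """Average Jaccard overlap of each point's top-k cosine neighbors in A vs. B.

    A, B: [n_points, dim] embeddings of the same texts from two models.
    """
    def topk_neighbors(X: np.ndarray) -> list:
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sims = Xn @ Xn.T
        np.fill_diagonal(sims, -np.inf)                    # a point is not its own neighbor
        return [set(np.argsort(-row)[:k]) for row in sims]

    neighbors_a, neighbors_b = topk_neighbors(A), topk_neighbors(B)
    jaccard = [len(a & b) / len(a | b) for a, b in zip(neighbors_a, neighbors_b)]
    return float(np.mean(jaccard))
```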
Hook: Finally, test whether the same simple detector works on both models.
The Concept (Cross-Model Linear Probe):
- What it is: Train a linear classifier on Model A's embeddings and test it, unchanged, on Model B.
- How it works: 1) Fit probe on A for a task (e.g., news topics), 2) apply it to B's embeddings, 3) compare accuracy drop.
- Why it matters: If the drop is small, both models support the same easy-to-access functions.
Anchor: A key cut for one lock that also opens the other indicates similar internals.
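A sketch of the cross-model probe transfer test using scikit-learn's logistic regression; the variable names and classifier choice are placeholders for whatever linear probe and labeled task you use:

```python
from sklearn.linear_model import LogisticRegression

def probe_transfer(emb_a_train, labels_train, emb_a_test, emb_b_test, labels_test):
    """Train a linear probe on Model A's embeddings, then apply it unchanged to Model B.

    A small drop from acc_on_a to acc_on_b means both models expose the task along
    similar linear directions (functional alignment).
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(emb_a_train, labels_train)             # fit on Model A only
    acc_on_a = probe.score(emb_a_test, labels_test)  # probe evaluated on its home model
    acc_on_b = probe.score(emb_b_test, labels_test)  # same frozen probe on Model B's embeddings
    return acc_on_a, acc_on_b
```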
The secret sauce:
- The identical contrastive recipe and broad evaluation remove confounders.
- HRSA's three lenses pinpoint why scores match: RLVR keeps global geometry and linear readouts stable, but reorganizes local neighborhoods; contrastive learning then realigns global structure across initializations.
04 Experiments & Results
The test: The authors trained pairs of embedding models that differed only in the backbone initialization: base vs. RLVR-tuned (and SFT as a contrasting control). They used the same architecture, pooling, contrastive objective, data, and hyperparameters for both. Performance was measured on MTEB Multilingual v2, MTEB Code v1, and BRIGHT. Then HRSA examined internal similarities across layers.
The competition: Baselines were base-initialized embeddings; comparators were RLVR-initialized embeddings using zero-RL style direct tuning (no SFT warm-start). SFT-initialized embeddings served as a control to illustrate a very different footprint (often altering global geometry more).
Scoreboard with context:
- Embedding quality parity: RLVR-initialized embeddings performed essentially the same as base-initialized ones; performance gaps hovered near zero with tiny standard deviations. Think: two students both scoring 87% while everyone expected the RLVR student to hit 95%.
- SFT differences: In contrast, SFT backbones showed larger shifts (sometimes negative) indicating more aggressive manifold restructuring.
Surprising findings and what HRSA revealed:
- Representation level:
- Dimension-wise correlation: RLVR pairs showed stronger axis-aligned feature correspondence than SFT pairs. Prolonged RLVR could drift axes somewhat, but contrastive training largely restored axis alignment between the resulting embedding models.
- Orthogonal Procrustes: For RLVR pairs, the optimal alignment matrix was close to a permutation (high inverse row entropy), meaning features stayed fairly one-to-one. After contrastive learning, this became even more permutative. SFT pairs, however, needed dense rotations, showing heavy feature mixing.
- Geometry level:
- Linear CKA (global): Stayed high for RLVR pairs but dropped for SFT pairs, indicating RLVR preserves the global manifold shape. After contrastive training, base-Emb and RLVR-Emb moved even closer in CKA, evidence of Manifold Realignment.
- k-NN overlap (local): RLVR preserved local neighborhoods far better than SFT, but overlap was still well below 1. That is, RLVR reshuffled some local neighbors. Notably, contrastive training did not fully undo these local changes; local reorganization remained partly irreversible even as global shapes realigned.
- Function level:
- Cross-model linear probes: Probes trained on one RLVR model transferred better to its pair than SFT probes did, reflecting more stable linear readout directions. For the embedding models, transfer stayed high, confirming functional alignment after contrastive training.
Training dynamics:
- Early "snap": During embedding training, manifold realignment happened quickly (early steps), then stabilized. Meanwhile, k-NN overlap did not fully recover, reinforcing that local reorganization from RLVR is persistent even when the global shape re-aligns.
Concrete numbers in plain words:
- Across the RLVR pairs on MTEB and BRIGHT, mean differences were tiny (e.g., -0.26 ± 0.08 to +0.18 ± 0.04), basically a tie given run-to-run noise.
- SFT comparisons showed bigger shifts and lower CKA, consistent with more destructive or at least more invasive changes to the representation map.
Bottom line results:
- No free lunch from reasoning: Being better at step-by-step thinking (via RLVR) doesn't automatically yield better embeddings.
- RLVR as trajectory optimizer: It tweaks how the model navigates an existing semantic landscape without remapping the terrain.
- Contrastive training is the great aligner: Given the same data and loss, embedding models converge to similar global geometries regardless of RLVR initialization.
05 Discussion & Limitations
Limitations:
- Scope of models and data: Although the study spans several model families, RLVR methods (GRPO, DAPO, PPO-like variants), and big benchmarks (MTEB, BRIGHT), it can't cover every scale, domain, or training schedule. There may be niche settings (e.g., extremely domain-tailored RLVR) where embeddings benefit.
- Metric sensitivity: HRSA uses principled but still imperfect tools (CKA can be gamed in high dimensions; k-NN depends on k and similarity choice). Conclusions rely on triangulating across multiple metrics rather than trusting a single score.
- Task selection: Function-level probes used standard classification; other downstream tasks (e.g., few-shot retrieval with re-ranking or cross-modal fusion) might surface subtler differences.
Required resources:
- Compute: Full-parameter contrastive training at large batch sizes (e.g., 2048) and token-level HRSA across layers require multi-GPU setups and careful engineering (mixed precision, Flash Attention, checkpointing).
- Data curation: High-quality positive/negative pairs and hard-negative mining strongly affect embedding quality and are essential to replicate the findings.
When not to use RLVR as an embedding booster:
- If your only goal is better embeddings, applying RLVR first is unlikely to help and adds cost; invest that compute in better contrastive data, mining, and training.
- If your application hinges on global manifold reshaping (e.g., domain remapping), SFT or other methods may be more appropriate than RLVR.
Open questions:
- Mechanism of local reorganization: Why does RLVR consistently reshuffle local neighborhoods while keeping global shape stable? Which parts of the reward, KL constraints, or curricula control this?
- Tipping points for basis drift: What training lengths or reward mixes trigger coordinate basis drift, and how reversible is it under different downstream objectives?
- Beyond text: Does manifold realignment appear in vision or audio encoders adapted from RLVR-tuned backbones? HRSA provides a path to check.
- Constrained SFT: Can we design SFT with geometry- and basis-aware regularizers to mimic RLVR's "optimize trajectories without redrawing maps" behavior?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that making language models better at reasoning using RLVR does not automatically improve the quality of their text embeddings. Using a new three-level analysis tool (HRSA), the authors find RLVR preserves the global map of meanings while reshuffling some local neighborhoods and, under long training, slightly rotating feature axes. When both base and RLVR backbones are trained as embedding encoders with the same contrastive recipe, they realign globally and perform the same.
Main achievement: Introducing HRSA and using it to uncover Manifold Realignment, which explains the "null effect" by showing RLVR optimizes how models travel through an existing semantic landscape rather than redrawing it.
Future directions: Explore geometry- and basis-aware regularizations for SFT to reproduce RLVR's stable-global/adjusted-local footprint, probe the exact training signals that control local reorganization and coordinate drift, and test HRSA and realignment phenomena in vision and audio.
Why remember this: It resets expectations: better "thinking" isn't the same as better "grouping." If you want stronger embeddings, focus on the contrastive recipe, data quality, and negatives. HRSA gives the community a clear, reusable lens to diagnose how training changes the hidden map inside models.
Practical Applications
- Choose base backbones for new embedding encoders when budgets are tight; skip RLVR unless you need reasoning at inference.
- Invest in better contrastive datasets and hard-negative mining rather than reasoning post-training to lift embedding quality.
- Use HRSA to audit model changes after fine-tuning, checking whether a method preserves global geometry or mixes features.
- Prefer contrastive retraining as a unifier step when merging systems built from slightly different backbones.
- Apply cross-model linear probes to test compatibility between old and new encoders before swapping in production.
- Use k-NN overlap monitoring to detect local geometry shifts that might affect nearest-neighbor search stability.
- If you must use SFT, consider geometry-aware regularizers to avoid destructive global reshaping of the manifold.
- Extend HRSA to vision/audio encoders to debug why two training recipes yield similar or different retrieval results.
- Track training dynamics early (first 200 steps) to confirm expected manifold realignment and stop training sooner.
- Design ablations that vary only initialization (base vs. RLVR) to estimate expected returns on embedding tasks.