Do Reasoning Models Enhance Embedding Models?
Key Summary
- The paper asks a simple question: if a language model becomes better at step-by-step reasoning (using RLVR), do its text embeddings also get better? The short answer is no.
- Across big benchmarks (MTEB Multilingual, MTEB Code, and BRIGHT), embeddings made from reasoning-tuned backbones perform about the same as those made from the original base models.
- To explain this surprise, the authors create HRSA, a three-level microscope (representation, geometry, function) that shows what really changes inside the model.
- HRSA finds that RLVR mostly keeps the global shape of knowledge the same but rearranges nearby relationships (local geometry) and can slightly rotate the coordinate system with long training.
- When both backbones are later trained as embedding models with the same contrastive recipe, they "snap back" to each other at the global level, a phenomenon the authors call Manifold Realignment.
- Supervised Fine-Tuning (SFT) is very different: it actually reshapes the global map, mixes features more, and can hurt transfer of simple linear readouts.
- This means better "thinking" (RLVR reasoning) does not automatically mean better "grouping" (embeddings), because RLVR optimizes how the model moves through an existing map instead of redrawing the map.
- Practically, if you want better embeddings, your contrastive training recipe matters far more than whether the backbone was RLVR-tuned for reasoning.
- The HRSA toolkit (CKA, k-NN overlap, Procrustes, cross-model probes) is useful beyond text and could diagnose representation changes in vision or audio models, too.
Why This Research Matters
This work tells practitioners they don't get better embeddings automatically by starting from a reasoning-tuned model. That can save large amounts of compute and time by focusing effort on the contrastive training recipe, data curation, and negative mining. Teams building search, recommendation, and retrieval systems can select cheaper base backbones without losing quality. HRSA offers a practical diagnostic kit to understand how training changes the hidden map inside models, guiding better choices across modalities. The idea of Manifold Realignment also hints that downstream objectives can override some pretraining differences, helping unify model families during deployment.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how you can be great at solving puzzles but not necessarily great at organizing your bookshelf? Solving and sorting are different skills.
The Concept (Embeddings):
- What it is: A text embedding is a vector that places a sentence like "Cats purr" somewhere in a big space so similar ideas land close together.
- How it works: 1) Read the sentence, 2) turn each word into numbers, 3) mix the signals across layers, 4) pool into one vector, 5) use distances to compare meaning.
- Why it matters: Without embeddings, search engines, chat memory, and recommendation systems wouldn't know that "doctor" is closer to "hospital" than to "volcano."
Anchor: When you search for "best way to fix a bike chain," good embeddings help fetch guides about bike repair, not recipes for donuts.
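To make "use distances to compare meaning" concrete, here is a tiny cosine-similarity sketch with made-up vectors (real encoders output hundreds or thousands of dimensions; the numbers below are purely illustrative, not from the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1 = similar, near 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings"; a real encoder would produce these from text.
doctor   = np.array([0.9, 0.1, 0.0, 0.2])
hospital = np.array([0.8, 0.2, 0.1, 0.3])
volcano  = np.array([0.0, 0.1, 0.9, 0.7])

print(cosine_similarity(doctor, hospital))  # high: related concepts land close together
print(cosine_similarity(doctor, volcano))   # low: unrelated concepts stay apart
```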
The world before this paper: People built strong embeddings by adapting large language models (LLMs) and then training them with contrastive learning so that matching pairs (like a question and its answer) pull together and non-matching pairs push apart. At the same time, new "reasoning" models were created using Reinforcement Learning with Verifiable Rewards (RLVR). These models got much better at step-by-step thinking, math chains-of-thought, and careful deliberation.
Hook: Imagine a student earns lots of gold stars for correct step-by-step solutions. You might expect they'll also file their notes better.
The Concept (RLVR):
- What it is: RLVR is a training method that rewards a model for verifiably correct outcomes (like correct math answers), not just for sounding nice.
- How it works: 1) Model attempts a solution, 2) a checker verifies correctness, 3) the model gets a reward or penalty, 4) weights are nudged to increase future correct attempts.
- Why it matters: Without RLVR, models may learn to "sound confident" rather than to "be correct."
Anchor: It's like practicing math with an answer key: you only get points when the final answer is truly right.
This led to a big question: If a model becomes a better thinker (thanks to RLVR), will that also give us better embeddings? It feels intuitive: better reasoning should mean better mapping of meaning, right?
Hook: Suppose you have a treasure map of an island (the "latent manifold"), and you learn to hike better. Do you also get a better map?
The Concept (Latent Manifold):
- What it is: The latent manifold is the "shape" formed by all embeddings in high-dimensional space.
- How it works: 1) Turn many texts into vectors, 2) these vectors form clouds and clusters, 3) the arrangement captures relationships among ideas, 4) the shape guides what's close or far.
- Why it matters: If the manifold is sloppy, similar meanings won't end up near each other, hurting retrieval and clustering.
Anchor: Think of a subway map: stations with similar functions (downtown hubs) cluster; the map's shape guides how you travel.
Before this paper, people tried two main routes to reasoning: SFT (Supervised Fine-Tuning) on labeled reasoning data and RLVR. SFT can strongly change what the model focuses on, sometimes bending the manifold in big ways. RLVR, anchored by rewards and stability constraints, tends to keep the original model's knowledge but polishes how it chooses steps.
Hook: Imagine two study strategies: one rewrites your whole notebook (SFT), the other teaches you better test-taking routes through what you already know (RLVR).
The Concept (SFT):
- What it is: SFT trains a model on examples with the "right" answers so it imitates them.
- How it works: 1) Show input+label pairs, 2) penalize wrong outputs, 3) repeat until outputs match labels.
- Why it matters: Without SFT, a model might never specialize or learn desired task styles.
Anchor: It's like practicing with a tutor who says, "Copy this exact method and answer."
The Concept (Contrastive Learning):
- What it is: Contrastive learning teaches the model to pull matching items together and push mismatched ones apart in embedding space.
- How it works: 1) Pick a "query" and its "positive" match, 2) add "negatives," 3) increase similarity of the positive pair, 4) decrease similarity to negatives.
- Why it matters: Without it, embeddings won't reliably reflect "who belongs with whom."
Anchor: It's like sorting socks: you pair matching socks tightly and keep different socks apart.
The problem: Despite stronger reasoning, do RLVR-tuned backbones actually improve embeddings? The authors tested this carefully by training embedding models from matched base vs. RLVR-tuned backbones using the exact same contrastive recipe and then comparing performances on MTEB (Multilingual and Code) and BRIGHT (reasoning-heavy retrieval).
Failed attempts and intuition clashes: Many expected RLVR to help embeddings. But the experiments showed a "null effect": no consistent gains over base backbones when both are trained equally as embedding encoders.
The gap: Benchmark scores alone can't explain why the results are the same. Are the internal representations truly identical, or do they differ in ways the scores can't see?
Real stakes: This affects how we build efficient systems. If RLVR doesn't boost embeddings, teams can save compute by starting from base models for encoders. It also nudges researchers to focus on the embedding training recipe (data, negatives, pooling) rather than expecting reasoning tweaks to carry over automatically.
02 Core Idea
Hook: You know how two kids can take different paths through a museum but end up seeing the same rooms? The route changes, but the building stays the same.
The Concept (Key Insight):
- What it is: The big idea is that RLVR improves how a model travels through its knowledge but doesn't redraw the overall map that embeddings rely on.
- How it works: 1) RLVR preserves the global shape of the latent manifold, 2) it reorganizes local neighborhoods (who's next to whom), 3) prolonged RLVR can rotate coordinates a bit, 4) when you later do contrastive learning for embeddings, both base and RLVR-initialized models "snap" to a similar global arrangement.
- Why it matters: If the global shape stays the same, embedding quality won't jump just because reasoning improved.
Anchor: Two hikers choose different trails but still tour the same island. Their trail skills got better, but the island's map didn't change.
Multiple analogies for the same idea:
- City Map vs. Driving Style: RLVR is like improving your driving (smoother turns and better merging) while the city map (global geometry) stays fixed. Embeddings care about the map, not your driving flair.
- Library Shelves vs. Reading Strategy: You learn a sharper reading strategy (RLVR), but the library shelves (manifold) mostly keep their order. Embedding tasks depend on shelf order.
- Puzzle Table vs. How You Reach Pieces: RLVR helps you reach the right puzzle pieces more efficiently, but the pieces' positions (global geometry) barely change.
Hook: Imagine a three-level microscope that compares two models: close-up details, overall shape, and what they can do.
The Concept (HRSA):
- What it is: HRSA (Hierarchical Representation Similarity Analysis) compares models at three levels: representation, geometry, and function.
- How it works: 1) Representation: do features line up along the same axes? 2) Geometry: is the overall shape of points the same (globally and locally)? 3) Function: do the same simple readouts still work?
- Why it matters: Benchmarks can hide differences. HRSA reveals what changed and what didn't.
Anchor: It's like checking if two songs match by 1) the instruments (representation), 2) the melody shape (geometry), and 3) whether people can dance the same steps to both (function).
Within HRSA:
Hook: Picture two coordinate grids drawn over the same map.
The Concept (Representation Level):
- What it is: Checks whether two models use similar axes (coordinate basis) for their features.
- How it works: 1) Compare features dimension-by-dimension, 2) allow a single rotation (Procrustes) to see if axes can be matched, 3) measure how "one-to-one" the mapping is.
- Why it matters: If axes drift a lot, features are mixed, which can complicate transfer.
Anchor: Two math notebooks might describe the same triangle but label sides differently; representation-level asks if labels still align.
Hook: Now zoom out to look at the whole shape of the data cloud.
The Concept (Geometry Level: Global vs. Local):
- What it is: Global geometry is the big-picture shape; local geometry is each pointās nearest neighbors.
- How it works: 1) Global shape checked by CKA (are pairwise relations preserved up to rotation/scale?), 2) local checked by k-NN overlap (do nearest neighbors stay the same?).
- Why it matters: Even if the big shape stays similar, small neighbor swaps can change path planning or local decisions.
Anchor: Your city might keep the same districts (global), but some side streets change direction (local).
Hook: Finally, ask what tasks you can do with a simple tool.
The Concept (Function Level):
- What it is: Tests whether a simple linear readout learned on one model still works on the other.
- How it works: 1) Train a linear probe on Model A, 2) freeze it, 3) apply it to Model B, 4) compare performance to A's own.
- Why it matters: If the same readout works, the models are functionally aligned for that task.
Anchor: It's like teaching one vending machine a coin detector and seeing if the same detector also works on the other machine.
What changes because of this idea:
- Before: Many believed better reasoning would bring better embeddings "for free."
- After: We learn that embeddings don't automatically improve because the global manifold (the part embeddings rely on) stays much the same under RLVR.
Why it works (intuition):
- RLVR's rewards guide the model to plan better trajectories without tearing up the map. The KL-like anchors and on-policy constraints tend to keep the model near a safe region of its original representation space.
- Contrastive learning is strong at "snapping" manifolds into a shared alignment when the training data and recipe are the same.
Hook: Imagine two slightly rotated puzzle boards that become aligned when you press them into the same frame.
The Concept (Manifold Realignment):
- What it is: When you train embeddings with the same contrastive recipe, base-initialized and RLVR-initialized models become globally aligned again.
- How it works: 1) Start with two backbones, 2) remove LM head, 3) pool activations, 4) train with identical contrastive data, 5) see the global shapes converge.
- Why it matters: It explains the "null effect" on benchmarks: after alignment, both encoders perform the same.
Anchor: Two choirs that practiced different warmups still sing in tune together once they follow the same conductor and sheet music.
03 Methodology
At a high level: Text inputs → Backbone LLM (Base or RLVR-tuned) → Remove LM head + pooling → Contrastive learning (InfoNCE) → Evaluate on benchmarks + Analyze with HRSA (representation, geometry, function).
Step-by-step recipe:
- Choose matched backbones
- What happens: Pick a base LLM and its RLVR-tuned version (same family, RLVR applied directly to the base). Include SFT pairs for contrast.
- Why this step exists: A fair A/B test requires only one difference, the training (base vs. RLVR/SFT), so we can isolate its effect on embeddings.
- Example: Qwen2.5-1.5B (base) vs. Qwen2.5-1.5B-SimpleRL-Zoo (RLVR), plus SFT controls.
- Turn LLMs into embedding encoders
- What happens: Remove the language modeling head, enable bidirectional attention, take final-layer hidden states, and mean-pool into a fixed-size vector.
- Why this step exists: We want a pure encoder behavior that outputs one vector per text for retrieval/clustering tasks.
- Example: A question and a passage get pooled into 1,536-d vectors you can compare with cosine similarity.
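A minimal mean-pooling sketch in PyTorch, assuming you already have the headless backbone's final-layer hidden states and an attention mask; this is the standard masked-average recipe, not code released with the paper:

```python
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average final-layer hidden states over real (non-padding) tokens.

    last_hidden:    [batch, seq_len, hidden_dim] states from the headless backbone
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # [batch, seq_len, 1]
    summed = (last_hidden * mask).sum(dim=1)      # sum only over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid divide-by-zero on empty rows
    return summed / counts                        # [batch, hidden_dim], e.g. 1,536-d
```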
Hook: Imagine training by pairing matching socks and splitting up mismatched ones.
The Concept (InfoNCE, the Contrastive Objective):
- What it is: A loss that pulls a query close to its true match and pushes it away from negatives.
- How it works: 1) For each query, mark a positive passage, 2) add hard negatives, 3) increase query-positive similarity, 4) decrease query-negative similarity, 5) repeat over many batches.
- Why it matters: Without InfoNCE, embeddings won't consistently reflect semantic closeness.
Anchor: It's like coaching: "these two go together; keep those apart," over and over until the model gets it.
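A minimal InfoNCE sketch in PyTorch using in-batch negatives (each query's positive sits on the diagonal; every other passage in the batch acts as a negative). The temperature default mirrors the 0.02 mentioned in the recipe below, but the function itself is an illustrative re-implementation, not the authors' code:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, passage_emb: torch.Tensor,
             temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb:   [batch, dim] pooled query embeddings
    passage_emb: [batch, dim] pooled passages; passage i is the positive for query i
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # [batch, batch] scaled cosine sims
    labels = torch.arange(q.size(0), device=q.device)   # the correct match is the diagonal
    return F.cross_entropy(logits, labels)              # pull positives up, push the rest down
```

Mined hard negatives would simply be appended as extra columns of `logits` before the cross-entropy step.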
- Train with a strong, identical recipe
- What happens: Use the same data, mining, optimizer, and hyperparameters for both initializations; no LoRA, full-parameter updates.
- Why this step exists: Keeps the playing field level so any performance change comes from the backbone, not the recipe.
- Example: 1.6M training pairs from 11 datasets; mined hard negatives with a 95% positive-aware margin; temperature 0.02; batch size 2048; cosine LR schedule.
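For orientation, the stated recipe can be summarized as a settings sketch like the one below (field names are ours; this paraphrases the numbers above and is not a config file from the paper):

```python
# Illustrative summary of the recipe described above; not the authors' actual config.
training_recipe = {
    "training_pairs": 1_600_000,        # drawn from 11 datasets
    "negatives": "mined hard negatives, 95% positive-aware margin",
    "temperature": 0.02,                # InfoNCE temperature
    "batch_size": 2048,
    "lr_schedule": "cosine",
    "parameter_updates": "full fine-tuning (no LoRA)",
}
```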
- Evaluate embedding performance
- What happens: Test on MTEB (Multilingual v2, Code v1) and BRIGHT for retrieval, clustering, and similarity tasks, including reasoning-heavy retrieval.
- Why this step exists: These broad benchmarks capture general embedding quality and domain-specific performance.
- Example: Report means over 3 seeds; compare deltas base-Emb vs. RLVR-Emb.
- Analyze internals with HRSA
- What happens: Collect token-level, per-layer representations for both models and compare them with representation-, geometry-, and function-level tools.
- Why this step exists: Identical scores can hide very different internal changes; HRSA reveals what's preserved and what's reorganized.
- Example: Compute correlation heatmaps across layers, CKA matrices, k-NN overlap across layers, and cross-model probe transfer.
Now the "tools" used inside HRSA:
Hook: Think of checking if two coordinate grids line up.
The Concept (Orthogonal Procrustes):
- What it is: A way to find the best rotation to align one feature space with another.
- How it works: 1) Solve for the single best rotation between spaces, 2) measure how close that rotation is to a simple permutation (axes match one-to-one) using inverse row entropy, 3) higher means more axis-aligned.
- Why it matters: If a single rotation nearly matches features, then the coordinate basis is preserved.
Anchor: It's like turning a transparency sheet until street names line up over a map.
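A sketch of the Procrustes step over matched activation matrices (the same tokens run through Model A and Model B). The permutation score below is one plausible inverse-row-entropy formulation of "how one-to-one is the rotation" and may differ in detail from the paper's exact definition:

```python
import numpy as np

def procrustes_rotation(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Best orthogonal map R minimizing ||A @ R - B||_F for A, B of shape [n_tokens, dim]."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def permutation_score(R: np.ndarray) -> float:
    """1.0 if R is a permutation (axes map one-to-one); near 0 if rows are dense mixtures."""
    P = np.abs(R) / (np.abs(R).sum(axis=1, keepdims=True) + 1e-12)  # rows -> distributions
    row_entropy = -(P * np.log(P + 1e-12)).sum(axis=1)              # 0 if one-hot, log(d) if uniform
    return float(1.0 - row_entropy.mean() / np.log(R.shape[1]))
```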
Hook: Peek at each feature column to see if it finds a twin.
The Concept (Dimension-wise Correlation):
- What it is: Compares each feature dimension in Model A to the same dimension in Model B.
- How it works: 1) For column j, compute correlation of A_j vs. B_j across tokens, 2) average over j.
- Why it matters: High scores mean many features already match one-to-one without mixing.
Anchor: It's like checking if notebook line #7 in both copies says the same thing.
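A short sketch of dimension-wise correlation: column j of Model A's activations is compared with column j of Model B's for the same tokens, and the per-dimension Pearson correlations are averaged (an illustrative implementation matching the description above):

```python
import numpy as np

def dimensionwise_correlation(A: np.ndarray, B: np.ndarray) -> float:
    """Mean Pearson correlation between matching feature dimensions of A and B.

    A, B: [n_tokens, dim] activations for the same tokens from two models.
    """
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    denom = A_c.std(axis=0) * B_c.std(axis=0) + 1e-12
    per_dim = (A_c * B_c).mean(axis=0) / denom   # correlation of A[:, j] with B[:, j]
    return float(per_dim.mean())
```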
Hook: Step back and check if two point clouds have the same overall shape.
The Concept (Linear CKA):
- What it is: A score that says whether two sets of representations share a similar global geometry (up to rotation/scale).
- How it works: 1) Build pairwise similarity matrices (Gram matrices), 2) center them, 3) measure how aligned they are, 4) higher means more similar global shape.
- Why it matters: If CKA stays high, the big map is stable even if axes rotate.
Anchor: Two constellations look the same even if you slightly rotate the night sky.
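Linear CKA has a compact closed form on centered activation matrices; the sketch below follows the standard formulation (Kornblith et al.) rather than anything specific to this paper:

```python
import numpy as np

def linear_cka(A: np.ndarray, B: np.ndarray) -> float:
    """Linear CKA between activation matrices A, B of shape [n_samples, dim].

    1.0 means the two point clouds share the same global shape up to rotation
    and isotropic scaling; values near 0 mean unrelated geometry.
    """
    A = A - A.mean(axis=0)                        # center each feature column
    B = B - B.mean(axis=0)
    cross = np.linalg.norm(B.T @ A, "fro") ** 2   # alignment of the two Gram structures
    norm_a = np.linalg.norm(A.T @ A, "fro")
    norm_b = np.linalg.norm(B.T @ B, "fro")
    return float(cross / (norm_a * norm_b))
```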
Hook: Now check the nearest friends around each point.
The Concept (k-NN Overlap):
- What it is: A way to see if each vector keeps the same closest neighbors across models.
- How it works: 1) Find top-k neighbors by cosine similarity in Model A and B, 2) compute the Jaccard overlap, 3) average across points.
- Why it matters: Neighbor changes = local geometry reorganization.
Anchor: You moved houses but kept the same city; do you still live next to the same neighbors?
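A sketch of k-NN overlap with cosine neighbors and Jaccard scoring; the choice of k and the cosine metric follow the description above, while the function itself is our own minimal implementation:

```python
import numpy as np

def knn_overlap(A: np.ndarray, B: np.ndarray, k: int = 10) -> float:
    """Average Jaccard overlap of each point's top-k cosine neighbors in A vs. B.

    A, B: [n_points, dim] embeddings of the same texts from two models.
    """
    def topk_neighbors(X: np.ndarray) -> list:
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sims = Xn @ Xn.T
        np.fill_diagonal(sims, -np.inf)                    # a point is not its own neighbor
        return [set(np.argsort(-row)[:k]) for row in sims]

    neighbors_a, neighbors_b = topk_neighbors(A), topk_neighbors(B)
    jaccard = [len(a & b) / len(a | b) for a, b in zip(neighbors_a, neighbors_b)]
    return float(np.mean(jaccard))
```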
Hook: Finally, test whether the same simple detector works on both models.
The Concept (Cross-Model Linear Probe):
- What it is: Train a linear classifier on Model A's embeddings and test it, unchanged, on Model B.
- How it works: 1) Fit probe on A for a task (e.g., news topics), 2) apply it to B's embeddings, 3) compare accuracy drop.
- Why it matters: If the drop is small, both models support the same easy-to-access functions.
Anchor: A key cut for one lock that also opens the other indicates similar internals.
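A sketch of the cross-model probe transfer test using scikit-learn's logistic regression; the variable names and classifier choice are placeholders for whatever linear probe and labeled task you use:

```python
from sklearn.linear_model import LogisticRegression

def probe_transfer(emb_a_train, labels_train, emb_a_test, emb_b_test, labels_test):
    """Train a linear probe on Model A's embeddings, then apply it unchanged to Model B.

    A small drop from acc_on_a to acc_on_b means both models expose the task along
    similar linear directions (functional alignment).
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(emb_a_train, labels_train)             # fit on Model A only
    acc_on_a = probe.score(emb_a_test, labels_test)  # probe evaluated on its home model
    acc_on_b = probe.score(emb_b_test, labels_test)  # same frozen probe on Model B's embeddings
    return acc_on_a, acc_on_b
```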
The secret sauce:
- The identical contrastive recipe and broad evaluation remove confounders.
- HRSA's three lenses pinpoint why scores match: RLVR keeps global geometry and linear readouts stable, but reorganizes local neighborhoods; contrastive learning then realigns global structure across initializations.
04 Experiments & Results
The test: The authors trained pairs of embedding models that differed only in the backbone initialization: base vs. RLVR-tuned (and SFT as a contrasting control). They used the same architecture, pooling, contrastive objective, data, and hyperparameters for both. Performance was measured on MTEB Multilingual v2, MTEB Code v1, and BRIGHT. Then HRSA examined internal similarities across layers.
The competition: Baselines were base-initialized embeddings; comparators were RLVR-initialized embeddings using zero-RL style direct tuning (no SFT warm-start). SFT-initialized embeddings served as a control to illustrate a very different footprint (often altering global geometry more).
Scoreboard with context:
- Embedding quality parity: RLVR-initialized embeddings performed essentially the same as base-initialized ones; performance gaps hovered near zero with tiny standard deviations. Think: two students both scoring 87% while everyone expected the RLVR student to hit 95%.
- SFT differences: In contrast, SFT backbones showed larger shifts (sometimes negative) indicating more aggressive manifold restructuring.
Surprising findings and what HRSA revealed:
- Representation level:
- Dimension-wise correlation: RLVR pairs showed stronger axis-aligned feature correspondence than SFT pairs. Prolonged RLVR could drift axes somewhat, but contrastive training largely restored axis alignment between the resulting embedding models.
- Orthogonal Procrustes: For RLVR pairs, the optimal alignment matrix was close to a permutation (high inverse row entropy), meaning features stayed fairly one-to-one. After contrastive learning, this became even more permutative. SFT pairs, however, needed dense rotations, showing heavy feature mixing.
- Geometry level:
- Linear CKA (global): Stayed high for RLVR pairs but dropped for SFT pairs, indicating RLVR preserves the global manifold shape. After contrastive training, base-Emb and RLVR-Emb moved even closer in CKA, evidence of Manifold Realignment.
- k-NN overlap (local): RLVR preserved local neighborhoods far better than SFT, but overlap was still well below 1. That is, RLVR reshuffled some local neighbors. Notably, contrastive training did not fully undo these local changes; local reorganization remained partly irreversible even as global shapes realigned.
- Function level:
- Cross-model linear probes: Probes trained on one RLVR model transferred better to its pair than SFT probes did, reflecting more stable linear readout directions. For the embedding models, transfer stayed high, confirming functional alignment after contrastive training.
Training dynamics:
- Early "snap": During embedding training, manifold realignment happened quickly (early steps), then stabilized. Meanwhile, k-NN overlap did not fully recover, reinforcing that local reorganization from RLVR is persistent even when the global shape re-aligns.
Concrete numbers in plain words:
- Across the RLVR pairs on MTEB and BRIGHT, mean differences were tiny (e.g., -0.26 ± 0.08 to +0.18 ± 0.04), basically a tie given run-to-run noise.
- SFT comparisons showed bigger shifts and lower CKA, consistent with more destructive or at least more invasive changes to the representation map.
Bottom line results:
- No free lunch from reasoning: Being better at step-by-step thinking (via RLVR) doesn't automatically yield better embeddings.
- RLVR as trajectory optimizer: It tweaks how the model navigates an existing semantic landscape without remapping the terrain.
- Contrastive training is the great aligner: Given the same data and loss, embedding models converge to similar global geometries regardless of RLVR initialization.
05 Discussion & Limitations
Limitations:
- Scope of models and data: Although the study spans several model families, RLVR methods (GRPO, DAPO, PPO-like variants), and big benchmarks (MTEB, BRIGHT), it can't cover every scale, domain, or training schedule. There may be niche settings (e.g., extremely domain-tailored RLVR) where embeddings benefit.
- Metric sensitivity: HRSA uses principled but still imperfect tools (CKA can be gamed in high dimensions; k-NN depends on k and similarity choice). Conclusions rely on triangulating across multiple metrics rather than trusting a single score.
- Task selection: Function-level probes used standard classification; other downstream tasks (e.g., few-shot retrieval with re-ranking or cross-modal fusion) might surface subtler differences.
Required resources:
- Compute: Full-parameter contrastive training at large batch sizes (e.g., 2048) and token-level HRSA across layers require multi-GPU setups and careful engineering (mixed precision, Flash Attention, checkpointing).
- Data curation: High-quality positive/negative pairs and hard-negative mining strongly affect embedding quality and are essential to replicate the findings.
When not to use RLVR as an embedding booster:
- If your only goal is better embeddings, applying RLVR first is unlikely to help and adds cost; invest that compute in better contrastive data, mining, and training.
- If your application hinges on global manifold reshaping (e.g., domain remapping), SFT or other methods may be more appropriate than RLVR.
Open questions:
- Mechanism of local reorganization: Why does RLVR consistently reshuffle local neighborhoods while keeping global shape stable? Which parts of the reward, KL constraints, or curricula control this?
- Tipping points for basis drift: What training lengths or reward mixes trigger coordinate basis drift, and how reversible is it under different downstream objectives?
- Beyond text: Does manifold realignment appear in vision or audio encoders adapted from RLVR-tuned backbones? HRSA provides a path to check.
- Constrained SFT: Can we design SFT with geometry- and basis-aware regularizers to mimic RLVR's "optimize trajectories without redrawing maps" behavior?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that making language models better at reasoning using RLVR does not automatically improve the quality of their text embeddings. Using a new three-level analysis tool (HRSA), the authors find RLVR preserves the global map of meanings while reshuffling some local neighborhoods and, under long training, slightly rotating feature axes. When both base and RLVR backbones are trained as embedding encoders with the same contrastive recipe, they realign globally and perform the same.
Main achievement: Introducing HRSA and using it to uncover Manifold Realignment, which explains the "null effect" by showing RLVR optimizes how models travel through an existing semantic landscape rather than redrawing it.
Future directions: Explore geometry- and basis-aware regularizations for SFT to reproduce RLVR's stable-global/adjusted-local footprint, probe the exact training signals that control local reorganization and coordinate drift, and test HRSA and realignment phenomena in vision and audio.
Why remember this: It resets expectations: better "thinking" isn't the same as better "grouping." If you want stronger embeddings, focus on the contrastive recipe, data quality, and negatives. HRSA gives the community a clear, reusable lens to diagnose how training changes the hidden map inside models.
Practical Applications
- Choose base backbones for new embedding encoders when budgets are tight; skip RLVR unless you need reasoning at inference.
- Invest in better contrastive datasets and hard-negative mining rather than reasoning post-training to lift embedding quality.
- Use HRSA to audit model changes after fine-tuning, checking whether a method preserves global geometry or mixes features.
- Prefer contrastive retraining as a unifier step when merging systems built from slightly different backbones.
- Apply cross-model linear probes to test compatibility between old and new encoders before swapping in production.
- Use k-NN overlap monitoring to detect local geometry shifts that might affect nearest-neighbor search stability.
- If you must use SFT, consider geometry-aware regularizers to avoid destructive global reshaping of the manifold.
- Extend HRSA to vision/audio encoders to debug why two training recipes yield similar or different retrieval results.
- Track training dynamics early (first 200 steps) to confirm expected manifold realignment and stop training sooner.
- Design ablations that vary only initialization (base vs. RLVR) to estimate expected returns on embedding tasks.