Latent Adversarial Regularization for Offline Preference Optimization
Key Summary
- This paper introduces GANPO, a new way to train language models from human preferences by guiding the model using its hidden thoughts (latent space) instead of just its visible words (token space).
- GANPO adds a small helper model (a discriminator) that learns to tell whether the model's hidden thoughts look like a trusted reference model's hidden thoughts.
- The main trick is adversarial training in latent space, which gently pushes the model to keep a good internal structure while still learning what people prefer.
- Compared to standard methods like DPO and SimPO, GANPO improves win rates on AlpacaEval-2.0 without making answers longer or more verbose.
- GANPO stays strong when sampling becomes noisy (high temperature), keeping instructions and structure intact better than token-only training.
- Two discriminators are used: one for "good" answers and one for "bad" answers, so the model learns both what to do and what to avoid.
- On downstream tasks (math, knowledge, reasoning, factuality), GANPO matches or slightly improves performance, showing it doesn't overfit to the preference data.
- The extra compute is modest because the reference model is already used, and the discriminators are lightweight.
- GANPO's latent regularization acts like a geometry-preserving guardrail, reducing reward hacking and length bias that can happen with token-only methods.
Why This Research Matters
In everyday use, we want chatbots that stay clear, accurate, and polite even when asked tricky or creative questions. GANPO helps by shaping the model's hidden thinking, not just its surface words, so it keeps a solid plan and structure inside its head. This reduces brittleness when we sample with higher randomness, which is important for brainstorming or diverse content creation. It also curbs reward hacking and length bias, making answers genuinely better rather than just longer. Because GANPO adds only modest compute and plugs into existing preference pipelines, it's practical for real systems. In the long run, this kind of structure-aware training can improve formatting, safety, and reliability across many applications.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how two essays can use very different words but still mean the same thing? Like "Hi there!" and "Good morning to you!" They look different on the surface, but their meanings are close.
The Concept: Preference Optimization (What it is) is how we train language models to choose responses that people like, using pairs of examples: one better (chosen) and one worse (rejected). How it works (recipe): 1) Collect lots of pairs from people saying which answer they prefer. 2) Show the model both answers for the same prompt. 3) Nudge the model toward the better answer and away from the worse one. 4) Keep it from drifting too far from a trusted reference model. Why it matters: Without it, models can say things people don't want: too long, off-topic, or unsafe.
Bottom Bread (Anchor): Imagine a teacher holding up two drawings from the same student: one follows the instructions better. The teacher says, "Pick this one next time." That's preference optimization.
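For readers who want to see the "nudge" as math, here is a minimal sketch of the DPO-style pairwise loss this family of methods builds on; the function and variable names are illustrative, and the inputs are assumed to be summed log-probabilities of each response under the policy and the frozen reference.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise preference loss.

    Each argument is a (batch,) tensor holding the summed log-probability
    of the chosen/rejected response under the trainable policy or the
    frozen reference model; beta controls how hard we push.
    """
    # How much more the policy likes each answer than the reference does
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Reward the policy for widening the gap in favor of the chosen answer
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```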
Top Bread (Hook): Imagine judging a song only by the letters in its lyrics, not by the tune. You'd miss the real feel of the music.
The Concept: Token-Space Regularization with KL Divergence (What it is) is a common guardrail that keeps the model's word choices close to a reference model's word choices. How it works: 1) Compare the model's next-word probabilities to the reference model's. 2) Measure the difference (KL divergence). 3) Add a penalty if the model drifts too far. 4) Balance that penalty with learning people's preferences. Why it matters: Without a guardrail, the model can chase short-term points ("reward hacking"), like getting longer and longer, instead of genuinely better. But token-level closeness sometimes mistakes different words with the same meaning as "far apart."
Bottom Bread (Anchor): "Hi there" vs. "Good morning to you" look far apart in tokens but close in meaning, yet KL only sees the word difference.
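As a concrete picture of that guardrail, here is a minimal sketch of a per-token KL penalty between a policy and a frozen reference; shapes and names are illustrative.

```python
import torch.nn.functional as F

def token_kl_penalty(policy_logits, ref_logits):
    """Average per-token KL(policy || reference).

    policy_logits, ref_logits: (batch, seq_len, vocab) next-token logits
    from the trainable policy and the frozen reference model.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL at each position: sum_v p(v) * (log p(v) - log q(v))
    kl_per_token = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)
    return kl_per_token.mean()
```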
Top Bread (Hook): Think about how your brain has hidden thoughts (ideas and plans) while writing, before you pick exact words.
The Concept: Latent Space (What it is) is the model's hidden representation of meaning and structure, like its internal "thought cloud." How it works: 1) The model reads a prompt and builds hidden states layer by layer. 2) These hidden states capture meaning, style, and plan. 3) The final hidden layer summarizes what to say next. Why it matters: If we guide the hidden thoughts, we can shape meaning and structure directly, not only the surface words.
Bottom Bread (Anchor): It's like checking a student's outline, not just the final essay: it shows whether their thinking is on track.
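In code, "checking the outline" amounts to reading the model's hidden states. Here is a minimal sketch using the Hugging Face transformers API; the checkpoint name is only a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a polite email to reschedule.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one (batch, seq_len, hidden_dim) tensor per
# layer (plus the embeddings); the last entry is the "thought cloud" that
# latent-space methods like GANPO look at.
last_hidden = outputs.hidden_states[-1]
```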
Top Bread (Hook): Suppose two art teachers: one looks only at brushstrokes (tokens); the other looks at the sketch underneath (latent plan). Which one better preserves the artist's intent?
The Concept: Latent-Space Regularization (What it is) adds a penalty when the policy model's hidden thoughts drift from a well-behaved reference model's hidden thoughts. How it works: 1) Take the hidden states from both models for the same prompt+answer. 2) Measure how different these hidden states' distributions are. 3) Nudge the policy to keep similar internal structure. Why it matters: Without it, the model may look good on the surface but lose stable structure, especially under noise (high temperature sampling).
Bottom Bread (Anchor): If a student's outline stays clear and similar to a trusted template, their final essay stays organized even when they try creative wordings.
Top Bread (Hook): Picture a friendly contest: a forger makes paintings, and a judge tries to tell if they match the museum's style. Both get better together.
The Concept: Generative Adversarial Networks (GANs) (What it is) are a pair of models, a generator and a discriminator, that push each other to improve. How it works: 1) The generator creates samples. 2) The discriminator tries to tell them apart from real samples. 3) The generator learns to fool the discriminator. 4) Over time, the generator's samples become more realistic. Why it matters: This setup can compare distributions without needing exact probability formulas.
Bottom Bread (Anchor): The forger keeps improving until the judge can't tell the difference; that's the generator learning a real style.
Top Bread (Hook): Imagine a referee who doesn't just say "real or fake," but "is this better than the average of the other team?"
The Concept: Relativistic Average Divergence (What it is) is a stable way to train the discriminator so it judges samples relative to the average of the other side. How it works: 1) Compute a score for a sample. 2) Compare it to the average score from the other distribution. 3) Use this relative difference to train. 4) This makes training smoother and less jumpy. Why it matters: Stability matters; otherwise adversarial training can wobble or collapse.
Bottom Bread (Anchor): It's like ranking a runner compared to the field's average, not just absolute time on one race day: less noisy, more stable.
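To ground both ideas (the GAN game and the relativistic average), here is a minimal, hedged sketch of relativistic-average discriminator and generator losses in the spirit of RaGAN; GANPO's exact objective may differ, and `critic` stands for any network that returns one score per example.

```python
import torch
import torch.nn.functional as F

def relativistic_avg_d_loss(critic, real, fake):
    """Relativistic-average discriminator loss (RaGAN-style).

    Instead of "is this sample real?", the critic is trained to score
    real samples above the average fake and fakes below the average real.
    real, fake: batches the critic maps to one scalar score each.
    """
    real_scores = critic(real)
    fake_scores = critic(fake)
    loss_real = F.binary_cross_entropy_with_logits(
        real_scores - fake_scores.mean(), torch.ones_like(real_scores))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_scores - real_scores.mean(), torch.zeros_like(fake_scores))
    return loss_real + loss_fake

def relativistic_avg_g_loss(critic, real, fake):
    """Generator side of the same game: targets are flipped, so the
    generator is rewarded when its fakes score above the real average."""
    real_scores = critic(real)
    fake_scores = critic(fake)
    loss_real = F.binary_cross_entropy_with_logits(
        real_scores - fake_scores.mean(), torch.zeros_like(real_scores))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_scores - real_scores.mean(), torch.ones_like(fake_scores))
    return loss_real + loss_fake
```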
Top Bread (Hook): What if we could teach a model with both taste tests and shape guides, so it knows what people like and also how to keep its inner structure tidy?
The Concept: GANPO (What it is) is a plug-in regularizer for offline preference optimization that uses a GAN-style discriminator in latent space to keep internal structure aligned with a reference model while learning preferences. How it works: 1) Train with preference pairs (chosen vs. rejected). 2) Extract hidden states from both the policy and the reference. 3) Two discriminators learn to tell "good" and "bad" hidden patterns apart. 4) The policy is nudged to match the reference's good structure and avoid its bad structure, while still optimizing the usual preference loss (like DPO/SimPO). Why it matters: Without this, models may be brittle under noise, over-verbose, or learn surface tricks; GANPO preserves deep structure and robustness.
Bottom Bread (Anchor): It's like learning to cook by tasting (preferences) and also keeping your kitchen workflow organized (latent structure), so dishes stay great even on a busy night.
02 Core Idea
Top Bread (Hook): Imagine teaching a dancer: you don't just score their final pose; you also guide their body balance throughout the routine so the ending is graceful no matter the music speed.
The Concept: The "Aha!" Moment (What it is) is to regularize in the model's hidden thought space, not only in word space, by playing a friendly adversarial game that keeps the policy's internal structure close to a stable reference while learning human preferences. How it works: 1) Keep the usual offline preference loss (like DPO/SimPO). 2) Add two discriminators that look at hidden states: one for good responses, one for bad. 3) Train them to tell reference vs. policy hidden patterns apart (in a relativistic way). 4) Nudge the policy to fool them, so its hidden states keep the good structure and avoid the bad. Why it matters: Words can change a lot while meaning stays the same; guiding hidden structure keeps meaning and reasoning steady, especially under noise.
Bottom Bread (Anchor): A singer coached on breath control (latent) performs better across all songs, not just one lyric sheet (tokens).
Multiple analogies for the same idea:
- Map analogy: Token-level rules are like checking street names only; latent regularization is also checking the map's shape so routes stay sensible even when roads are renamed.
- Sports analogy: Token-only training drills a specific play; latent regularization builds core strength and balance, so the athlete performs well even in messy, fast games.
- Baking analogy: Token checks are the icing pattern; latent checks are the sponge's texture. Good texture makes any decoration taste better.
Before vs. After:
- Before: Preference methods used token-level closeness (KL) as a guardrail, which can mistake different wordings for different meanings and encourage length tricks.
- After: GANPO adds a latent-space guardrail that preserves the model's internal geometry, making it robust to sampling noise and less likely to overfit to surface patterns.
Why it works (intuition, no equations):
- Distributions of hidden states capture semantics and structure. If we match those distributions between a policy and a reliable reference, the policy keeps a healthy internal manifold (its "thinking space").
- A discriminator can learn the differences between these manifolds without us writing an explicit probability formula. The policy learns to produce hidden states that the discriminator can't separate from the reference's, so the manifolds align.
- Using a relativistic objective (compare to the other side's average) stabilizes this game, avoiding oscillations.
- Splitting into two discriminators (good vs. bad) gives clearer, denser signals: imitate how the reference thinks for good answers; avoid how it thinks for bad ones.
Building blocks:
- Offline Preference Optimization (DPO/SimPO): the taste-test core loss from human preferences.
- Latent Representations: the internal summaries of meaning and reasoning.
- Two Discriminators: one focuses on the "good" manifold, one on the "bad" manifold.
- Relativistic Training: judge samples relative to the other side's average for better stability.
- Plug-and-Play: add the adversarial latent regularizer to any OPO pipeline with modest overhead.
Bottom Bread (Anchor): Think of a GPS that doesn't just follow street signs (tokens) but also keeps the overall map shape (latent) aligned with a trusted atlas, so it stays reliable even if you take a scenic, twisty route (high temperature sampling).
03 Methodology
High-level overview: Input (preference pairs) → Extract hidden states from policy and reference → Train two discriminators on hidden states (relativistic, good/bad) → Compute adversarial latent loss → Combine with standard OPO loss (e.g., DPO/SimPO) → Update policy.
Step-by-step like a recipe:
- Prepare the ingredients (data and models)
- What happens: Load a preference dataset with (prompt, chosen, rejected). Choose a policy model to train and a frozen reference model. Initialize two discriminators (for good and bad) that read hidden states.
- Why it exists: We need pairs to learn preferences, a stable anchor (reference), and critics (discriminators) to shape hidden structure.
- Example: Prompt: āWrite a polite email to reschedule.ā Chosen reply is concise and kind; rejected reply is curt and unclear.
- Extract latent features (hidden states)
- What happens: Run both policy and reference on (prompt + chosen) and (prompt + rejected). Grab the last-layer hidden states; these are the latent representations.
- Why it exists: Hidden states capture meaning and plan; they're the right place to align structure.
- Example: For the chosen reply, the reference's hidden state h_ref+ shows polite structure; the policy's h_θ+ should move toward that. For the rejected reply, h_ref- represents structure to avoid; h_θ- should move away.
- Train the positive discriminator (good manifold)
- What happens: The positive discriminator sees pairs like (reference good vs. policy good) and (policy good vs. reference bad). It learns to rank reference good highest, policy good next, and reference bad lower.
- Why it exists: It teaches the policy what "good structure" looks like in latent space, beyond surface words.
- Example: It learns that hidden patterns for courteous opening → clear reason → exact new time → kind closing are better than patterns missing these steps.
- Train the negative discriminator (bad manifold)
- What happens: The negative discriminator sees (reference bad vs. policy bad) and (policy bad vs. reference good). It learns to push policy bad below reference bad and recognize the gap to reference good.
- Why it exists: It shows what to avoid and sharpens the contrast, preventing the policy from drifting into sloppy patterns.
- Example: It flags patterns like abrupt tone → missing apology → vague timing as structurally poor.
- Use a relativistic average trick for stability
- What happens: Each discriminator compares a sample's score to the other side's average score, not in isolation.
- Why it exists: Relative judgments reduce noise and training wobble, making learning smoother.
- Example: Instead of asking "Is this sample good?", it asks "Is this better than the average fake/real sample?", a steadier ruler.
- Compute the adversarial latent loss for the policy
- What happens: Freeze discriminators for this step. The policy gets a loss that says "make your good hidden states look like the reference's good ones, and keep your bad hidden states unlike the reference's bad ones."
- Why it exists: This is the nudge that aligns internal structure while the main preference loss teaches taste.
- Example: The policy moves its hidden states so replies keep a polite blueprint even if words vary.
- Combine with the usual preference loss and update
- What happens: Compute the normal OPO loss (e.g., DPO/SimPO) plus λ times the adversarial latent loss (a code sketch of this combined step follows the mini example below). Update the policy. Alternately update discriminators and policy each step.
- Why it exists: We want both taste (preferences) and structure (latent geometry). The λ weight balances them.
- Example: If λ is too high, the model might cling too tightly to reference style; too low, and structure weakens.
- Practical choices that matter (secret sauce)
- Quad of hidden states: Use (ref good, ref bad, policy good, policy bad). This fully uses pairwise data and makes signals richer.
- Two discriminators: One for good, one for bad. This separates distinct manifolds cleanly.
- Reference as "real": Anchoring to the reference model keeps overlap between distributions, giving discriminators meaningful work and stable gradients.
- Transformer discriminator: A shallow Transformer reads sequence-level structure better than simple MLPs, boosting win rates (a sketch of such a critic follows this list).
- Modest overhead: The reference is already in OPO; discriminators are light; no online rollouts are needed.
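The paper's exact discriminator architecture is not reproduced here; the following is a minimal sketch of what a lightweight, sequence-level Transformer critic over hidden states could look like. The class name, layer sizes, and mean-pooling are illustrative assumptions, and padding masks are omitted for brevity.

```python
import torch.nn as nn

class LatentCritic(nn.Module):
    """Shallow Transformer that scores a sequence of hidden states.

    Reads last-layer hidden states of shape (batch, seq_len, hidden_dim)
    and returns one scalar score per sequence. Sizes are illustrative.
    """
    def __init__(self, hidden_dim=4096, model_dim=512, n_layers=2, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(model_dim, 1)

    def forward(self, hidden_states):
        x = self.encoder(self.proj(hidden_states))
        pooled = x.mean(dim=1)          # mean-pool over the token dimension
        return self.score(pooled).squeeze(-1)
```

Compared with pooling first and scoring with an MLP, letting the encoder attend across token positions is what gives the sequence-level feedback the article credits with the win-rate gains.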
Concrete mini example with actual data flow:
- Input: (x = "Explain photosynthesis simply.", y_w = clear, correct explanation; y_l = confusing, off-topic).
- Hidden states: Get last-layer states from policy and reference for (x, y_w) and (x, y_l). Now you have h_ref+, h_ref-, h_θ+, h_θ-.
- Train D_pos: Push scores so h_ref+ > h_θ+, and h_θ+ > h_ref-.
- Train D_neg: Push scores so h_ref- > h_θ-, and h_θ- < h_ref+.
- Policy step: Minimize OPO loss + λ·(adversarial latent loss), so h_θ+ drifts toward h_ref+'s structure and h_θ- avoids h_ref-'s structure.
- Outcome: The policy learns to keep a correct, simple explanatory blueprint even with varied wording or higher temperature sampling. The code sketch below walks through one such combined update.
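Below is a minimal, hedged sketch of one combined update, reusing the illustrative helpers sketched earlier in this article (dpo_loss, relativistic_avg_d_loss / relativistic_avg_g_loss, LatentCritic). It simplifies the recipe above: it omits the cross-pairs fed to each discriminator, the batch keys and the λ value are assumptions, and the sign choice for the "bad" side follows the article's description rather than the authors' exact implementation.

```python
import torch

def response_logprob_and_hidden(model, input_ids, attention_mask, response_mask):
    """Summed log-prob of the response tokens plus last-layer hidden states."""
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                output_hidden_states=True)
    logits = out.logits[:, :-1, :]                 # predict token t+1 from token t
    targets = input_ids[:, 1:]
    token_logp = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    seq_logp = (token_logp * response_mask[:, 1:].float()).sum(dim=-1)
    return seq_logp, out.hidden_states[-1]

def ganpo_step(policy, reference, d_pos, d_neg, batch, lam=0.1, beta=0.1):
    """One GANPO-style update on a batch of (chosen, rejected) pairs.

    Batch keys are hypothetical: *_ids / *_mask are token ids and attention
    masks for prompt+response; *_resp_mask marks the response tokens only.
    """
    pol_logp_w, h_pol_w = response_logprob_and_hidden(
        policy, batch["chosen_ids"], batch["chosen_mask"], batch["chosen_resp_mask"])
    pol_logp_l, h_pol_l = response_logprob_and_hidden(
        policy, batch["rejected_ids"], batch["rejected_mask"], batch["rejected_resp_mask"])
    with torch.no_grad():                          # frozen reference model
        ref_logp_w, h_ref_w = response_logprob_and_hidden(
            reference, batch["chosen_ids"], batch["chosen_mask"], batch["chosen_resp_mask"])
        ref_logp_l, h_ref_l = response_logprob_and_hidden(
            reference, batch["rejected_ids"], batch["rejected_mask"], batch["rejected_resp_mask"])

    # Critic step: reference hidden states act as "real", the policy's as "fake"
    # (detached so this loss only trains the two discriminators).
    d_loss = (relativistic_avg_d_loss(d_pos, real=h_ref_w, fake=h_pol_w.detach())
              + relativistic_avg_d_loss(d_neg, real=h_ref_l, fake=h_pol_l.detach()))

    # Policy step: taste (preference loss) + lambda * structure (latent loss).
    # Positive side: make good hidden states resemble the reference's good ones.
    # Negative side: push bad hidden states away from the reference's bad manifold.
    pref_loss = dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=beta)
    adv_loss = (relativistic_avg_g_loss(d_pos, real=h_ref_w, fake=h_pol_w)
                + relativistic_avg_d_loss(d_neg, real=h_ref_l, fake=h_pol_l))
    policy_loss = pref_loss + lam * adv_loss       # freeze critics when stepping this
    return d_loss, policy_loss
```

In a full loop, you would back-propagate d_loss into the critics and policy_loss into the policy with separate optimizers, alternating the two and freezing the discriminators during the policy update as described in step 6.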
What breaks without each step:
- No reference anchor: Discriminators saturate or chase style quirks of a too-strong teacher, causing instability.
- One discriminator only: Good and bad manifolds muddle together; signals get weaker.
- No relativistic averaging: Training can oscillate or collapse.
- Token-only regularization: Model may look fine greedily but collapse under noise, get verbose, or hack superficial patterns.
04 Experiments & Results
The test: The authors measure how well models follow preferences and keep structure under different conditions. They use AlpacaEval-2.0 for open-ended quality, IFEval for strict instruction following, and several downstream tasks (GSM8K, MMLU, ANLI, TruthfulQA) to check general skills. They also test robustness under higher sampling temperatures (more randomness), where weak structure tends to break.
The competition: Baselines are DPO and SimPO: popular, efficient offline preference methods that regularize in token space. GANPO plugs into both without changing their main loss, adding latent adversarial regularization.
The scoreboard with context:
- AlpacaEval-2.0: On Gemma2-2B-it and Llama-3-8B-Instruct, adding GANPO raises both win rate and length-controlled win rate by around 1–2 percentage points. Think of this as going from a solid B+ to an A- while keeping answer length similar; no cheating by just writing more.
- Robustness to noise (temperature tests): At higher temperatures (T ≥ 1.0), GANPO's advantage widens. Where DPO's performance drops, GANPO holds structure better, like a building with stronger framing in a windstorm.
- IFEval strict accuracy: As temperature rises, DPO's exact-format adherence falls sharply, but GANPO stays steadier. That means instruction constraints (like JSON formats or step-by-step structures) survive better under randomness.
- Discriminator vs. learned reward model: Under very noisy generation (T = 1.5–2.0), a separately trained reward model's correlation with a strong oracle collapses or inverts (reward hacking). The GANPO discriminator keeps a solid positive correlation, signaling it learned true structure, not brittle token tricks.
- Downstream tasks: GANPO matches or slightly improves math (GSM8K), reasoning (ANLI), and factuality (TruthfulQA), and stays comparable on knowledge (MMLU). This shows it doesn't overfit to the preference dataset at the expense of other skills.
Surprising findings:
- The discriminator's signal remains robust out-of-distribution, while a traditional reward model can be hacked at high entropy. This suggests latent, sequence-level feedback gives sturdier guidance than token-only or learned reward signals in noisy regimes.
- A lightweight Transformer discriminator beats simpler MLP or fixed critics, directly improving win rates, so sequence-aware structure really helps.
Compute costs in context:
- Training time increases modestly (minutes, not hours) compared to the base OPO runs, using the same GPUs. Since the reference is already part of OPO and the discriminators are small, the added overhead is practical.
Takeaway: In plain terms, GANPO is like reinforcing the model's skeleton (latent structure) so it stands tall not just in the lab (greedy decoding) but also on a bumpy road (high temperature). It wins more often, stays concise, and keeps formats together when things get noisy.
05 Discussion & Limitations
Limitations (be specific):
- Extra moving parts: GANPO adds discriminators and hyperparameters (like λ), making tuning a bit more complex than vanilla DPO/SimPO.
- Anchor quality matters: If the reference model's latent manifold is flawed, anchoring to it can limit exploration and carry over its weaknesses.
- Stability trade-offs: Although the relativistic setup helps, adversarial training can still be trickier than purely supervised objectives.
- Complementarity: Token and latent regularization likely help in different ways; the best recipe may combine both, which needs more study.
Required resources:
- A frozen reference model (already common in OPO).
- Two lightweight discriminator networks (ideally Transformer-based for sequence-level feedback).
- Usual OPO compute (A100/H200-class GPUs used in the paper) with a modest overhead.
When NOT to use:
- If you must have the simplest, most parameter-efficient setup and can't afford any extra components.
- If your reference model is known to be poorly aligned or domain-mismatched, since anchoring to it can mislead the structure.
- If your pipeline already uses strong online methods with reliable structure verifiers; the marginal gain may be smaller.
Open questions:
- Best blend: What's the ideal mix of token-level and latent-level regularizers for different tasks and scales?
- Smarter anchors: Could we adaptively update the reference or distill from multiple references without losing stability?
- Symbolic constraints: How well can we inject compiler checks or schema validators directly into latent losses to enforce strict formats?
- Online extension: Can a self-play version of GANPO (policy and discriminator co-evolve with fresh rollouts) deliver PPO-like gains without the heavy complexity?
- Multimodal alignment: How does latent adversarial regularization extend to vision-language models, where cross-modal structure matters even more?
06 Conclusion & Future Work
Three-sentence summary: GANPO adds a GAN-style latent-space regularizer to offline preference optimization so language models learn not only what people prefer but also how to keep their internal structure healthy. By aligning hidden states with a stable reference model using two relativistic discriminators (for good and bad), GANPO delivers higher win rates, less verbosity bias, and stronger robustness under noisy sampling. It achieves these gains with modest extra compute and without hurting downstream tasks.
Main achievement: Showing that adversarial latent regularization provides dense, structure-preserving feedback that closes key gaps left by token-only regularization in preference training.
Future directions: Combine token and latent regularizers; integrate symbolic/format verifiers into the latent loss; explore online self-play variants; extend to multimodal alignment where cross-modal structure is crucial.
Why remember this: It reframes alignment as guiding the model's hidden thinking, not just its surface words, leading to responses that stay clear, correct, and well-structured even when generation gets noisy.
Practical Applications
- Improve instruction-following assistants that must keep formats (like JSON) intact under higher sampling temperatures.
- Reduce verbosity bias in customer support chatbots so answers stay concise and helpful.
- Stabilize creative writing tools so tone and structure remain consistent even with diverse generation settings.
- Enhance educational tutors to keep step-by-step reasoning clear under noisy decoding.
- Boost code assistants' robustness to sampling, preserving syntactic structure and reducing invalid outputs.
- Strengthen long-form summarization models to maintain outline-level coherence under variable temperatures.
- Provide safer, more reliable healthcare or legal information assistants by preserving internal structure and reducing reward hacking.
- Aid enterprise deployment by adding a plug-and-play regularizer that works with existing OPO setups (DPO/SimPO).
- Support multimodal systems in the future by aligning text generation with structured latent guidance that can extend to images.