PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards
Key Summary
- This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.
- The key trick is to align text and video features using Optimal Transport (OT) so rewards are measured in a space that matches how real videos look and move.
- There are two rewards: a Quality Reward that checks overall realism and motion, and a Semantic Reward that checks whether each important word appears in the right place and time.
- The Semantic Reward uses a partial matching plan so it ignores unimportant words and locks onto the key ones (like 'red apple' or 'brown beret').
- This approach, called PISCES, beats both annotation-free and annotation-based baselines on VBench and in human preference studies.
- It works with both direct backpropagation and reinforcement learning (GRPO), so it is flexible to train.
- Ablation studies show OT alignment is crucial: it preserves structure while bringing text closer to the real-video distribution.
- The method improves temporal coherence (less flicker), photorealism, object counts, attributes, and actions.
- Training is efficient and does not require building costly human preference datasets.
- This offers a scalable path to better text-to-video generation while maintaining strong alignment with prompts.
Why This Research Matters
PISCES provides a practical way to make AI-generated videos both more realistic and more faithful to what people ask for, without needing expensive human labeling. This means creators can get better results faster, from educational clips to social content, with fewer resources. By aligning reward signals with how humans judge videos, it reduces common failure modes like off-by-one object counts or wrong colors. The method scales to long videos and different training strategies, making it adaptable to many systems. Because it is annotation-free, research teams can iterate more quickly and responsibly. In the long run, this helps build T2V tools that are more reliable, useful, and accessible to a wider community.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're directing a school play. You give the actors a script (the text), and they act it out on stage (the video). You want the acting to both look real and match the script exactly: no pirates showing up in a space story!
The Concept (Text-to-Video Generation): It's a type of AI that turns written prompts into moving pictures.
- How it works: (1) Read the text prompt, (2) imagine the scene, characters, and motions, (3) generate frames over time to make a video.
- Why it matters: Without good guidance, the video can look fake or ignore key parts of the prompt. Anchor: If the prompt says "a zebra and a bear walking by a lake" but the video shows only a dog, the model failed to follow the script.
Hook: You know how game coaches give players points for good moves so they keep improving?
The Concept (Reward-based Post-Training): It's an extra learning step after the model is trained, where the model gets reward signals for doing better.
- How it works: (1) Generate a video from a prompt, (2) score it with a reward, (3) adjust the model so future videos score higher (a minimal sketch of this loop follows below).
- Why it matters: Without good rewards, the model can't tell what to fix. Anchor: If a video gets more points when it matches the prompt and looks smooth, the model learns to produce more accurate, steadier videos next time.
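To make that loop concrete, here is a minimal sketch of a reward-maximization step in Python. The names `generator`, `reward_fn`, and `optimizer` are hypothetical placeholders for PyTorch-style objects; this illustrates the general recipe, not PISCES's actual training code.

```python
def post_training_step(generator, reward_fn, prompts, optimizer):
    """One schematic reward-maximization step (placeholder names, not the paper's code)."""
    videos = generator(prompts)            # (1) generate videos from the prompts
    rewards = reward_fn(videos, prompts)   # (2) score each video with a reward
    loss = -rewards.mean()                 # (3) lower loss = higher average reward
    optimizer.zero_grad()
    loss.backward()                        # works when the reward is differentiable
    optimizer.step()
    return rewards.detach()
```

PISCES plugs its two OT-aligned rewards into exactly this kind of loop, either through direct backpropagation or through GRPO.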
Hook: Think of two friends who speak different languages trying to describe the same picture. They need a shared dictionary.
The Concept (Vision-Language Models and Embeddings): A VLM turns text and video into vectors (embeddings) so they can be compared.
- How it works: (1) Encode the text into a vector, (2) encode the video into another vector, (3) compare how close they are.
- Why it matters: If the embedding spaces don't match well, comparisons are misleading. Anchor: If the text "brown beret and glasses" lands far away from the video showing those items, the model won't get credit even if the video is correct.
Hook: You know how two maps can have different scales: 1 inch on one map is 1 mile, and on another it's 2 miles? If you don't align the scales, directions go wrong.
The Concept (Distributional Misalignment): Text and video embeddings often sit in spaces that don't line up statistically.
- How it works: Training objectives like contrastive learning don't guarantee that the "cloud" of text points overlaps the "cloud" of real-video points.
- Why it matters: Rewards measured in misaligned spaces can praise the wrong things or miss the right ones. Anchor: If your ruler is stretched, you'll measure everything wrong, even if you're looking at the right object.
Hook: Imagine sorting colored marbles from one tray to another while trying to move them in the cheapest way.
The Concept (Optimal Transport, OT): OT is a mathematical way to move one distribution (like text embeddings) to align with another (video embeddings) at minimal cost.
- How it works: (1) Define a cost for moving each point to another, (2) find the best overall transport plan or mapping, (3) transform one set to match the other's shape.
- Why it matters: Once aligned, similarity scores become meaningful and human-like. Anchor: After aligning, the text vector for "red apple" lands where real red-apple videos live, so the model gets proper credit for showing a red apple (a small numerical sketch follows below).
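For intuition, here is a tiny, self-contained example of entropic OT solved with Sinkhorn iterations. It only shows what a transport plan looks like on toy data; the cost values, mass vectors, and regularization strength are made up for the example and are not the paper's settings.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.05, n_iters=200):
    """Entropic OT: a soft plan moving row mass a to column mass b at low cost."""
    K = np.exp(-cost / reg)              # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):             # alternating Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the transport plan P

# Toy data: 3 "text" points and 3 "video" points with uniform mass.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.9, 0.2, 0.7],
                 [0.8, 0.7, 0.1]])
a = np.ones(3) / 3
b = np.ones(3) / 3
print(sinkhorn_plan(cost, a, b).round(2))  # mass concentrates on the cheap matches
```

PISCES uses this idea at two scales: a learned map that aligns whole distributions of embeddings, and a token-level plan that matches words to video patches.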
- The World Before: Text-to-video models were improving fast, but aligning them to human judgment remained hard. Teams either collected large human preference datasets (expensive and slow) or used automatic rewards from VLMs (scalable but inaccurate due to misalignment).
- The Problem: Annotation-based methods don't scale; annotation-free methods use embeddings that don't match real-video distributions, leading to shaky video quality and off-prompt content.
- Failed Attempts: Simple cosine rewards on raw VLM features and naive mapping losses (L2, KL) didn't fix the deep distribution gap; attention maps were diffuse and missed token-level grounding.
- The Gap: A principled way to align reward signals to human judgments, globally for quality and locally for semantics, without human labels.
- Real Stakes: Better educational videos, clearer instruction clips, safer robotics simulations, and less time and money spent labeling data.
02 Core Idea
Hook: You know how noise-canceling headphones make the world sound clearer by matching the noise pattern and removing it?
The Concept (Key Insight): Align the text and video embedding spaces using Optimal Transport, then compute two rewards (one for overall quality and one for word-by-word grounding) so the model learns exactly what humans care about, without human labels.
- How it works: (1) Use OT to map text embeddings into the real-video manifold, (2) compute a global Quality Reward by comparing aligned text with generated video, (3) compute a token-level Semantic Reward with a smart partial matching plan that respects meaning, time, and space, (4) fine-tune the generator to maximize these rewards.
- Why it matters: Without alignment, rewards point in the wrong direction; with alignment, they become reliable guides. Anchor: After alignment, when you ask for "a zebra and a bear," the model stops drawing one animal or the wrong stripes; it draws both, correctly.
Multiple Analogies:
- Mapmaking: First fix the map's scale and orientation (distributional OT), then draw precise street-to-street correspondences (token-level OT).
- Orchestra: Tune the whole orchestra so it sounds harmonious (quality), then make sure each instrument plays its exact notes at the right time (semantic).
- Grocery Matching: First move all fruit bins to the right aisle (distribution), then match each apple to the right label: Granny Smith here, Fuji there (tokens).
Before vs After:
- Before: Rewards judged videos using misaligned features, like grading essays with a blurry rubric.
- After: Rewards are computed in an OT-aligned space, so "good" and "on-prompt" are measured the way a human would judge them.
Why It Works (Intuition):
- Global Quality: The [CLS] (whole-video summary) of aligned text becomes a proxy for real videos, so cosine similarity measures whether the generation points in a "realistic, coherent direction."
- Local Semantics: Partial OT links only the important words to the right spatio-temporal video patches, avoiding noisy, forced matches and making the reward selective, the way a human grader is.
Building Blocks:
Hook: Think of bending a wire frame until two shapes overlap. The Concept (Distributional OT Map T*): A neural OT map transforms text embeddings to sit on the real-video manifold.
- How it works: Train a transport map and a potential function so moving text to video space is cheapest while preserving structure.
- Why it matters: Now quality comparisons are apples-to-apples. Anchor: "Brown beret" text lands near videos that actually show brown berets.
Hook: Picture snapping Lego pieces only where they truly fit. The Concept (Token-level Partial OT Plan P*): A plan that softly matches each meaningful text token to the right video patches, guided by meaning, time, and place.
- How it works: Build a cost that mixes semantic distance, frame timing, and spatial location; solve with an entropic Sinkhorn algorithm to get a sparse, selective match.
- Why it matters: Prevents tokens like "the" from polluting matches, and enforces "when" and "where." Anchor: The token "glasses" attaches to the face region in the right frames, not the background.
Hook: Imagine two scoreboards, one for overall performance and one for following instructions. The Concept (Dual Rewards): One global OT-aligned Quality Reward plus one token-level OT-aligned Semantic Reward.
- How it works: Compare [CLS] embeddings for quality; use a Video-Text Matching head over POT-refined features for semantics.
- Why it matters: Boosts both realism and prompt faithfulness together. Anchor: Videos look smoother and match details like "two zebras," not one.
03 Methodology
At a high level: Prompt → Encode text/video → Align with OT → Compute Dual Rewards → Update generator (direct or RL) → Better videos.
Step A: Prepare Embeddings
- What happens: The system encodes the text prompt and the generated video into embeddings using a pre-trained video-language model (e.g., InternVideo2).
- Why it exists: We need a common numeric language to compare text and video.
- Example: The sentence "A rabbit reads the newspaper" becomes a sequence of text tokens; the video becomes patch tokens across frames plus a [CLS] summary (see the shape sketch below).
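To picture the data involved, here is a hedged sketch of the tensor shapes such an encoder typically produces. The dimensions and variable names are assumptions chosen for illustration, not InternVideo2's actual interface.

```python
import torch

# Hypothetical shapes for illustration only (not the real InternVideo2 API).
B, L, F, P, D = 1, 12, 8, 196, 768   # batch, text tokens, frames, patches per frame, embed dim

text_tokens = torch.randn(B, L, D)       # one embedding per word/subword of the prompt
text_cls    = torch.randn(B, D)          # [CLS] summary of the whole prompt
video_patch = torch.randn(B, F * P, D)   # one embedding per spatio-temporal video patch
video_cls   = torch.randn(B, D)          # [CLS] summary of the whole video
```

The Quality Reward compares the two [CLS] vectors, while the Semantic Reward works on the token-by-patch grid.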
Hook: Think of reshaping a glove so your hand fits perfectly. The Concept (Distributional OT Map for Quality): Learn a neural OT map T* that moves text embeddings onto the real-video distribution.
- How it works (recipe): (1) Define a transport cost (distance between text and video embeddings), (2) train a mapping T and a potential f with NOT to minimize cost while preserving structure, (3) freeze T* for reward computation.
- What breaks without it: Quality signals measured in misaligned space reward the wrong things. Anchor: After T*, the text summary [CLS] sits where real, high-quality videos live (a training-loop sketch follows below).
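The following is a rough sketch of how such a neural OT map could be trained, written in the max-min style of Neural Optimal Transport (a transport map T played against a potential f). The network sizes, learning rates, and cost function are assumptions; the paper's exact NOT objective and architecture may differ.

```python
import torch
import torch.nn as nn

D = 768                                                            # embedding dimension (assumed)
T = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))     # transport map: text -> video space
f = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, 1))     # potential on the video side
opt_T = torch.optim.Adam(T.parameters(), lr=1e-4)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)

def transport_cost(x, y):
    return ((x - y) ** 2).sum(dim=-1)                              # squared Euclidean cost (assumed)

def not_step(text_emb, video_emb):
    """One alternating update of a max-min neural OT objective (illustrative only)."""
    # T tries to move text cheaply while scoring high under the potential.
    mapped = T(text_emb)
    loss_T = (transport_cost(text_emb, mapped) - f(mapped).squeeze(-1)).mean()
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()

    # f tries to separate mapped text from real-video embeddings.
    loss_f = f(T(text_emb).detach()).squeeze(-1).mean() - f(video_emb).squeeze(-1).mean()
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```

Once trained, T* is frozen and used only for reward computation, as described in Step B.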
Step B: Compute OT-aligned Quality Reward
- What happens: Compute cosine similarity between T*(text [CLS]) and the video's [CLS]; higher means better quality and consistency.
- Why it exists: Cosine of aligned embeddings is a robust indicator of realism and temporal coherence.
- Example: A smooth, photorealistic city scene will score closer to real-video-aligned text than a flickery, off-style one (a minimal reward sketch follows below).
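A minimal sketch of this reward, assuming `T_star` is the frozen OT map and both [CLS] vectors come from the encoder; the function name and signature are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def quality_reward(text_cls, video_cls, T_star):
    """OT-aligned Quality Reward sketch: cosine similarity in the aligned space."""
    aligned_text = T_star(text_cls)                              # move text [CLS] onto the video manifold
    return F.cosine_similarity(aligned_text, video_cls, dim=-1)  # higher = more realistic and coherent
```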
Hook: Matching words to the right places is like placing stickers exactly where they belong in a picture book. The Concept (Token-level Partial OT with Spatio-temporal Costs): Find a plan P* that aligns key text tokens to the most relevant video patches in the right frames and locations.
- How it works (recipe): (1) Build a cost matrix = semantic distance + time penalty + space penalty, (2) solve a partial entropic OT (Sinkhorn) to transport only the important mass (e.g., 90%), (3) produce a sparse, meaningful alignment plan.
- What breaks without it: Vanilla attention spreads focus too thin; tokens can land on irrelevant patches. Anchor: The token "reads" aligns to frames where the rabbit's head and newspaper move together, not random background tiles (a partial-OT sketch follows below).
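One simple way to realize partial OT is to add a dummy "absorbing" column that soaks up the mass we choose not to transport; the sketch below uses that trick with plain Sinkhorn iterations. The cost matrix is assumed to already combine the semantic, temporal, and spatial penalties, and the keep ratio, regularization, and dummy construction are illustrative assumptions, not the paper's exact solver.

```python
import torch

def partial_sinkhorn(cost, keep_mass=0.9, reg=0.05, n_iters=100):
    """Partial entropic OT sketch: transport only `keep_mass` of the token mass.

    cost: (L, N) combined semantic + time + space cost between L text tokens and N video patches.
    """
    L, N = cost.shape
    dummy = cost.max() * torch.ones(L, 1)           # escape column absorbing the untransported mass
    C = torch.cat([cost, dummy], dim=1)             # (L, N + 1)

    a = torch.full((L,), 1.0 / L)                   # row (token) mass
    b = torch.full((N + 1,), keep_mass / N)         # column (patch) mass
    b[-1] = 1.0 - keep_mass                         # leftover mass goes to the dummy column

    K = torch.exp(-C / reg)                         # Gibbs kernel
    u = torch.ones(L)
    for _ in range(n_iters):                        # standard Sinkhorn scaling
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return P[:, :N]                                 # drop the dummy column: the sparse partial plan P*
```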
Step C: Fuse P* with Attention and Score Semantics
- What happens: Combine the OT plan with vanilla cross-attention in log-space to get a refined attention map; pass features through a Video-Text Matching head; take the positive logit as the Semantic Reward.
- Why it exists: Keeps gradients through attention while using P* as a structural prior for precise grounding.
- Example: For "brown beret," the refined attention highlights the head region in the correct frames; the VTM head gives a high semantic score (a fusion sketch follows below).
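Here is a hedged sketch of the log-space fusion, assuming the OT plan and the cross-attention logits share the same token-by-patch shape; the mixing weight `alpha` and the epsilon are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def fused_attention(attn_logits, P_star, alpha=1.0, eps=1e-8):
    """Use the partial OT plan as a structural prior on cross-attention.

    attn_logits: (L, N) raw cross-attention logits (text tokens x video patches).
    P_star:      (L, N) partial OT plan added in log-space, so gradients still
                 flow through the attention while P* sharpens the grounding.
    """
    fused_logits = attn_logits + alpha * torch.log(P_star + eps)
    return F.softmax(fused_logits, dim=-1)   # refined, OT-guided attention map

# The refined attention pools video features per token; a Video-Text Matching head
# scores the pooled features, and its positive logit serves as the Semantic Reward.
```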
Hook: Imagine learning to ride while holding a steadying hand, or by practicing sets and getting points. The Concept (Two Training Routes: Direct Backprop and RL/GRPO): Use the rewards to nudge the generator via (1) direct backpropagation combined with Consistency Distillation (CD), or (2) reinforcement learning (GRPO).
- How it works: (1) Direct: L = L_CD − R_OT-quality − R_OT-semantic; (2) RL: sample a group of videos, compute group-relative advantages from the rewards, and apply policy updates.
- What breaks without it: Without a stable objective (CD) or advantage normalization (GRPO), training can overfit or wobble. Anchor: Direct is like steady coaching; RL is like trying multiple takes and keeping the ones the audience scores best (a sketch of both routes follows below).
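To make both routes concrete, here is a hedged sketch: the direct objective combines the consistency-distillation loss with the two rewards (any weighting terms the paper may apply are omitted), and the RL route computes GRPO-style group-relative advantages. Names and shapes are illustrative.

```python
import torch

def direct_loss(loss_cd, r_quality, r_semantic):
    """Direct route sketch: minimize CD loss while maximizing both OT-aligned rewards."""
    return loss_cd - r_quality.mean() - r_semantic.mean()

def grpo_advantages(rewards, eps=1e-8):
    """RL route sketch: GRPO-style advantages, normalized within a group of samples.

    rewards: (G,) reward scores for G videos generated from the same prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled takes for one prompt; the best-scoring takes get positive
# advantages and are reinforced by the policy update, the worst get negative ones.
print(grpo_advantages(torch.tensor([0.62, 0.55, 0.71, 0.49])))
```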
Secret Sauce:
- Align first, reward second: OT moves text into the real-video manifold so similarity now respects true video structure.
- Be selective: Partial OT ignores unhelpful words and enforces when-where grounding.
- Dual signals: Global quality + local semantics improve both smoothness and on-prompt accuracy together.
Example with Data:
- Prompt: "A robot reaches for a red apple on the table."
- Quality: T*("robot reaches … table") vs video [CLS] → high if motion is smooth and realistic.
- Semantics: Tokens "robot," "red," "apple," "table" align to the correct patches/frames → high if the apple is truly red and on the table while the robot reaches.
04 Experiments & Results
The Test: The authors used VBench, which scores videos on two main axes: Quality (how real and smooth they look) and Semantics (how well they match the prompt, like objects, colors, actions, and relations). They tested both short-video (VideoCrafter2) and long-video (HunyuanVideo) setups and also ran human preference studies.
The Competition: PISCES was compared to strong baselines, including annotation-free systems like T2V-Turbo and T2V-Turbo-v2, plus annotation-based methods like VideoReward-DPO and VideoDPO that rely on human labels.
The Scoreboard (with context):
- On HunyuanVideo, PISCES achieved high Quality and Semantic scores on VBench, surpassing both annotation-free and annotation-based rivals. Think of it as scoring an A+ where others get B or B+.
- On VideoCrafter2, PISCES again led in Total, Quality, and Semantic scores, showing it helps in both short and long videos.
- In broader model comparisons, post-training HunyuanVideo with PISCES topped other popular T2V systems on automatic metrics.
Human Preference Study: With 400 prompts and 85 raters, PISCES was preferred across visual quality, motion quality, and semantic alignment. This confirms the numbers reflect what people actually like watching.
Ablations (What mattered most):
- Using OT was crucial. Without OT, both Quality and Semantics dropped, proving that aligning the space first makes the rewards truly informative.
- Quality Reward alone improved smoothness and photorealism.
- Semantic Reward alone improved object/action correctness and spatial relations.
- Together, they performed the best, confirming the two rewards are complementary.
Surprising Findings:
- OT not only brought text closer to video embeddings (higher Mutual KNN) but also preserved the internal ranking of text points (high Spearman), meaning it aligned without warping meaning.
- Partial OT with spatio-temporal constraints improved matching by over 8% compared to vanilla attention, showing the importance of being selective about which tokens to ground.
- Compared to simple mapping losses (L2 or KL), OT-based alignment produced more stable generations (e.g., changing only the apple's color from red to green without breaking the scene), indicating fewer sampling artifacts.
Training Efficiency: PISCES added modest training time compared to a strong baseline and avoided the huge effort of collecting human preference labels. It also supported faster inference after consistency distillation in the RL route by reducing denoising steps.
Takeaway: OT-aligned rewards provide a clean, scalable way to guide T2V models toward both realism and faithfulness to prompts, validated by metrics and by people's eyes.
05 Discussion & Limitations
Limitations:
- Dependence on the base VLM: If the pre-trained video-language encoder lacks fine-grained localization (e.g., tiny objects, subtle attributes), even a perfect OT plan can't fully fix it.
- Granularity ceilings: Token-level grounding is bounded by the resolution of video patches and the model's attention quality.
- Edge cases: Highly abstract or rare concepts may still challenge the encoder's representation.
Required Resources:
- GPUs for a multi-day post-training run (e.g., 8×A100 for a few days).
- A pre-trained T2V generator (e.g., HunyuanVideo or VideoCrafter2).
- A pre-trained VLM (e.g., InternVideo2) to extract embeddings and run the VTM head.
When NOT to Use:
- If you rely on an encoder known to be weak for your domain (e.g., medical or niche industrial videos) and can't swap it.
- If you need pixel-accurate spatio-temporal grounding (like precise segmentation) rather than patch-level alignment.
- If your compute budget cannot cover any post-training passes.
Open Questions:
- Can we jointly fine-tune the VLM with OT to further improve localization while keeping annotation-free training?
- How does OT-aligned reward interact with future motion modules and higher frame rates?
- Can we design curriculum schedules that vary partial mass or spatio-temporal penalties over training for even stronger alignment?
- How robust is performance across entirely different VLM families and domains (e.g., robotics, education, safety-critical)?
06 Conclusion & Future Work
3-Sentence Summary: PISCES is an annotation-free way to improve text-to-video models by aligning text and video embeddings with Optimal Transport, then training with two OT-aligned rewards: one for overall quality and one for token-level semantics. This dual-reward design fixes the core problem that prior annotation-free methods faced: misaligned embedding spaces that misguide rewards. Experiments and human studies show consistent boosts in smoothness, realism, and prompt faithfulness, often surpassing annotation-based baselines.
Main Achievement: Turning OT-aligned rewards into a practical training blueprint that scales, achieving human-aligned guidance without the cost and complexity of large preference datasets.
Future Directions: Strengthen fine-grained grounding by co-training the VLM, explore adaptive schedules for partial OT mass and spatio-temporal penalties, and test across more domains and encoders to further validate generality.
Why Remember This: It reveals that the quality of a reward depends on the space it lives in; by aligning that space with OT at both global and local levels, annotation-free alignment becomes not only possible but competitive, and often superior, at scale.
Practical Applications
- Generate accurate educational videos that match lesson scripts (e.g., science demonstrations).
- Create product demos where colors, counts, and actions match exact marketing copy.
- Produce storyboard previews that stay faithful to screenplays, including timing and locations.
- Simulate robotics tasks with precise object attributes for planning and training.
- Make accessibility videos that clearly follow narrated instructions and actions.
- Generate procedural tutorials (e.g., crafts, cooking) that match step-by-step text.
- Produce news or explainer clips that accurately reflect on-screen mentions (names, places).
- Support game cutscene prototyping aligned to narrative prompts without manual annotation.
- Refine long-form content (trailers, ads) with consistent style and object continuity.
- Build safer alignment checks for T2V systems without exposing annotators to sensitive data.