
Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs

Intermediate
Yu Liang, Zhongjin Zhang, Yuxuan Zhu et al. · 2/2/2026
arXiv · PDF

Key Summary

  • This paper proposes ReSID, a new way to turn items into short token codes (Semantic IDs) that are much easier for a recommender to predict.
  • Instead of starting from big language models, ReSID learns from the data that really drives recommendations: user behavior and structured item features.
  • It has two key parts: FAMAE (to learn what information is actually needed for recommendations) and GAOQ (to build codes that are easy to predict step-by-step).
  • FAMAE uses a hide-and-guess game over item fields to keep the most useful information for recommending the next item.
  • GAOQ assigns code numbers consistently across the whole system so the same number always means the same direction, which lowers uncertainty during decoding.
  • On ten Amazon datasets, ReSID beats strong baselines by over 10% on ranking metrics while cutting tokenization time by up to 122× compared to prior tokenizers.
  • Task-aware metrics from FAMAE reliably predict downstream performance, letting you know early if your embeddings are good.
  • ReSID shows that great generative recommendation does not require large language models when you align learning with the actual task.
  • The approach is information-theoretic at heart: it preserves what matters and reduces what confuses the model during sequence generation.

Why This Research Matters

Better recommendations feel like thoughtful help, not noise. ReSID makes suggestions more accurate by keeping the information that really predicts what you want next and by assigning codes so each decoding step is less confusing. It removes the need for giant language models, saving compute and money, while working well on very large catalogs. Faster tokenization (up to 122×) makes iteration and re-indexing practical in production. The task-aware metrics give early feedback, reducing trial-and-error. Overall, users get more fitting suggestions, platforms save resources, and the system scales more gracefully as catalogs and traffic grow.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you run a toy store. Kids come in and buy crayons, then coloring books, then stickers. If you could spot these patterns, you’d suggest the right next thing and make them very happy.

🥬 The Concept (Sequential Recommendation): It is a way to suggest the next item a user will want, based on the order of things they interacted with before.

  • How it works:
    1. Read the user’s past actions in time order.
    2. Learn patterns like “people who bought A often buy B next.”
    3. Predict the next likely item.
  • Why it matters: Without the sequence, you miss the story of the user’s journey and make worse guesses. 🍞 Anchor: After crayons and a coloring book, the next suggestion is sticker sheets instead of a random toy.
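
To make the "people who bought A often buy B next" idea concrete, here is a minimal, hypothetical sketch (not the paper's model) that counts which item most often follows another in a few toy histories:

```python
from collections import Counter, defaultdict

# Toy interaction histories (invented for illustration), in time order.
histories = [
    ["crayons", "coloring_book", "stickers"],
    ["crayons", "coloring_book", "stickers"],
    ["balloons", "snacks", "tableware"],
]

# Count how often item B directly follows item A across all users.
follow_counts = defaultdict(Counter)
for history in histories:
    for prev, nxt in zip(history, history[1:]):
        follow_counts[prev][nxt] += 1

def recommend_next(last_item, k=1):
    """Suggest the k items that most often followed `last_item`."""
    return [item for item, _ in follow_counts[last_item].most_common(k)]

print(recommend_next("coloring_book"))  # ['stickers']
```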

🍞 Hook: You know how a chef chops ingredients into small pieces so they cook evenly? Computers also chop information into small pieces to process it better.

🥬 The Concept (Tokenization): Tokenization breaks things (like items or text) into small symbols called tokens so a model can handle them step-by-step.

  • How it works:
    1. Take a big thing (an item with many properties).
    2. Turn it into a short sequence of tokens like [21, 3, 54].
    3. Use those tokens to predict the next tokens.
  • Why it matters: Without tokens, models would need to choose from billions of raw IDs at once, which is too hard and slow. 🍞 Anchor: Instead of guessing one item name out of millions, the model guesses a few small numbers in order.
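
As a toy illustration of the idea (the items and codes below are invented, not from the paper), a tokenizer maps each item to a short tuple of small integers, so the generator chooses from a small codebook at each step instead of the full catalog:

```python
# Hypothetical Semantic-ID lookup: real codes come from a trained tokenizer;
# these numbers are made up for illustration.
item_to_sid = {
    "crayons_24pack":    (21, 3, 54),
    "coloring_book_zoo": (21, 3, 17),
    "sticker_sheets":    (21, 7, 2),
    "garden_hose_50ft":  (88, 1, 40),
}

# Instead of one softmax over millions of raw item IDs, the generator predicts
# 3 small tokens, each drawn from a modest codebook (say 256 entries per level).
vocab_per_level = 256
print("choices per decoding step:", vocab_per_level)          # 256
print("items addressable by 3 tokens:", vocab_per_level ** 3) # 16,777,216
```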

🍞 Hook: Think about how you learn a friend’s taste: not just what the item is called, but what they actually like to buy together.

🥬 The Concept (Representation Learning): It teaches the computer to compress an item into a vector (a list of numbers) that captures what’s important for the task.

  • How it works:
    1. Collect signals (what users do, categories, brand, store).
    2. Learn a vector that preserves the patterns useful for prediction.
    3. Use that vector to make decisions.
  • Why it matters: Bad representations forget important clues, so later steps can’t recover them. 🍞 Anchor: “Crayons + Coloring Book” end up close together in the learned space even if their names are different.

🍞 Hook: When you pack for a trip, you squeeze clothes to fit the suitcase while keeping the outfits you need.

🥬 The Concept (Quantization): It turns big continuous vectors into a few discrete codes to save space and make step-by-step generation possible.

  • How it works:
    1. Group similar vectors.
    2. Assign each group a small code number.
    3. Represent each item by a short sequence of such codes.
  • Why it matters: If codes are messy or inconsistent, decoding becomes confusing and error-prone. 🍞 Anchor: Rolling your favorite T-shirts (good codes) beats a random pile (messy codes) when you need to pack fast.
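
Here is a minimal sketch of the general idea, using plain K-Means on toy vectors to build a coarse first code and a residual-based second code. This illustrates multi-level quantization in general, not ReSID's GAOQ, which is described later:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy item embeddings (8 items, 4 dimensions) standing in for learned vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 4))

# Level-1 code: cluster all vectors; the cluster index becomes the first token.
level1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Level-2 code: quantize the residual (vector minus its level-1 centroid),
# the usual way multi-level codes refine a coarse first token.
centroids = np.stack([vectors[level1 == c].mean(axis=0) for c in range(2)])
residuals = vectors - centroids[level1]
level2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(residuals)

for i in range(len(vectors)):
    print(f"item {i}: code = [{level1[i]}, {level2[i]}]")
```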

🍞 Hook: Reading a story one sentence at a time helps you guess what happens next.

🥬 The Concept (Autoregressive Modeling): The model predicts the next token by looking at previously generated tokens.

  • How it works:
    1. Start with the beginning of the code.
    2. Predict the next token given the prefix.
    3. Repeat until the full code is built.
  • Why it matters: If earlier tokens don’t narrow down good choices, the model gets confused and makes mistakes. 🍞 Anchor: If the first two code numbers already point to “party supplies,” the third is easier to guess as “balloons” than “wrench.”

The World Before: Many recent systems tried “semantic-centric” pipelines: use big language or vision models to embed items by meaning (like how items are described), then discretize those embeddings into tokens. This does shrink the search space. But it often misses the real driver of recommendations: collaborative signals—what people actually buy together. Snacks and balloons may be worlds apart in language or pictures, yet are party buddies in behavior.

The Problem: Two mismatches popped up. First, representation misalignment: embeddings tuned for semantic similarity don’t neatly match co-purchase or sequential patterns, even with fine-tuning. Second, quantization that ignores sequential predictability: some methods focus on reconstruction only; others assign child codes locally, so the same code index can mean different things under different parents. Both raise uncertainty when generating tokens.

Failed Attempts:

  • Pure semantic tokenizers (e.g., TIGER) compress well but don’t inject collaboration, so they miss behavioral patterns.
  • RQ-style quantizers focus on reconstruction loss but ignore how codes depend on prefixes, causing high decoding uncertainty.
  • Hierarchical K-Means gives a tree but assigns child indices locally, so index “1” under two different parents can mean totally different things—confusing the generator.
  • End-to-end joint learning (ETEGRec) ties everything together but destabilizes training because the targets (codes) keep shifting while the model is learning them.

The Gap: We need a recommendation-native pipeline that (1) learns what’s sufficient for predicting the next item and (2) builds codes that are both compact and easy to predict token-by-token, without relying on giant language models.

Real Stakes: In daily life, that means better shopping suggestions (party sets instead of random items), faster search on huge catalogs, cheaper training (no massive LLMs), and more stable systems that generate fewer off-target recommendations.

02Core Idea

🍞 Hook: You know how a good map shows only the roads you need and labels them consistently, so you never get lost—even at night?

🥬 The Concept (ReSID): It is a redesign of how we learn item representations and turn them into codes, so they keep what matters for recommendations and are easy to predict in sequence, all without large language models.

  • How it works:
    1. FAMAE learns exactly the information needed for next-item prediction by hiding item fields and asking the model to guess them using user history.
    2. GAOQ turns those learned vectors into short, globally consistent code sequences that lower uncertainty at each decoding step.
    3. A simple generative model then predicts codes token-by-token.
  • Why it matters: If you don’t protect task-relevant info early and make codes prefix-friendly later, the generator learns from noisy, confusing targets. 🍞 Anchor: The system learns that “snacks + balloons + tableware” often go together and encodes them with stable, aligned codes that are easy to generate.

The Aha! Moment (one sentence): If you preserve the information that truly predicts the next item and assign code numbers so the same number always points in the same direction everywhere, generative recommendation gets both smarter and faster.

Multiple Analogies:

  1. Airport signs: FAMAE decides which signs travelers really need (gates, baggage), and GAOQ makes sure sign numbers mean the same thing in every terminal.
  2. Recipe cards: FAMAE keeps only the steps you must follow to bake the cake; GAOQ numbers the steps consistently so any chef can cook without confusion.
  3. Lego sets: FAMAE sorts out essential pieces for this model; GAOQ standardizes pegs and colors so pieces click together predictably.

Before vs After:

  • Before: Embeddings learned for meaning, not behavior; codes assigned locally or only to minimize reconstruction error; decoding is uncertain and slow; results often trail strong sequential recommenders with side info.
  • After: Embeddings preserve collaboration-dominant, field-level signals; codes are globally aligned and prefix-predictable; decoding is more certain; results beat strong baselines without LLMs and tokenize much faster.

🍞 Hook: Picture a detective who asks the most telling questions first, then uses the answers to narrow suspects.

🥬 The Concept (Mutual Information): It measures how much one thing tells you about another. ReSID’s FAMAE objective increases the information the representation has about the item’s structured fields.

  • How it works:
    1. Hide some fields (like category or store) of the target item.
    2. Predict them from user history plus the remaining fields.
    3. The better you predict, the more task-relevant information you’re packing into the representation.
  • Why it matters: If the representation carries little information about the fields, quantization will discard key clues the generator needs. 🍞 Anchor: If the model can guess the target item’s category from history alone, it’s keeping powerful signals that help predict the next token.
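
For intuition, here is a small self-contained sketch (toy data, not from the paper) that computes the mutual information between two discrete sequences, showing how shared structure yields positive information while an unrelated pairing yields roughly zero:

```python
import numpy as np

def mutual_information(x, y):
    """I(X; Y) in nats for two equal-length discrete sequences."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy == 0.0:
                continue
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Toy data: does the last item's department tell us about the next item's?
last_dept = ["party", "party", "crafts", "crafts", "party", "crafts"]
next_dept = ["party", "party", "crafts", "party",  "party", "crafts"]
print(mutual_information(last_dept, next_dept))        # ~0.32 nats: informative
print(mutual_information(last_dept, next_dept[::-1]))  # 0.0: pairing destroyed
```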

Building Blocks:

  • Collaboration-dominant learning (FAMAE): Focuses on what user histories and structured fields jointly reveal about the next item.
  • Globally consistent coding (GAOQ): Makes sure code “3” at level 2 points in the same direction across all parents, so prefixes guide decoding reliably.
  • Task-aware diagnostics: Simple, early signals (two metrics) tell you if your embeddings will likely work well downstream.

Why It Works (intuition, not equations):

  • Step 1 packs in the information needed for the real job (predict the next item) instead of generic semantics that might conflict with behavior.
  • Step 2 shrinks choices at each token so the decoder has fewer ways to go wrong; consistent indices across the tree mean less prefix confusion.
  • Together, you both preserve what matters and reduce what misleads.

🍞 Anchor: With party items, FAMAE learns the party pattern; GAOQ gives them stable, shared subcodes so the generator confidently walks down the right path.

03Methodology

At a high level: User history and item fields → FAMAE learns field-aware, collaboration-dominant vectors → GAOQ turns vectors into short, globally aligned code sequences (SIDs) → A small generative model predicts those codes token-by-token.

Step A: Field-Aware Masked Auto-Encoding (FAMAE) 🍞 Hook: Think of a fill-in-the-blank quiz where you sometimes hide the category, sometimes the store, sometimes the item ID, and ask, “Can I still guess it from the user’s past?”

🥬 The Concept (FAMAE): It trains a transformer to predict masked item fields using the user’s sequence and the unmasked fields of the target, which makes the learned vector keep just the information needed for recommendations.

  • What happens:
    1. For the target position (the very last item in the input window), randomly hide K out of J fields (e.g., item ID, category levels, store ID).
    2. Replace each hidden field with its own special mask token (so the model knows which field is missing).
    3. Feed the whole sequence (history + masked target) into a bidirectional transformer.
    4. Ask the model to guess each hidden field from its final hidden state at the target position.
    5. Train it so that guessing becomes accurate.
  • Why this step exists: If we don’t tell the model exactly which fields to reconstruct, it might smash all signals together and forget important identities (like item vs category). Later, quantization would lose critical clues.
  • Example with actual data: Suppose the last interaction is a hidden item whose category3 and store ID are masked. The user history shows “crayons → coloring book → stickers.” FAMAE learns to predict category3 = “Party Crafts” and store = “CraftyCo,” forcing the representation to capture party-like patterns useful for the next purchase. 🍞 Anchor: Like a teacher who tests each chapter separately, FAMAE makes sure the model truly understands each field that matters for the final exam (the recommendation).
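
Below is a minimal sketch of the field-masking step only, with a hypothetical field layout and field-specific mask tokens; the actual FAMAE model feeds the masked target plus the user history into a bidirectional transformer and predicts each hidden field from its final hidden state:

```python
import random

# Hypothetical field layout for one target item (names and values are invented).
FIELDS = ["item_id", "category1", "category2", "category3", "store_id"]
target_item = {
    "item_id": "B00XYZ", "category1": "Toys", "category2": "Arts & Crafts",
    "category3": "Party Crafts", "store_id": "CraftyCo",
}

# One mask token PER FIELD, so the model always knows which field is missing.
MASK = {f: f"[MASK_{f.upper()}]" for f in FIELDS}

def mask_target_fields(item, k):
    """Hide k of the target item's J fields; return the model input and labels."""
    hidden = random.sample(FIELDS, k)
    model_input = {f: (MASK[f] if f in hidden else v) for f, v in item.items()}
    labels = {f: item[f] for f in hidden}   # the values FAMAE must reconstruct
    return model_input, labels

random.seed(0)
masked, labels = mask_target_fields(target_item, k=2)
print(masked)   # hidden fields show their field-specific mask token
print(labels)   # the reconstruction targets for this training example
```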

Secret in FAMAE: Two task-aware metrics

  • Metric 1 (full-field masking): Hide all target fields and measure how well the model predicts them from history alone. This checks collaborative predictability.
  • Metric 2 (single-field masking of item-ID): Hide only item ID and predict it. This checks discriminative semantic structure.
  • Why clever: These fast checks tell you early if embeddings are good enough for building strong SIDs, without retraining the whole pipeline.
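
A sketch of how the two diagnostics could be computed, assuming a hypothetical interface `predict_fields(history, visible_fields)` to a trained FAMAE model (the function name and signature are illustrative, not from the paper):

```python
# Assumed interface (illustrative): a trained FAMAE model wrapped as
# predict_fields(history, visible_fields) -> {field: predicted value}.
FIELDS = ["item_id", "category1", "category2", "category3", "store_id"]

def metric_full_mask(predict_fields, eval_set):
    """Metric 1: hide ALL target fields and predict them from history alone."""
    hits = total = 0
    for history, target in eval_set:
        preds = predict_fields(history, visible_fields={})   # nothing visible
        hits += sum(preds[f] == target[f] for f in FIELDS)
        total += len(FIELDS)
    return hits / total          # collaborative predictability

def metric_itemid_mask(predict_fields, eval_set):
    """Metric 2: hide only item_id and predict it from history + other fields."""
    hits = 0
    for history, target in eval_set:
        visible = {f: target[f] for f in FIELDS if f != "item_id"}
        preds = predict_fields(history, visible_fields=visible)
        hits += int(preds["item_id"] == target["item_id"])
    return hits / len(eval_set)  # discriminative semantic structure
```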

Step B: Globally Aligned Orthogonal Quantization (GAOQ) 🍞 Hook: Imagine every subway line uses the same color scheme in every station. Blue always means the same direction, so you never board the wrong train.

🥬 The Concept (GAOQ): It turns vectors into short multi-level codes while aligning code indices globally so the same index at a given level means the same direction everywhere, making decoding more predictable.

  • What happens (like a recipe):
    1. Level 1: Balanced K-Means groups all item vectors into b_1 clusters and assigns the first code.
    2. Level 2+: For each parent cluster, split into b_l child clusters (balanced K-Means again).
    3. Center each child centroid by subtracting its parent centroid (so all parents align to a common origin).
    4. Create a small set of approximately orthogonal anchor directions shared across all parents at this level.
    5. Use one-to-one matching (Hungarian algorithm) to assign each child cluster to an anchor, so index k means the same direction everywhere.
    6. Repeat for deeper levels until codes are short but precise.
  • Why this step exists: If child indices are assigned locally and randomly, the same code number can mean different things under different prefixes, confusing the autoregressive model and raising uncertainty.
  • Example with actual data: Items co-bought for parties (snacks, balloons, tableware) get second-level codes that align to the same anchor direction across parents; a “vase” gets a different one. Now the second token consistently narrows to “party supplies” rather than scattering meaning. 🍞 Anchor: Like standardizing road signs across cities, GAOQ makes sure code 3 at level 2 always points to the same road, no matter where you start.
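
The sketch below illustrates the index-alignment idea on toy data, with simplifications: plain K-Means stands in for balanced K-Means, and a random orthonormal basis stands in for the paper's shared, approximately orthogonal anchors. The Hungarian matching step is what makes index k point in the same direction under every parent:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

# Toy item embeddings and small branching factors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16))
b1, b2 = 4, 4

# Level 1: cluster everything; the cluster index is the first code.
km1 = KMeans(n_clusters=b1, n_init=10, random_state=0).fit(vectors)
level1 = km1.labels_

# Shared level-2 anchors: an orthonormal basis reused by EVERY parent cluster
# (a random QR factorization stands in for the paper's anchor construction).
anchors = np.linalg.qr(rng.normal(size=(16, b2)))[0].T   # b2 directions, 16-d each

level2 = np.zeros(len(vectors), dtype=int)
for parent in range(b1):
    idx = np.where(level1 == parent)[0]
    km2 = KMeans(n_clusters=b2, n_init=10, random_state=0).fit(vectors[idx])
    # Center child centroids on their parent so all parents share a common origin.
    child_dirs = km2.cluster_centers_ - km1.cluster_centers_[parent]
    # Hungarian matching: assign each child cluster to the most similar anchor,
    # so index k means the same direction no matter which parent you are under.
    cost = -(child_dirs @ anchors.T)                      # maximize similarity
    child_rows, anchor_cols = linear_sum_assignment(cost)
    remap = dict(zip(child_rows, anchor_cols))
    level2[idx] = [remap[c] for c in km2.labels_]         # globally aligned code
```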

Step C: Generative Modeling on SIDs 🍞 Hook: Reading the first few lines of a story helps you predict the next line.

🥬 The Concept (Autoregressive SID generation): A small encoder–decoder predicts the SID of the next item one token at a time, using the user’s past SIDs as context.

  • What happens:
    1. Encode the history of SIDs.
    2. Predict the first token of the next item’s SID.
    3. Use the prefix to predict the next token, and so on, until the full SID is generated.
  • Why this step exists: Predicting items directly from billions of IDs is too hard; predicting a few small tokens is tractable and aligns with how sequences are modeled.
  • Example: After seeing SIDs for party-ish items, the model predicts a level-1 code near “home/party,” a level-2 code aligned to “party supplies,” then finishes with a disambiguating final code. 🍞 Anchor: It’s like guessing the next word after “Once upon a…”—the prefix makes the next choice easier.
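
Here is a minimal greedy-decoding sketch; `next_token_logits` is a toy stand-in for the real encoder–decoder (the actual system scores tokens with a trained model and typically uses beam search):

```python
import numpy as np

CODEBOOK_SIZE, SID_LEN = 8, 3

def next_token_logits(history_sids, prefix):
    """Toy stand-in scorer: votes for the token most common at this position
    in the history. A real model conditions on the history AND the prefix."""
    position = len(prefix)
    votes = np.zeros(CODEBOOK_SIZE)
    for sid in history_sids:
        votes[sid[position]] += 1
    return votes

def generate_sid(history_sids):
    """Predict the next item's SID one token at a time (greedy for brevity)."""
    prefix = []
    for _ in range(SID_LEN):
        logits = next_token_logits(history_sids, prefix)
        prefix.append(int(np.argmax(logits)))
    return tuple(prefix)

history = [(2, 5, 1), (2, 5, 3), (2, 6, 0)]   # SIDs of the user's past items
print(generate_sid(history))                  # (2, 5, 0) for this toy history
```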

The Secret Sauce:

  • FAMAE preserves “what matters” for the task by supervising each field separately at the last position (aligned with next-item prediction).
  • GAOQ reduces decoding uncertainty by making each level’s indices globally consistent, so prefixes reliably narrow the search.
  • Combined, they shift the pipeline from semantic-centric to recommendation-native, which both improves accuracy and slashes tokenization cost (up to 122× faster than a strong learnable tokenizer).

04Experiments & Results

🍞 Hook: If two basketball teams play ten games and one wins by double digits every time, you don’t need a microscope to see who’s better.

🥬 The Concept (The Test): The authors measured how well different systems recommend the next item using Recall@K and NDCG@K on ten Amazon datasets (e.g., Musical Instruments, Video Games, Baby Products, Books). They also timed how long tokenization takes.

  • How it works:
    1. Train each method on user histories with standard splits.
    2. Evaluate top-5 and top-10 ranking quality (Recall and NDCG).
    3. Compare against both classic sequential models and recent SID-based generative models.
    4. Measure runtime for the quantization/tokenization stage.
  • Why it matters: These metrics reflect how often the right item appears near the top, and runtime tells you if it’s practical at scale. 🍞 Anchor: It’s like seeing if the right birthday present shows up in the top 10 suggestions, and whether your system can prepare those suggestions in minutes instead of hours.
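
For reference, with a single ground-truth next item these metrics reduce to simple formulas; a small sketch with toy data:

```python
import numpy as np

def recall_at_k(ranked, target, k):
    """1 if the true next item appears in the top-k recommendations, else 0."""
    return int(target in ranked[:k])

def ndcg_at_k(ranked, target, k):
    """With one relevant item, NDCG@K is 1/log2(rank + 1) for a hit, else 0."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1        # 1-based position of the hit
        return 1.0 / np.log2(rank + 1)
    return 0.0

# Toy example: the true next item is "balloons", ranked second by the model.
ranked = ["snacks", "balloons", "tableware", "vase", "wrench"]
print(recall_at_k(ranked, "balloons", 5))             # 1
print(round(ndcg_at_k(ranked, "balloons", 5), 3))     # 0.631
```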

The Competition:

  • Sequential recommenders (IDs only): HGN, SASRec, BERT4Rec, S³-Rec.
  • Sequential recommenders with structured features (fairer): HGN*, SASRec*, BERT4Rec*, S³-Rec*.
  • SID-based generative: TIGER, LETTER, EAGER, UNGER, ETEGRec.

The Scoreboard with Context:

  • Overall, ReSID consistently achieves the best results across all ten datasets and all reported metrics.
  • Against the best prior SID tokenizer (LETTER), ReSID averages +16.0% / +13.8% on Recall@5/10 and +16.2% / +14.9% on NDCG@5/10—a clear, durable lead (like going from a B to a solid A on every quiz).
  • Importantly, even when strong sequential baselines are allowed to use the same structured features (the fair setup), ReSID still wins. This shows the gains are from better tokenization design, not just extra side info.
  • Tokenization efficiency: ReSID’s GAOQ is 77×–122× faster than LETTER and about 5× faster than TIGER in the quantization stage on million-scale datasets. That’s the difference between waiting a school day vs a coffee break.

Ablations (what matters most):

  • Replace FAMAE with LLM/text embeddings or with standard sequence encoders? Performance drops. This validates that field-aware, last-position masking preserves the right task information.
  • Replace GAOQ with RQ-VAE or Hierarchical K-Means with local indexing? Performance drops. This shows global alignment of indices is crucial for lower decoding uncertainty.

Surprising Findings:

  • End-to-end combined learning (ETEGRec) underperforms ReSID despite being tightly coupled. The reason: when the codes themselves shift during training, the generator’s target keeps moving, hurting stability.
  • Task-aware FAMAE metrics (Metric 1: all fields masked; Metric 2: only item ID masked) correlate strongly with downstream Recall@10. These act like early report cards that predict final grades.
  • Scaling trend: As models get bigger (within tested ranges), ReSID improves more favorably than a semantic-centric tokenizer, suggesting the interface (good codes) lets extra capacity pay off.

🍞 Anchor: Imagine two organizers: one labels boxes inconsistently and needs hours to pack; the other uses the same labels across all rooms and finishes quickly. ReSID is that second organizer—and the packed boxes (codes) make unpacking (decoding) easy too.

05Discussion & Limitations

🍞 Hook: Even superheroes have kryptonite; knowing the limits keeps you safe and smart.

🥬 The Concept (Limitations and Practicalities): ReSID is strong, but not magic. Here’s the honest view.

  • Limitations:
    1. Diagnostics for GAOQ: While we have clear, task-aware metrics for FAMAE embeddings, we lack equally principled, easy-to-compute health checks for GAOQ beyond performance and overlap analyses.
    2. Training speed of generative models: SID-based generators typically converge slower than classic item-ID models like SASRec; expect longer training times even though tokenization is much faster.
    3. Data dependency: The method relies on meaningful structured fields (e.g., multi-level categories, store IDs). In domains with very sparse or low-quality fields, benefits may shrink.
    4. Hyperparameter sensitivity: Branching factors per level trade off capacity and predictability; poor choices can slightly hurt performance until tuned.
  • Required Resources:
    1. A modest transformer for FAMAE (similar to lightweight sequential models).
    2. Compute for balanced K-Means and matching (fast and non-neural), scalable to millions of items.
    3. A small encoder–decoder for SID generation.
  • When NOT to Use:
    1. Tiny catalogs where item-ID models are already trivial and fast.
    2. Domains lacking reliable structured features or with extremely volatile item taxonomies.
    3. Settings requiring real-time re-tokenization of rapidly changing items (unless GAOQ is batched offline).
  • Open Questions:
    1. Can we design quick, theory-backed GAOQ metrics that predict decoding uncertainty before training the generator?
    2. How do we best schedule or warm-start the SID generator to speed up convergence while preserving stability?
    3. Can we adapt GAOQ to handle cold-start items or dynamic catalogs with minimal re-clustering?
    4. How far can we push parallel decoding or partial prefix prediction with globally aligned codes?

🍞 Anchor: Think of ReSID as a sturdy bicycle tuned for long rides; it’s fast and dependable on most roads, but you still need to check the tires (metrics) and choose the right gears (branching factors) for the terrain.

06Conclusion & Future Work

Three-Sentence Summary:

  • ReSID rethinks how we learn and discretize item representations for generative recommendation, focusing on preserving task-relevant information and lowering sequential uncertainty—without large language models.
  • Its two parts—FAMAE for collaboration-dominant, field-aware learning and GAOQ for globally aligned, prefix-friendly codes—work together to produce compact, predictable Semantic IDs.
  • Across ten datasets, ReSID beats strong baselines by over 10% and makes tokenization up to 122× faster, while offering simple, task-aware diagnostics for embedding quality.

Main Achievement: Showing that a recommendation-native, information-theoretic design of both representation learning and code assignment can surpass semantic-centric pipelines and even strong sequential baselines augmented with side information.

Future Directions:

  • Develop fast, principled GAOQ health metrics that forecast decoding uncertainty.
  • Speed up SID generator convergence via curriculum learning, better initialization, or hybrid decoding strategies.
  • Extend to dynamic catalogs with incremental re-quantization and anchor adaptation for cold-start.
  • Explore richer field sets (attributes, price bands, freshness) and automatic weighting per field.

Why Remember This: ReSID demonstrates that when you keep only what truly predicts the next choice and label it consistently everywhere, generative recommenders get both sharper and leaner. It turns tokenization from a semantic craft into a task-aligned science, proving you don’t need giant language models to get state-of-the-art results.

Practical Applications

  • Build a generative recommender for a large e-commerce site using SIDs instead of item IDs to reduce search space.
  • Pretrain item embeddings with FAMAE on structured fields (e.g., multi-level categories, store) to preserve task-relevant signals.
  • Quantize item embeddings with GAOQ to create globally aligned, short code sequences that lower decoding uncertainty.
  • Use the two FAMAE metrics during training to decide when embeddings are good enough before training the generator.
  • Tune GAOQ branching factors to balance capacity (reconstruction) and predictability (decoding) for your catalog size.
  • Migrate from a semantic-centric tokenizer (e.g., RQ-VAE) by swapping in GAOQ while keeping your existing generator.
  • Speed up re-tokenization for periodic catalog refreshes, leveraging GAOQ’s non-neural, fast pipeline.
  • Deploy beam search with shallow beams thanks to more predictable codes, cutting inference latency.
  • Use field-specific mask tokens to keep field identities crisp, which improves downstream quantization stability.
  • Monitor SID overlap between targets and history to validate that collaborative structures survive discretization.
#Semantic IDs#Generative Recommendation#Representation Learning#Quantization#Autoregressive Decoding#Mutual Information#Masked Autoencoding#Balanced K-Means#Hungarian Matching#Global Alignment#Sequential Recommendation#Information Theory#Vector Quantization#Collaborative Signals#Tokenizer Efficiency