Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning
Key Summary
- The paper shows that making a model write a number as a sequence of digits and then grading the whole number at the end works better than grading each digit separately.
- They turn number prediction into a small game (an MDP) and use reinforcement learning (RL) to reward the model only after it finishes the entire number.
- This sequence-level reward fixes the mismatch between token-level training (cross-entropy on digits) and real regression goals (getting the final number right).
- Two simple RL recipes, ReMax and GRPO, are used to fine-tune models and consistently beat strong baselines on 100 tabular tasks and code-to-metric prediction.
- ReMax and GRPO improve accuracy, with ReMax being especially robust across different digit tokenizations.
- RL makes the model's output distribution sharper, which boosts single-sample accuracy and sampling efficiency but can reduce exploration.
- The approach works with both normalized tokenization and scientific-notation tokenizers (like IEEE), though unbounded tokenizers can still produce outliers.
- Overall, decoding-based regression plus sequence-level RL becomes a reliable, accurate way to do general-purpose numerical prediction.
- This matters for everyday predictions, like prices, wait times, or resource needs, because you care about the final number, not whether each digit was guessed well along the way.
Why This Research Matters
In real life, we care about the final number: how much time, cost, energy, or memory something will take. Training models to focus on the whole number instead of each digit makes predictions more accurate and more useful. This helps apps plan resources better (like bikes, compute, or budget), improves reliability when only one guess is allowed, and reduces weird failure cases caused by local token errors. The method also works with text and code, so it can predict metrics straight from natural inputs without custom feature engineering. As a result, companies and researchers can build simpler, more general systems that still hit strong accuracy targets.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how when you guess someone's age, being off by one year is fine, but being off by ten years is a big deal? What matters is the whole number, not how close you were on each digit.
🥬 Filling (The Actual Concept)
- What it is: This paper is about predicting numbers better by teaching AI to care about the whole final number instead of grading it digit by digit.
- How it works (story of the field):
- Before large language models (LLMs), people used classic regression tools like decision trees or Gaussian Processes. They were solid on structured data (tables) but didn't easily handle unstructured stuff like text or code.
- Deep nets added power, and newer ideas like histogram (Riemann) heads improved robustness by predicting probability over bins.
- With LLMs, a fresh idea appeared, decoding-based regression: turn a number into a small sequence of tokens (like 6 becomes <1><1><0> in base-2) and train the model to generate these tokens.
- Most models train with cross-entropy (CE), which treats each token like a separate category. But digits aren't just labels; they have order and size. Being off by 1 on the most significant digit can ruin the whole number.
- People tried to fix this by adding token-level distance losses (like NTL and DIST), but they still judged each token locally, not the whole number.
- Why it matters: If you grade per-digit, you can get a number that looks locally OK but is globally wrong (for example, getting the exponent wrong in scientific notation). For real tasks, like predicting energy use, wait times, or memory usage, only the final number matters.
🍞 Bottom Bread (Anchor) Imagine you're baking a cake: judging each ingredient separately (flour good, sugar fine) doesn't guarantee the cake tastes right. You need to taste the whole cake to know if it's good. That's the problem this paper tackles: taste (grade) the whole number.
Now, let's introduce the key ideas in the right order, each with a simple sandwich explanation.
- 🍞 You know how a teacher grades your entire math answer, not each pencil stroke? 🥬 Decoding-based Regression: It's a way to predict a number by generating it as a short sequence of tokens (digits/signs/exponent) instead of predicting it in one shot.
- How it works: (a) turn the target number into tokens; (b) use a decoder to predict tokens one by one; (c) detokenize to get a scalar; (d) if desired, sample several predictions and aggregate (mean/median).
- Why it matters: It lets LLMs do regression on text/code and leverages their strength in sequence modeling. 🍞 Anchor: Predicting 6 as <1><1><0> in base-2 and then converting it back to 6.
- 🍞 Imagine cutting a pizza into slices so friends can share easily. 🥬 Tokenization Strategies: These are ways to break a number into pieces that a model can generate.
- How it works: Normalized tokenization scales numbers to [0,1] then writes digits in base-B; scientific notation uses sign, mantissa, and exponent tokens (like IEEE or P10).
- Why it matters: Good tokenization keeps precision and avoids out-of-range confusion; bad choices can cause outliers or sensitivity to extremes. 🍞 Anchor: 0.75 becomes <1><1><0> in base-2 (1/2 + 1/4); 1.23e-2 becomes <+><1><2><3><E-2>.
- 🍞 Picture a game where you only get your score after you finish the level. 🥬 Token-level vs. Whole-number Grading: CE trains per token; real regression wants the final number right.
- How it works: CE ignores the order/size of digits; token-level distance helps a bit but still judges locally.
- Why it matters: Small token mistakes in the wrong place can cause huge numeric errors. 🍞 Anchor: Predicting 101 vs 200 for target 100: both may differ by a token, but 101 is clearly better. CE may not see that.
- 🍞 Think of a board game: your next move depends on where your piece is now. 🥬 Markov Decision Process (MDP): A way to model decision-making where each action changes the state.
- How it works: State = (features, tokens so far); Action = choose next token; Transition = append token; Reward = only after finishing the whole number.
- Why it matters: The MDP view turns number generation into a small game you can solve with RL; see the code sketch after this list. 🍞 Anchor: From <1><1> you choose <0> to finish <1><1><0>, which decodes to 6.
- 🍞 You know how you learn faster when your coach gives you feedback on your whole performance, not just each step? 🥬 Reinforcement Learning (RL): A way for models to learn by trying complete attempts and getting rewards.
- How it works: Generate a full number, compare it to the truth with a sequence-level score (like negative MSE), and adjust the model to make high-reward sequences more likely.
- Why it matters: RL aligns training with the real goal: accurate final numbers. 🍞 Anchor: Predict the full number, then get a reward proportional to how close you were.
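To make the MDP framing concrete, here is a minimal sketch in plain Python of number generation as a tiny episodic environment: the state is the digit prefix, each action appends a token, and the only reward arrives once the full number is formed. Names like `NumberMDP` are illustrative rather than the paper's API, and the reward here is just a negative squared error.

```python
# Minimal sketch of the number-generation MDP described above.
# Names (NumberMDP, step) are illustrative, not the paper's implementation.

class NumberMDP:
    def __init__(self, target, base=2, length=3):
        self.target = target        # true scalar y
        self.base = base            # digit base B
        self.length = length        # fixed number of digit tokens
        self.tokens = []            # state: digits emitted so far

    def step(self, digit):
        """Append one digit token; reward only when the number is complete."""
        assert 0 <= digit < self.base
        self.tokens.append(digit)
        done = len(self.tokens) == self.length
        if not done:
            return self.tokens, 0.0, done          # no intermediate reward
        # Detokenize: interpret the digits as a base-B integer.
        value = 0
        for d in self.tokens:
            value = value * self.base + d
        reward = -(value - self.target) ** 2        # sequence-level reward (negative squared error)
        return self.tokens, reward, done


env = NumberMDP(target=6, base=2, length=3)
for d in (1, 1, 0):                                 # emits <1><1><0>, which decodes to 6
    state, reward, done = env.step(d)
print(state, reward, done)                          # [1, 1, 0] 0 True
```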
02 Core Idea
🍞 Top Bread (Hook) Imagine a spelling bee where you only get a score after you say the entire word. If you miss a letter at the end, the whole result changes. That's how numbers work too: the full thing matters.
🥬 Filling (The Actual Concept)
- Aha! Moment (one sentence): Train the model to generate the whole number as a sequence and reward it only after the entire number is formed, so learning targets the true regression error instead of token-by-token guesses.
Multiple Analogies (3 ways):
- Cooking: You don't grade salt and sugar separately; you taste the final soup and give one score.
- Archery: Hitting near the bullseye is much better than hitting the outer ring, even if both shots "look similar" from far away.
- Jigsaw: A correct puzzle is judged when all pieces are placed; judging one piece at a time doesn't tell you if the overall picture is right.
Before vs After:
- Before: Cross-entropy on digits treated each token as a separate category. Token-level patches (like NTL/DIST) helped a bit but still missed the final numeric magnitude.
- After: Sequence-level RL uses a single reward at the end (like negative MSE), directly aligning training with real numeric goals. Result: better precision, robustness, and sampling efficiency.
Why It Works (intuition):
- Numbers have structure: some digits (like exponent or leading digits) change the value a lot. Token-level loss doesn't reflect that.
- A sequence-level reward measures closeness of the final scalar, so the model learns to care more about the important positions.
- RL with policy gradients increases the likelihood of whole, high-reward sequences. This sharpens the output distribution toward correct numbers and improves single-sample accuracy.
Building Blocks (each with a mini sandwich explanation):
- 🍞 You know how a chore chart gives you one star for finishing all tasks? 🥬 Sequence-level Rewards: One score after the whole number is produced (e.g., negative MSE between prediction and truth, possibly after normalization).
- How it works: Generate tokens → detokenize → compute reward on the scalar.
- Why it matters: It directly teaches "closeness" in number-space. 🍞 Anchor: Predict 101 vs 200 for target 100; reward favors 101 much more.
- 🍞 Think of choosing your next move in chess based on the board now. 🥬 Policy Gradient: A way to adjust a model's probabilities by pushing up actions that led to higher rewards.
- How it works: Sample sequences, compute rewards, subtract a baseline to reduce variance, and nudge token probabilities toward better outcomes.
- Why it matters: It efficiently learns from whole attempts without needing a differentiable reward. 🍞 Anchor: If greedy decoding scores 60 but your sample scores 75, you push toward that sampled choice.
- 🍞 You know how using the best version of your notes helps you improve faster? 🥬 ReMax: A lightweight policy-gradient method that uses the reward from greedy decoding as the baseline.
- How it works: Compare each sampled sequenceās reward to the greedy one; push probabilities up if better, down if worse.
- Why it matters: Simple, low variance, strong performance. 🍞 Anchor: If the greedy guess gets 0.70 R² and your sample gets 0.75, move toward that sample.
- 🍞 Imagine judging your run by how far above or below the team average you scored. 🥬 GRPO: Another policy-gradient method that groups several samples, normalizes by the group's mean and standard deviation, and clips ratios for stability.
- How it works: Sample G sequences; center and scale their rewards; use importance sampling/clipping.
- Why it matters: Can stabilize learning with multiple references, though sensitive to tokenization choices here. 🍞 Anchor: If your try was much better than the batch average, it gets a strong positive push.
🍞 Bottom Bread (Anchor) When asked to predict bike rentals, the RL-tuned model focuses on making the final number close to reality, not just getting each digit "kind of right." Tests show it beats strong baselines across many datasets.
03 Methodology
At a high level: Input → Encode features → Decode tokens step-by-step → Detokenize to a number → Compute a sequence-level reward → Policy-gradient update → Output improved predictions.
Step-by-step (with what/why/examples):
- Input and Encoding
- What happens: Take x (e.g., a tabular row or code snippet) and transform it into a vector representation φ(x) using an encoder (MLP for tables; T5Gemma encoder for code, kept frozen as in prior work).
- Why this step exists: The decoder needs a compact summary of the input to condition its token choices. Without it, the model can't connect inputs to outputs.
- Example: A bike-sharing row (temperature, humidity, day) is mapped to φ(x) ∈ R^256.
- Tokenization of Targets
- What happens: Turn the target number y into tokens.
- Normalized tokenization: scale to [0,1], then base-B digits for M steps.
- Scientific notation: sign + mantissa digits + exponent (e.g., IEEE/P10).
- Why this step exists: Decoding-based regression predicts sequences, not a single scalar, so y must be sequence-ready. Without good tokenization, precision or range can suffer.
- Example: y = 6 becomes <1><1><0> in base-2 with length 3; or 1.23e-2 becomes <+><1><2><3><E-2>.
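As a concrete illustration of normalized tokenization, here is a minimal Python sketch (function name and constants are illustrative, not the paper's implementation) that scales a target into [0,1] and emits a fixed number of base-B digits; scientific-notation tokenizers instead emit sign, mantissa, and exponent tokens.

```python
import numpy as np

def tokenize_normalized(y, y_min, y_max, base=2, length=8):
    """Sketch: scale y into [0, 1], then emit `length` base-`base` digits,
    most significant first (a greedy base-B expansion of the fraction)."""
    z = (y - y_min) / (y_max - y_min)
    z = float(np.clip(z, 0.0, 1.0 - 1e-12))   # keep strictly below 1 so every digit stays < base
    digits = []
    for _ in range(length):
        z *= base
        d = int(z)
        digits.append(d)
        z -= d
    return digits

print(tokenize_normalized(0.75, 0.0, 1.0, base=2, length=3))  # [1, 1, 0]
```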
- Autoregressive Decoding
- What happens: The decoder predicts the next token given φ(x) and the tokens so far, until reaching the fixed length (or end marker). Multiple candidates can be sampled at temperature 1.0.
- Why this step exists: It lets the model represent uncertainty and multimodality. Without autoregression, you'd lose the LLM's strengths on sequences.
- Example: From state (<1><1>) it must choose <0> or <1>; that single choice can swing the decoded number substantially.
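Below is a minimal PyTorch sketch of this sampling loop with a toy GRU decoder; the architecture, sizes, and names are illustrative stand-ins, not the paper's model. Tokens are drawn one at a time at temperature 1.0, conditioned on the feature embedding and the prefix so far.

```python
import torch

VOCAB = 2      # base-2 digit tokens <0>, <1>
LENGTH = 3     # fixed number of digit positions

class ToyDecoder(torch.nn.Module):
    """Toy autoregressive decoder: embeds the token prefix, concatenates the
    input features at every step, and predicts logits for the next digit."""
    def __init__(self, feat_dim=16, hidden=32):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB + 1, hidden)   # +1 for a <start> token
        self.rnn = torch.nn.GRU(hidden + feat_dim, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, VOCAB)

    def forward(self, phi_x, tokens):
        emb = self.embed(tokens)                              # (B, t, hidden)
        feat = phi_x.unsqueeze(1).expand(-1, emb.size(1), -1) # broadcast features over time
        out, _ = self.rnn(torch.cat([emb, feat], dim=-1))
        return self.head(out[:, -1])                          # logits for the next token

decoder = ToyDecoder()
phi_x = torch.randn(1, 16)                                    # stand-in for the encoder output
tokens = torch.full((1, 1), VOCAB)                            # start with the <start> token
for _ in range(LENGTH):
    logits = decoder(phi_x, tokens)
    next_tok = torch.distributions.Categorical(logits=logits).sample()  # temperature 1.0
    tokens = torch.cat([tokens, next_tok.unsqueeze(1)], dim=1)
print(tokens[0, 1:].tolist())                                 # e.g. [1, 1, 0]
```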
- Detokenization to a Scalar
- What happens: Convert the generated token sequence back to a number ŷ.
- Why this step exists: We can only measure real regression error after we have the final number. Without detokenization, no numeric reward is possible.
- Example: <1><1><0> (base-2) → 6; <+><1><2><3><E-2> → 1.23×10^-2.
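A minimal sketch of the inverse step, mirroring the normalized tokenizer sketched earlier and one common reading of a scientific-notation sequence; both functions are illustrative and actual tokenizer conventions may differ.

```python
def detokenize_normalized(digits, y_min, y_max, base=2):
    """Inverse of the normalized tokenizer sketch: base-B fraction -> scalar."""
    z = 0.0
    for d in reversed(digits):
        z = (z + d) / base
    return y_min + z * (y_max - y_min)

def detokenize_scientific(sign, mantissa_digits, exponent):
    """E.g. ('+', [1, 2, 3], -2) -> 1.23e-2 under a sign/mantissa/exponent convention."""
    mantissa = sum(d * 10 ** (-i) for i, d in enumerate(mantissa_digits))
    return (1 if sign == "+" else -1) * mantissa * 10 ** exponent

print(detokenize_normalized([1, 1, 0], 0.0, 1.0))      # 0.75
print(detokenize_scientific("+", [1, 2, 3], -2))       # ~0.0123 (i.e., 1.23e-2)
```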
- Sequence-level Reward Design
- What happens: Compute a reward R(τ) after the full sequence τ is generated, using a distance in target space. The paper mainly uses negative MSE on a normalized/quantile-transformed target.
- Why this step exists: It links learning directly to regression goals. Without sequence-level reward, the model may optimize easy tokens and miss the true number.
- Example: If truth is 100, prediction 101 gets a better reward than 200; rewards can be clipped (e.g., at a minimum of -50) for stability.
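A minimal sketch of such a reward, assuming negative squared error in the (normalized) target space with a lower clip; the clip value of -50 follows the example above and is otherwise illustrative.

```python
def sequence_reward(y_pred, y_true, clip_min=-50.0):
    """Sketch of the end-of-sequence reward: negative squared error between the
    detokenized prediction and the target, clipped from below for stability."""
    return max(-(y_pred - y_true) ** 2, clip_min)

# With a target of 100, predicting 101 is rewarded far more than predicting 200:
print(sequence_reward(101.0, 100.0))   # -1.0
print(sequence_reward(200.0, 100.0))   # -50.0 (raw value -10000.0, clipped)
```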
- Policy-Gradient Update (REINFORCE-style)
- What happens: Adjust token probabilities to make high-reward sequences more likely.
- ReMax baseline: use greedy decoding's reward as the baseline.
- GRPO baseline: use the mean and standard deviation from a group of G samples, plus importance-sampling and clipping.
- Why this step exists: It turns the non-differentiable reward into a usable learning signal. Without it, the model can't learn from the final scalar error.
- Example: If sampled sequence beats the greedy baseline, increase its log-probabilities; if worse, decrease them.
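To illustrate the two baselines, here is a minimal PyTorch sketch of the advantage computations and a REINFORCE-style loss; function names are illustrative, and GRPO's importance-sampling ratio and clipping are omitted for brevity.

```python
import torch

def remax_advantages(sample_rewards, greedy_reward):
    """ReMax-style advantage: reward of each sampled sequence minus the reward
    obtained by greedy decoding on the same input."""
    return sample_rewards - greedy_reward

def grpo_advantages(sample_rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within the group of G samples."""
    return (sample_rewards - sample_rewards.mean()) / (sample_rewards.std() + eps)

def reinforce_loss(seq_logprobs, advantages):
    """REINFORCE-style objective: push up the log-probability of whole sequences
    in proportion to their advantage."""
    return -(advantages.detach() * seq_logprobs).mean()

# Toy example: 4 sampled rollouts for one input.
sample_rewards = torch.tensor([-1.0, -4.0, -0.25, -9.0])
seq_logprobs = torch.tensor([-2.3, -1.9, -2.7, -1.5], requires_grad=True)  # sum of token log-probs per sequence

adv = remax_advantages(sample_rewards, greedy_reward=-2.0)
loss = reinforce_loss(seq_logprobs, adv)
loss.backward()                                   # gradients nudge the policy toward above-baseline rollouts
print(adv.tolist())                               # [1.0, -2.0, 1.75, -7.0]
print(grpo_advantages(sample_rewards).tolist())   # group-standardized alternative to the greedy baseline
```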
- Aggregation at Inference
- What happens: At test time, sample m candidates and aggregate (mean or median). Different aggregations target different risk profiles and metrics.
- Why this step exists: Aggregation stabilizes outputs and matches evaluation metrics (e.g., median can resist outliers).
- Example: Generate 128 samples for each x and take the median as the final prediction.
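For instance, a minimal sketch of this aggregation step (the sampled values are illustrative):

```python
import numpy as np

def aggregate_predictions(samples, how="median"):
    """Aggregate m sampled predictions into one estimate; the median resists
    occasional outlier decodings better than the mean."""
    samples = np.asarray(samples, dtype=float)
    return float(np.median(samples) if how == "median" else np.mean(samples))

samples = [102.0, 98.5, 101.0, 5000.0]           # one hallucinated outlier
print(aggregate_predictions(samples, "mean"))     # 1325.375 (dragged by the outlier)
print(aggregate_predictions(samples, "median"))   # 101.5
```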
- Secret Sauce (What makes it clever)
- Grading the whole number, not each token: The reward only arrives after detokenization, matching the real objective.
- Lightweight RL: REINFORCE variants (ReMax/GRPO) exploit deterministic transitions in decoding; no critic/value model is required.
- Sampling efficiency: RL makes single-sample predictions stronger (best@1), which means fewer samples are needed to be accurate in practice.
Concrete Mini-Examples:
- Tabular with normalized tokenization (base-2, length 8):
- y is z-scored, then min-max scaled to [0,1], then written as 8 base-2 digits.
- During RL, the model samples 16 rollouts per input; rewards computed as negative MSE back in the original target space after inverse transforms.
- Code with IEEE tokenizer:
- y is mapped via a quantile transform to a near-Gaussian space to reduce outlier impact; G = 4 rollouts per input; median of 64 samples for evaluation.
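The two target-transform recipes above can be sketched as follows, using scikit-learn's QuantileTransformer as one possible implementation of the quantile map; the toy data and constants are illustrative.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

y = np.array([3.0, 7.0, 12.0, 250.0, 9.0])        # toy targets with one outlier

# Tabular recipe: z-score, then min-max scale into [0, 1] before digit tokenization.
z = (y - y.mean()) / y.std()
y01 = (z - z.min()) / (z.max() - z.min())
print(np.round(y01, 3))                            # outlier pinned at 1.0, rest squeezed near 0

# Code-metric recipe: quantile transform toward a Gaussian to blunt outliers.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=len(y))
y_gauss = qt.fit_transform(y.reshape(-1, 1)).ravel()
print(np.round(y_gauss, 3))                        # ranks mapped onto Gaussian quantiles; the outlier no longer dominates
```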
What breaks without each step:
- No encoder: tokens can't reflect input features → random numbers.
- No tokenization: no sequence to decode → can't use the LLM's strengths.
- No detokenization: can't compute numeric reward → no RL learning.
- No sequence-level reward: misaligned objective → good tokens, bad numbers.
- No baseline in policy gradient: updates become noisy → unstable learning.
Putting it together as a recipe: Input x → Encode to φ(x) → Decode tokens (sample several) → Detokenize to numbers → Compute sequence-level rewards (e.g., negative MSE) → Policy-gradient update (ReMax/GRPO) → Repeat → At test time, sample and aggregate (mean/median) → Final prediction.
04 Experiments & Results
The Test (what and why):
- Domains: (1) 100 tabular regression tasks from TALENT; (2) code-to-metric prediction (APPS Leetcode peak memory; Triton kernel latency).
- Metrics: RMSE (lower is better), R² (higher is better), and Spearman's rank correlation (higher is better). These measure error size, explained variance, and order consistency.
- Why: They show both accuracy (how close) and reliability (how well the model preserves ordering), which matter for real decisions.
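For reference, these three metrics can be computed as in the following sketch (toy values, with scipy used for the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred))          # ~0.158 (average error size)
print(r2(y_true, y_pred))            # ~0.98  (explained variance)
rho, _ = spearmanr(y_true, y_pred)
print(rho)                           # 1.0    (ordering fully preserved)
```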
The Competition (who we compared against):
- Pointwise head (classic scalar regression), Riemann head (histogram over bins), and decoding-based baselines using token-level improvements (NTL-WAS, NTL-MSE, DIST) trained with cross-entropy.
The Scoreboard (with context):
- Tabular (TALENT, normalized tokenization):
- Base decoder model already competitive with Riemann.
- GenRe-ReMax improves across the board: for example, RMSE ≈ 0.519 vs ≈ 0.548 for the base model (like moving from a B to an A), and stronger average R² and rank correlation.
- GenRe-GRPO also improves but is more sensitive to the choice of digit base; performance drops at base=10.
- Across digit bases (2 to 10), ReMax remains robust and typically best in R²; GRPO degrades as base grows.
- Code metric regression:
- On APPS Leetcode, ReMax improves RMSE and R² over the base model (from near zero R² to a clear positive R²), and boosts rank correlation close to 0.97.
- On Triton kernel latency, ReMax matches base RMSE and improves rank correlation, while token-level finetuning often harms performance (even catastrophically with some losses), suggesting RL preserves ability better than SFT in this setup.
Surprising Findings:
- Best@k vs mean/median: Like in RL for reasoning LLMs, RL increases best@1 (single-shot quality) but can reduce performance at very large k (less exploration), while still delivering better mean/median R² across practical k.
- Entropy shrinks: RL makes the output distribution sharper (lower entropy), which raises sampling efficiency but can over-sharpen uncertainty.
- Reward standardization in GRPO is a key sensitivity: removing it largely fixes performance drops at high digit bases. This indicates normalization can bias gradients in this regression setting.
Making the numbers meaningful:
- Think of RMSE going from ~0.548 to ~0.519 on average across 100 tasks as moving a class test average from 82 to 88: consistently better, not just on one or two questions.
- A rank correlation around 0.78 vs ~0.77 means orderings are more trustworthy, helping with prioritization decisions (e.g., which items are likely larger/smaller).
- A jump in APPS Leetcode rank correlation to ~0.967 is like going from guessing the race finish order to almost perfectly lining up the winners.
Extra diagnostics:
- Wasserstein-1 distance (distance between the model's output distribution and the target) drops under RL, evidence that the full output distribution is moving closer to the truth.
- Visuals show the base model's distribution is high-entropy and biased, while the RL-tuned model becomes low-entropy and better centered near the true y.
05 Discussion & Limitations
Limitations (be specific):
- Outliers with unbounded tokenizers: Scientific-notation tokenizers (P10/IEEE) can produce extreme values due to hallucinated exponents or mantissas; RL reduces but does not eliminate this.
- Over-sharpened uncertainty: RL improves single-sample accuracy by making predictions more certain, which can hurt calibration and decrease exploration (worse best@k at large k).
- Sensitivity in GRPO: Reward standardization and clipping help stability but, in this setting, can bias learning; GRPO performance varied with digit base.
Required Resources:
- Compute for rollouts: Typical budgets were G=16 (tabular) and G=4 (code), with 100-200 training epochs and sampling m=64-128 at evaluation; implemented with accelerate and DeepSpeed ZeRO-2.
- Data transforms: Need careful target normalization (z-score or quantile) matched to tokenization and domain to avoid reward blow-ups.
When NOT to Use:
- Ultra-high-precision regimes where tiny rounding differences matter more than sequence flexibility (e.g., exact physics constants) and where simple pointwise regressors suffice.
- Settings with extremely scarce compute or strict latency budgets that cannot afford even small rollout sampling.
- Tasks where uncertainty calibration and full distribution modeling (not just point predictions) are the top priority without any post-calibration plan.
Open Questions:
- Can we keep the accuracy gains while preserving calibration? (Entropy regularization, temperature control, or post-hoc calibration may help.)
- Better RL algorithms for regression: Can we design advantage baselines and normalizations that avoid bias yet keep stability across tokenizations?
- Verifier-style training: Can GenRe be combined with generative reward models or verifiers to judge entire numeric reasoning chains, not just endpoints?
- Architecture synergy: How does GenRe pair with modern tabular foundations (e.g., TabPFN variants) or graph-structured encoders for tables?
- Exploration vs exploitation: Can pass@k-style objectives or negative reinforcement help widen the search without sacrificing single-sample quality?
06 Conclusion & Future Work
3-Sentence Summary: This paper shows that treating number prediction as sequence generation and grading the whole number at the end with reinforcement learning fixes the mismatch of token-level training. Using ReMax and GRPO, the approach outperforms strong pointwise, histogram, and improved token-level baselines across 100 tabular tasks and code-to-metric prediction. It also boosts sampling efficiency and single-sample accuracy by making outputs sharper and more numerically coherent.
Main Achievement: The key contribution is GenRe: a simple, effective sequence-level RL framework for decoding-based regression that directly optimizes numerical accuracy via end-of-sequence rewards.
Future Directions:
- Develop RL methods that maintain accuracy without over-sharpening, improving calibration and uncertainty estimates.
- Explore verifier-style training and more robust tokenizers to handle unbounded ranges with fewer outliers.
- Combine with modern tabular and multimodal encoders to broaden applicability.
Why Remember This: If you care about the final number, you should grade the final number. GenRe shows that a small shift, from token-level losses to sequence-level rewards, unlocks the full potential of LLM-style decoding for regression, making predictions both stronger and more practical in real-world settings.
Practical Applications
- Predict cloud compute costs or runtime directly from code snippets to optimize budgets.
- Forecast peak memory usage for programs before deployment to prevent crashes.
- Estimate wait times (e.g., customer service or delivery) from textual tickets or logs.
- Predict energy consumption for tasks or workflows described in natural language.
- Score risk or demand levels from mixed inputs (text notes + numbers) in operations dashboards.
- Do quick what-if analysis by editing short text prompts and seeing updated numeric outcomes.
- Improve black-box optimization by using the model's single-sample efficiency to guide next trials.
- Calibrate alerts in monitoring systems by using median aggregation to resist outliers.
- Auto-tune model parameters (e.g., batch size, memory limits) by predicting numeric performance metrics.
- Rank candidates (e.g., products, configs) by predicted score using the model's strong rank correlation.