CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Key Summary
- •Modern music AIs can follow text, lyrics, and even example audio, but judges that score these songs have not kept up.
- •This paper builds CMI-RewardBench, a single playing field to test how well reward models judge music under mixed instructions (text, lyrics, audio).
- •It creates two datasets: CMI-Pref-Pseudo (110k AI-judged, consistency-checked pairs) and CMI-Pref (4,027 expert-judged pairs with confidence and reasons).
- •The authors design CMI-RM, a small (about 30M parameters) reward model that understands text, lyrics, and audio and outputs two scores: musicality and alignment.
- •Across many tests, CMI-RM agrees with human judges more than popular general models, and often matches or beats specialized open baselines.
- •They show test-time scaling: generate several songs, then let the reward model pick the best (top-k), which boosts quality without retraining.
- •The benchmark also reveals that even top multimodal LLMs struggle to reach 80% agreement with human preferences in this complex setting.
- •Careful pseudo-labeling (with position-consistency checks) and label smoothing make the big AI-labeled dataset a helpful warm-up before expert fine-tuning.
- •Longer listening windows help judging; short 10‑second clips can miss structure and alignment that unfold over time.
- •All data, benchmark, and models are released to push fairer and more practical evaluations for music generation.
Why This Research Matters
People don’t just want any song; they want songs that match their words, their lyrics, and even a favorite style clip. This work makes judging those songs fair and flexible, so creators and tools can trust the scores. With a compact, open model and a shared benchmark, the community can compare new systems honestly and improve faster. Musicians and producers get better first takes by letting the reward model pick the best from several generations. Listeners benefit because the music more reliably fits their intent and sounds polished. In classrooms and studios, scores can guide feedback loops that actually match human taste. And because everything is released, researchers and startups can build on it without starting from scratch.
Detailed Explanation
01Background & Problem Definition
🍞 Top Bread (Hook): Imagine your music teacher asking you to compose a song that is fast, happy, uses guitar, and sounds a bit like a sample track your friend played. Now imagine a robot band tries to do that. Who decides if the robot did a good job?
🥬 Filling (The Actual Concept): Music generation AIs have gotten great at following different kinds of instructions—plain text ("upbeat rock"), full lyrics, and even a reference audio clip for style. But the judges that score these AI songs lag behind. What’s needed are judges (reward models) that can grade musicality (how good it sounds) and alignment (how well it follows instructions) across any mix of inputs. Without this, we can’t fairly compare models or pick the best take during generation. If judging is wrong or too simple, great ideas get bad scores and weak songs slip through.
🍞 Bottom Bread (Anchor): Think of a cooking show where chefs get a recipe plus a photo and a taste sample; if the judges only check the spice level, they miss presentation and texture. Music judges need to check many things too, all at once.
🍞 Top Bread (Hook): You know how some songs just feel polished—clean sounds, steady rhythm, catchy melody? That’s musicality.
🥬 Filling (The Actual Concept): Musicality is the overall sound quality and listenability of a piece: melody, rhythm, structure, instrument realism, and production clarity. A judge checks each of these, then forms an overall impression. If we skip musicality, a song that follows the rules perfectly can still sound flat, noisy, or messy—like a homework assignment done correctly but sloppily.
🍞 Bottom Bread (Anchor): Two songs might both be “fast pop,” but the one with a stable beat, clear vocals, and a memorable hook wins on musicality.
🍞 Top Bread (Hook): Imagine you draw a picture exactly like the instructions said—blue sky, red house, two trees—that’s alignment.
🥬 Filling (The Actual Concept): Instruction following (alignment) means how closely the music matches the inputs: the text prompt (genre, mood, tempo), the lyrics (words fitting rhythm and style), and the reference audio (style transfer). Judges check if the promised ingredients really show up. Without alignment, the model can drift off—asking for “lo-fi chill” but getting “EDM rave,” which confuses or disappoints users.
🍞 Bottom Bread (Anchor): If the prompt asks for “slow, sad piano with rain sounds,” but you hear a fast techno beat, alignment is clearly off.
🍞 Top Bread (Hook): You know how you might give directions using words, a doodle, and a photo example? That’s three signals combined.
🥬 Filling (The Actual Concept): Compositional Multimodal Instruction (CMI) means instructions that can mix and match text, lyrics, and reference audio. Judges must flexibly handle any subset: text-only, text+lyrics, text+audio, or all three. Without CMI-aware judging, evaluations break whenever the inputs aren’t just text.
🍞 Bottom Bread (Anchor): A singer gives lyrics and a favorite song as a style reference; the judge needs to check both the words and the style match.
🍞 Top Bread (Hook): When you don’t have enough teachers to grade homework, sometimes older students help—but you double-check their work.
🥬 Filling (The Actual Concept): In the past, music evaluation used two extremes: distribution metrics (like FAD) that compare big piles of audio, or narrow single-skill scorers (like only text–music match). These miss sample-level decisions and multimodal mixes. The paper fills that gap with large preference datasets (pairwise A vs B on the same prompt), a unified benchmark, and a compact judge model trained to reflect human taste across CMI. Without such data and a common testbed, progress is scattered and not comparable.
🍞 Bottom Bread (Anchor): It’s like moving from “How good is the whole school?” to “Which of these two essays answers this exact question better?”—much more useful for real decisions.
02Core Idea
🍞 Top Bread (Hook): You know how a science fair needs fair rules, lots of examples, and fair judges who can understand different kinds of projects?
🥬 Filling (The Actual Concept): The key insight: build a single ecosystem—data, benchmark, and a small unified reward model—that judges music under any mix of text, lyrics, and audio instructions, scoring both musicality and alignment like humans do.
- What it is: A package called CMI-RewardBench plus two datasets (CMI-Pref-Pseudo and CMI-Pref) and a compact reward model family (CMI-RM) that handles all input combinations.
- How it works: (1) Gather lots of paired comparisons across many modalities; (2) create reliable pseudo-labels using an LLM judge with position-consistency checks; (3) fine-tune on expert human labels; (4) evaluate models on a unified benchmark across regression (correlation) and pairwise accuracy; (5) use the reward model to pick the best of many generations at inference time.
- Why it matters: Without a unified, CMI-aware judge, models can’t be fairly compared or steered, and users won’t reliably get music that both sounds good and follows their instructions.
🍞 Bottom Bread (Anchor): It’s like having one trusted art judge who can fairly compare pencil sketches, watercolor paintings, and clay sculptures using the same rules.
Multiple analogies:
- Cooking show analogy: The prompt is the recipe (text), the plating picture (lyrics rhythm fit), and a taste sample (reference audio). The judge (reward model) must check taste (musicality) and recipe-following (alignment) no matter which combination the chef was given.
- Sports referee: Different sports (text-only vs text+audio vs lyrics) still need one fair ref who knows the rules and can score plays (pairwise preferences) and statistics (correlations).
- Librarian with mixed clues: To find the right book, you might give a summary (text), a quote (lyrics), and a cover image (audio style). The librarian (judge) must use any combination to pick the best match.
Before vs After:
- Before: Scattered metrics, mostly single-modality, few sample-level judgments, hard to compare systems or filter outputs.
- After: One benchmark, rich pairwise preferences across modalities, and a compact judge that correlates strongly with humans and improves generations via best-of-N selection.
Why it works (intuition):
- Preference pairs are simple for humans (and scalable with pseudo-labeling), giving strong training signals about what people actually prefer.
- Checking any subset of modalities forces the model to learn flexible alignment, not brittle one-trick patterns.
- A small head on strong frozen encoders keeps the model efficient yet expressive enough to capture subtle musical and semantic cues.
Building blocks (brief Sandwich intros):
- 🍞 Hook: You know how you sometimes ask for a song with words, vibe, and a sample beat? 🥬 Concept: CMI (mixing text, lyrics, audio). It lets instructions be flexible; the judge must adapt. Without it, the judge breaks when inputs vary. 🍞 Anchor: A rapper provides lyrics and a reference track; the judge must check both.
- 🍞 Hook: When two cupcakes are close, judges taste them side-by-side. 🥬 Concept: Pairwise preferences. It compares A vs B under the same prompt. Without pairs, scoring is vague and less human-like. 🍞 Anchor: “Which of these two songs better fits ‘slow, dreamy piano’?”
- 🍞 Hook: If a helper misreads because the order changed, you ask twice and keep only consistent answers. 🥬 Concept: Pseudo-label position-consistency. Ask the LLM both (A,B) and (B,A) and keep only stable choices. Without this, labels are noisy. 🍞 Anchor: The same cookie should win even if you swap which plate is left or right.
- 🍞 Hook: Practice on lots of okay tutors, then polish with a real teacher. 🥬 Concept: Distill on big pseudo data, then fine-tune on expert labels. Without the warm-up, models overfit; without experts, they miss human nuance. 🍞 Anchor: Preseason scrimmages, then the championship.
- 🍞 Hook: Pick the best photo from 10 selfies. 🥬 Concept: Best-of-N (top-k filtering). Generate many, let the judge choose. Without it, you keep a random take. 🍞 Anchor: Taking 10 shots and keeping the best smile.
03Methodology
High-level flow: Input (text, optional lyrics, optional reference audio, and the candidate song) → Encoders + Transformers → Two scores: Alignment and Musicality → Use for evaluation or picking the best generation.
Step-by-step with Sandwich explanations where new ideas appear:
- Build the data foundation
- 🍞 Hook: When training a judge, you first need many fair coin-flip questions like “A or B?”
- 🥬 Concept: CMI-Pref-Pseudo (110k pairs) uses a strong LLM judge (Qwen3-Omni) to label A vs B under each prompt, but only keeps answers that stay the same when the order is flipped. How it works: (a) generate pairs from many models across modalities; (b) ask the LLM to choose the better one for musicality and alignment; (c) repeat with swapped order (B,A); (d) keep only consistent results; (e) apply label smoothing during pretraining to avoid over-confidence. Why it matters: reduces bias and noise so the reward model learns stable preferences.
- 🍞 Anchor: If a taster likes Cookie A more than B, they should still like A if plates swap; otherwise we toss that vote.
- 🍞 Hook: Final judges need true experts.
- 🥬 Concept: CMI-Pref (4,027 expert pairs) is human-annotated, with confidence and rationales, covering text-only, with lyrics, with audio, and all-combined. How it works: (a) diverse prompts and genres; (b) equal balance of modalities; (c) experts vote on musicality and alignment separately with confidence notes. Why it matters: gives precise signals that reflect how people actually listen and decide.
- 🍞 Anchor: Music teachers explain why one performance wins and how sure they are.
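The position-consistency filter can be sketched in a few lines. This is a toy illustration, not the paper's code: `judge` stands in for the LLM judge (Qwen3-Omni in the paper), and `make_judge` builds a hypothetical judge with an adjustable positional bias to show why the double-query check matters.

```python
def position_consistent_label(judge, prompt, song_a, song_b):
    """Query the judge in both orders; keep the vote only if it is stable.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first" or "second" for whichever clip it prefers.
    """
    v1 = judge(prompt, song_a, song_b)               # order (A, B)
    v2 = judge(prompt, song_b, song_a)               # order (B, A)
    winner1 = song_a if v1 == "first" else song_b    # map votes back to songs
    winner2 = song_b if v2 == "first" else song_a
    if winner1 == winner2:                           # position-consistent
        return winner1
    return None                                      # unstable -> discard pair

def make_judge(scores, position_bias=0.0):
    """Hypothetical judge: compares fixed quality scores, plus an optional
    bonus for whichever clip happens to be presented first."""
    def judge(prompt, first, second):
        if scores[first] + position_bias > scores[second]:
            return "first"
        return "second"
    return judge
```

A fair judge survives the swap; a judge whose pick flips with presentation order gets its vote discarded, which is exactly the noise this filter removes before pretraining.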
- Create a single test arena
- 🍞 Hook: One scoreboard for many games makes rankings fair.
- 🥬 Concept: CMI-RewardBench unifies PAM, MusicEval, Music Arena, and CMI-Pref tests. How it works: (a) regression tasks use correlations (LCC, SRCC, Kendall) to check trend-matching; (b) pairwise tasks use accuracy of choosing the human-preferred audio. Why it matters: compares apples-to-apples across multiple settings and input types.
- 🍞 Anchor: It’s like using batting average and win–loss together to judge a baseball team.
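For concreteness, here is a minimal pure-Python sketch of the numbers the benchmark reports: Pearson (LCC), Spearman (SRCC), Kendall correlation, and pairwise accuracy. This illustrates the metrics themselves, not the benchmark's implementation; ties are ignored for simplicity.

```python
from itertools import combinations

def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def ranks(x):
    """Map values to 0-based ranks (no tie averaging in this sketch)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def srcc(x, y):
    """Spearman rank correlation = Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / all pairs."""
    pairs = list(combinations(range(len(x)), 2))
    conc = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    disc = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (conc - disc) / len(pairs)

def pairwise_accuracy(model_scores, human_choices):
    """model_scores: list of (score_a, score_b); human_choices: 'A' or 'B'."""
    hits = sum(1 for (sa, sb), c in zip(model_scores, human_choices)
               if ("A" if sa > sb else "B") == c)
    return hits / len(human_choices)
```

Regression tasks (PAM, MusicEval) use the three correlations against human MOS; preference tasks (Music Arena, CMI-Pref) use the pairwise accuracy.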
- Design the small but smart judge (CMI-RM)
- 🍞 Hook: Two ears: one for the prompt, one for the song.
- 🥬 Concept: Two-tower architecture. What it is: One tower encodes the prompt parts (text, lyrics, reference audio); the other encodes the evaluation audio. How it works: (a) use frozen MuQ-MuLan encoders for text/audio; (b) fuse prompt pieces with a Prompt Transformer; (c) let a Joint Transformer compare prompt and song; (d) pool and pass to a tiny MLP head that outputs two scores: Alignment and Musicality. Why it matters: handles any mix of inputs and stays parameter-efficient (~30M), so it’s fast and flexible.
- 🍞 Anchor: A judge reads the brief (prompt), listens to the performance, then gives two grades.
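The two-tower flow can be sketched with NumPy stand-ins. Everything here is illustrative, not the real model: `encode` fakes a frozen encoder (MuQ-MuLan in the paper), `attend` is one toy attention layer standing in for the Prompt and Joint Transformers, and the head weights are random rather than trained.

```python
import numpy as np

D = 64                                            # toy embedding width
rng = np.random.default_rng(0)
W_head = rng.normal(size=(2, D)) / np.sqrt(D)     # untrained stand-in for the MLP head

def encode(x):
    """Stand-in for a frozen encoder: any input -> a D-dim embedding."""
    local = np.random.default_rng(abs(hash(x)) % (2 ** 32))
    return local.normal(size=D)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention (toy Transformer layer)."""
    scores = queries @ keys_values.T / np.sqrt(D)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ keys_values

def cmi_rm(song, text=None, lyrics=None, ref_audio=None):
    """Two towers: encode whatever prompt pieces exist, fuse, compare to the song."""
    parts = [p for p in (text, lyrics, ref_audio) if p is not None]
    prompt_tokens = np.stack([encode(p) for p in parts])
    fused_prompt = attend(prompt_tokens, prompt_tokens)   # "Prompt Transformer"
    joint = attend(encode(song)[None, :], fused_prompt)   # "Joint Transformer"
    alignment, musicality = W_head @ joint.mean(axis=0)   # pool + tiny head
    return float(alignment), float(musicality)
```

The point of the shape flow: any non-empty subset of prompt modalities produces the same fused representation size, so one head can always emit the two grades.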
- Train in two stages
- 🍞 Hook: Practice with lots of drills, then polish with a coach.
- 🥬 Concept: Two-stage training. How it works: Stage 1 (Distill on CMI-Pref-Pseudo): learn pairwise preferences using the Bradley–Terry setup (model learns P(A>B) from score differences), apply label smoothing to fight over-confidence; early stop after ~2k steps. Stage 2 (Expert fine-tuning): mix CMI-Pref and MusicEval; keep optimizing both heads; handle both pairs (Bradley–Terry) and 1–5 ratings (via a light score mapping); early stop on validation. Why it matters: large noisy data teaches breadth; small expert data teaches depth.
- 🍞 Anchor: Scrimmage games build stamina; a coach session sharpens technique.
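The Stage 1 objective is compact enough to write down. A minimal sketch of the Bradley–Terry pairwise loss with label smoothing, assuming one scalar score per song; the paper's training code is not reproduced here.

```python
import math

def bt_loss(score_a, score_b, label=1.0, smoothing=0.0):
    """Bradley–Terry pairwise loss with optional label smoothing.

    P(A beats B) = sigmoid(score_a - score_b); `label` is 1.0 when the
    (pseudo-)judge preferred A. Smoothing pulls the target toward 0.5 so
    the model does not become over-confident on noisy pseudo labels.
    """
    p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    target = label * (1.0 - smoothing) + 0.5 * smoothing
    # Binary cross-entropy against the (smoothed) target.
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
```

With smoothing on, an extremely confident correct prediction is penalized slightly more than a calibrated one, which is the intended brake on over-confidence from pseudo data.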
- Evaluate like humans do
- 🍞 Hook: Do we agree with the judges in the stands?
- 🥬 Concept: Two protocols. How it works: Regression (PAM, MusicEval): Check correlation with human MOS to see if higher really means better. Preference (Music Arena, CMI-Pref): Check accuracy of picking the same winner as humans. Why it matters: both trend-following and head-to-head choices matter in practice.
- 🍞 Anchor: It’s not just your test score, it’s also who wins in a one-on-one match.
- Make generation better at test time
- 🍞 Hook: Take 10 photos, keep the best one.
- 🥬 Concept: Best-of-N (top-k filtering). What it is: Generate N candidates; use the reward model’s combined score (avg of musicality and alignment) to select the best. How it works: try N=1,3,10; more candidates generally give better picks; gains can level off. Why it matters: improves quality without retraining the generator.
- 🍞 Anchor: Record 10 takes of a guitar riff and keep the cleanest, most on-prompt version.
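The selection step is simple enough to sketch directly; `generate` and `score` are hypothetical callables standing in for the music generator and for CMI-RM, which returns (alignment, musicality).

```python
def best_of_n(generate, score, prompt, n=10):
    """Generate n candidates and keep the one the reward model ranks highest.

    Candidates are ranked by the average of the two heads' scores, as in
    the paper's top-k filtering setup.
    """
    candidates = [generate(prompt) for _ in range(n)]

    def combined(song):
        alignment, musicality = score(prompt, song)
        return (alignment + musicality) / 2

    return max(candidates, key=combined)
```

No generator retraining is involved: the only extra cost is generating and scoring the additional candidates.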
Secret sauce highlights:
- Position-consistency in pseudo labels cuts LLM ordering bias.
- Label smoothing reduces over-confidence from pseudo data and boosts final accuracy.
- Multimodal prompt context helps judge even musicality, because human taste often depends on intent (e.g., lo-fi grit can be a feature, not a bug).
- Longer listening windows (up to ~120s) improve judgments that rely on structure and alignment over time.
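One simple way to extend the listening window is the mean-over-chunks scheme mentioned later under resources: split a long track into fixed windows, score each, and average. A hedged sketch, where `score_clip` is a hypothetical single-window scorer:

```python
def mean_over_chunks(score_clip, samples, sr, window_s=120):
    """Score a long track by averaging clip scores over fixed windows.

    `samples` is a flat sequence of audio samples at rate `sr`; longer
    windows let the judge hear structure a 10-second clip would miss.
    """
    win = max(1, int(window_s * sr))
    chunks = [samples[i:i + win] for i in range(0, len(samples), win)]
    return sum(score_clip(c) for c in chunks) / len(chunks)
```

The trade-off is runtime: more or longer windows mean more encoder passes per track.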
04Experiments & Results
The test: Do CMI-RMs agree with people and help pick better music?
What they measured and why:
- Correlation on absolute scores (PAM, MusicEval): If humans score one song higher, does the model’s score go up too?
- Pairwise preference accuracy (Music Arena, CMI-Pref): In A vs B under the same prompt, does the model pick the same winner as human judges?
- Best-of-N gains: If the reward model chooses among multiple generations, does the final music get better?
The competition:
- Musicality baselines: PAM, Audiobox (CE, CU, PC, PQ), SongEval-RM.
- Alignment baselines: CLAP (default/music), CLAMP3, MuQ-MuLan.
- General Audio/Multimodal LLMs: Qwen2-Audio, Qwen2.5-Omni, Qwen3-Omni, Gemini 2.5 Flash/Pro, Gemini 3 Pro.
Scoreboard with context:
- On CMI-Pref pairwise (hard, multimodal), CMI-RM hits about 78.2% agreement—like scoring an A when many strong students are getting Bs. Frontier MLLMs lag (e.g., Gemini 3 Pro ~65.8%, Qwen3-Omni ~60.4%).
- On PAM/MusicEval correlations, CMI-RM trained with both CMI-Pref and MusicEval reaches strong rank correlations (e.g., SRCC ~0.8266 on MusicEval), rivaling or beating specialized baselines while staying small and efficient.
- On Music Arena (live user comparisons), CMI-RM is competitive (~73% accuracy), and analysis shows musicality dominates user preferences overall.
- Compositional alignment breakdown: CMI-RM fine-tuned on CMI-Pref shines when all modalities are present (text + lyrics + audio), reaching low-80%s accuracy—like being the only judge who can reliably grade a triathlon, not just running.
Surprising findings:
- Short listening (first 10 seconds) can miss structure; longer windows (up to ~120 seconds) noticeably improve alignment and musicality judgments on longer tracks.
- Adding prompt context helps even with musicality prediction—because humans judge quality relative to intent (e.g., lo-fi noise may be desirable when the prompt calls for it).
- Pseudo-pretraining without label smoothing can make the model overly confident and less adaptable; smoothing fixes this and improves downstream accuracy.
- General MLLMs sometimes fail not because they lack knowledge, but because they don’t follow the strict evaluator format reliably in audio tasks (e.g., they write captions instead of making a choice), sinking accuracy.
Best-of-N (top-k) results made simple:
- For MusicGen-small, picking the best from 10 candidates improved both alignment and musicality metrics steadily—like taking multiple shots to get the perfect one.
- For Stable Audio Open-small, gains were smaller and saturated earlier, suggesting diminishing returns when candidates are already similar.
Bottom line: A small, CMI-aware judge outperforms bigger generalists, travels well across datasets, and makes generation better simply by picking the best take.
05Discussion & Limitations
Limitations (be specific):
- Coverage gaps: Even with 110k pseudo pairs and 4k expert pairs, music is vast; some styles, languages, and rare instrument combos may be underrepresented.
- Dependence on frozen encoders: Using fixed MuQ-MuLan backbones limits adaptation to new sonic artifacts or novel lyric patterns; fully end-to-end finetuning could help but costs more compute.
- Long-context constraints: While longer windows help, the current model still summarizes; very long songs or complex story lyrics might need stronger long-range modeling.
- Pseudo-label bias: Despite position-consistency checks, AI judges can still encode hidden biases; human fine-tuning reduces but doesn’t erase them.
Required resources:
- Training needs access to large paired audio corpora and multiple generators/APIs for diversity; GPU memory for transformer inference over audio chunks; storage for long-form audio and reference clips.
- Inference uses modest compute (~30M trainable parameters on top of frozen encoders), but longer windows or mean-over-chunks scoring increase runtime.
When NOT to use it:
- Tasks outside the trained domain (e.g., spoken-word poetry with complex dialog, non-musical sound design) may confuse the musicality/alignment heads.
- Legal or ethics decisions (e.g., plagiarism/copyright checks) require separate tools; the reward model is about preference, not legal similarity.
- One-shot, ultra-low-latency scoring on tiny clips where musicality and alignment can’t be observed yet.
Open questions:
- How to model very long structure (minutes) and narrative alignment to lyrics without losing efficiency?
- Can we further debias pseudo labels (beyond position checks) or co-train with multiple independent judges for consensus?
- What’s the best way to calibrate scores across genres so niche styles aren’t unfairly penalized?
- Can reward models also explain their decisions to creators in musically actionable terms (e.g., “increase snare stability at 0:25–0:40”)?
06Conclusion & Future Work
Three-sentence summary: This paper builds a full ecosystem—datasets, a benchmark, and a small unified reward model—to fairly judge AI-made music under any combination of text, lyrics, and reference audio. The reward model strongly agrees with human preferences across musicality and alignment and improves generation quality at inference by picking the best of several candidates. By revealing gaps in current general-purpose judges and offering open tools, it moves music evaluation closer to real creative needs.
Main achievement: A unified, compositional, multimodal evaluation stack (CMI-RewardBench + CMI-Pref/CMI-Pref-Pseudo + CMI-RM) that delivers state-of-the-art human-aligned judging while staying lightweight and practical.
Future directions:
- Longer-context architectures that track song sections, lyrical narratives, and evolving arrangements.
- Richer multimodal supervision (e.g., timing-aligned lyrics, chord/structure annotations) to sharpen alignment.
- Better pseudo-labeling via multi-judge ensembles and uncertainty modeling.
- Creator-facing explanations that turn scores into concrete, time-stamped feedback.
Why remember this: As music AIs learn from mixed instructions the way humans do, this work gives us the fair, flexible judge we need—one that can understand intent, listen carefully, and choose the take people actually prefer.
Practical Applications
- •Quality boosting at inference: Generate multiple candidates per prompt and let CMI-RM pick the best (best-of-N).
- •Studio assistant: Score musicality and alignment for drafts, then iterate on the weakest aspect (e.g., tighten rhythm or better match lyrics).
- •Model selection: Compare different music generators fairly under mixed inputs (text, lyrics, audio) to choose the right tool.
- •Prompt engineering: Diagnose which prompt parts (genre, mood, instrumentation) the model failed to follow and adjust instructions.
- •Dataset curation: Filter out low-quality or off-prompt samples before training new music models.
- •Leaderboard building: Rank music systems on shared prompts using CMI-RewardBench for transparent benchmarking.
- •A/B testing: Run user studies more efficiently by pre-filtering to the top candidates according to CMI-RM.
- •Educational feedback: Provide students with alignment vs musicality breakdowns to focus practice.
- •Pipeline safety rails: Reject generations that strongly violate requested style or contain severe audio artifacts.
- •Style transfer tuning: Score how well outputs match a reference track’s style to improve audio-conditioned control.