HeartMuLa: A Family of Open Sourced Music Foundation Models
Key Summary
- HeartMuLa is a family of open-source music AI models that can understand and generate full songs with clear lyrics and strong musical structure.
- It combines four pieces: HeartCLAP (matches music and text), HeartTranscriptor (recognizes lyrics in singing), HeartCodec (turns music into efficient tokens), and HeartMuLa (the generator that makes songs).
- A key trick is an ultra-low-frame-rate tokenizer (12.5 Hz) that makes sequences much shorter without losing musical detail, so generation is faster and more coherent.
- The generator uses a hierarchical global–local design: a big model plans the structure and a smaller one paints the details, which keeps music consistent over minutes.
- Users can control style, lyrics, and even different sections (intro, verse, chorus) with natural language and optional reference audio.
- Across English, Chinese, Japanese, Korean, and Spanish, HeartMuLa achieves the lowest lyric error (PER as low as 0.09 in English) while keeping high musical quality.
- HeartCodec sets a new bar among codecs, with top VISQOL and lowest FAD/FD, and still runs efficiently thanks to distillation and decoder fine-tuning.
- Preference training (DPO) improves clarity, style match, and overall quality without building a separate reward model.
- Careful engineering (KV-cache alignment, FlashAttention, CUDA Graph) speeds up generation by up to 5.4× without hurting quality, enabling streaming.
- The project is fully open-source, showing Suno-level performance can be approached with academic-scale data and GPUs.
Why This Research Matters
Open, controllable music AI lets more people create: students can learn by tweaking prompts, hobbyists can sketch songs quickly, and pros can iterate faster on ideas. Clearer lyrics across multiple languages help accessibility (e.g., better subtitles) and language learning. Efficient tokenization and inference lower compute costs, making high-quality generation practical for small teams and community labs. Transparent benchmarks and code help researchers fairly compare ideas and build on a shared foundation. Responsible design choices (e.g., avoiding timbre cloning, watermarking) show how innovation and ethics can go hand-in-hand. As music culture keeps evolving online, these tools enable new forms of collaborative creativity and global participation.
Detailed Explanation
01 Background & Problem Definition
You know how writing a song takes both a big idea (like the story or mood) and lots of small details (like exact notes, words, and instrument sounds)? Early AI music systems could do one part or the other reasonably well, but doing both together—and for minutes at a time—was hard.
The world before: AI could make short music clips, describe sounds, or match sounds to tags. But full songs—with vocals, clear words, verses and choruses that make sense, and stylistic consistency—were rare. Many of the best systems were closed: their models, training data, or evaluation methods weren’t shared. That made it hard for students, researchers, and creators to learn, build, or improve on them. Also, many models used high-frame-rate audio tokenizers, which created super-long token sequences. That slowed generation, made it expensive, and made it tougher for AI to remember the big musical picture over time.
The problem: People want music they can guide, not just random jams. They want to say, “Make a bright pop song with a calm intro, a catchy chorus, and these lyrics,” and have the AI follow along. They also want the singing to be understandable—so the words don’t blur together. And they want all of this to work across languages and styles, from K-pop to ballads, from English to Spanish.
Failed attempts: Some systems tried semantic tokens from speech models to shrink sequences, but lost acoustic richness, so vocals and instruments sounded washed out. Others used powerful diffusion models to get high quality, but those were slow and tricky to control for long songs. Some tokenizers ran at 25–50 Hz, doubling or quadrupling the number of steps compared to a lower rate, making long-form generation costly. And text–audio alignment was often weak: the model didn’t always play what the words asked for, especially over minutes.
The gap: We needed a full, open, well-engineered toolbox that covers (1) matching music with text (so prompts really guide the sound), (2) recognizing lyrics in singing (so we can both clean data and measure clarity), (3) a tokenizer that is both compact and high-fidelity (so models can think long-term), and (4) a controllable generator that keeps structure over many minutes. We also needed solid benchmarks and clear, reproducible evaluations to compare progress fairly.
What this paper fills: HeartMuLa brings a family of models that fit together: HeartCLAP (text–audio alignment), HeartTranscriptor (lyric recognition tuned for singing), HeartCodec (a 12.5 Hz, high-fidelity tokenizer), and HeartMuLa (the hierarchical generator). They are open-source, trained and tested with transparent protocols, and support rich controls: tags, lyrics with structure labels like [intro]/[chorus], and optional reference audio embeddings. The generator can keep coherence up to six minutes and also has special modes: fine-grained section style control and short catchy pieces for videos.
Real stakes: For creators, this means faster sketching: describe your idea, paste your lyrics, set section styles, and get a coherent draft. For educators, it’s a hands-on lab for learning music AI, because the code and models are open. For accessibility, strong lyric clarity helps language learners and captioning. For developers, the efficient tokenizer and inference optimizations lower costs. For the research community, the released benchmarks and ablations make it easier to build, test, and improve future systems. And for the broader public, it shows that high-quality music AI can be reproducible and responsibly developed, including safeguards (e.g., avoiding timbre cloning by design choices around style embeddings).
02 Core Idea
The “Aha!” moment in one sentence: If we compress music into smart, low-frame-rate tokens that still keep meaning, then a hierarchical language model can plan big musical arcs and fill in details—so long, controllable songs become both possible and efficient.
Three analogies:
- City planning and home decorating: The global model is the city planner (decides neighborhoods: intro, verse, chorus), the local model is the decorator (paints textures: timbre, micro-rhythms). The plan comes first; the details make it livable.
- Comic storyboard and final art: First draw panels that tell the story (global), then ink and color each panel with shading and texture (local). You get both a clear plot and beautiful frames.
- Packing for a trip: Use vacuum bags (low-frame-rate tokens) so your suitcase (the model) can carry more days of outfits (longer songs) without getting too heavy. You still unpack clean, detailed clothes (high-fidelity decoding).
Before vs. after:
- Before: High frame rates created token floods; models forgot structure; lyrics got mushy; closed systems limited learning.
- After: 12.5 Hz tokens cut sequence length; global–local modeling keeps form and detail; lyrics come out clearer; the whole stack is open and benchmarked.
Why it works (intuition):
- The tokenizer fuses multi-level features (phonetic, musical, acoustic) and then downsamples with learnable queries to 12.5 Hz. That keeps important meaning while removing redundancy.
- Residual vector quantization (multiple codebooks) packs detail efficiently, so the local model can reconstruct rich timbres.
- The generator factors the problem: first pick the base code per frame (structure/semantics), then fill in residual codes (acoustics). The model expends its “thinking budget” on what matters at each stage.
- Conditioning with lyrics, tags, and a safe style embedding grounds the plan. Preference learning (DPO) nudges it toward clearer words, better style match, and nicer sound.
- System-level inference tricks (KV-cache alignment, FlashAttention, CUDA Graph) shave off latency so long songs become practical.
Building blocks, each with a simple sandwich explanation:
🍞 Top Bread (Hook): You know how a band needs a songwriter, a lyric coach, a sound engineer, and a final producer to make a great track? 🥬 Filling (The Actual Concept): HeartMuLa (the family) is that whole band in AI form—four models that understand, compress, align, and then generate music under your directions. How it works: 1) HeartCLAP learns music–text matching; 2) HeartTranscriptor learns to read sung lyrics; 3) HeartCodec turns audio into compact tokens and back; 4) HeartMuLa (the generator) uses those tokens plus your prompts to make songs. Why it matters: Without all four, you’d have either weak control, blurry lyrics, or inefficient generation. 🍞 Bottom Bread (Anchor): You type “upbeat pop with gentle intro, these lyrics,” and HeartMuLa produces a 3-minute song with a smooth build, clear chorus words, and matching vibe.
🍞 Hook: Imagine giving different instructions to each song part: calm intro, powerful chorus, moody bridge. 🥬 Concept: Multi-condition song generation means the model follows multiple guides—lyrics, style tags, and optional reference audio—together. How it works: 1) Pack structured lyrics with [intro]/[chorus]; 2) Add style tags (genre, mood, instruments); 3) Optionally include a style embedding from a reference clip; 4) The model blends them when generating. Why it matters: Without multi-conditions, songs drift off-prompt or lose structure. 🍞 Anchor: “Lo-fi verse, rock chorus, these words”—the verse stays chill; the chorus turns loud and guitar-heavy while singing your lines.
🍞 Hook: Think of making a flipbook: fewer pages still show the scene if each page is well chosen. 🥬 Concept: A low-frame-rate music codec represents audio using far fewer steps per second (12.5 Hz) but keeps meaning. How it works: 1) Extract rich features; 2) Learn to summarize pairs of frames; 3) Quantize with multiple small codebooks; 4) Reconstruct with a strong decoder. Why it matters: Fewer steps mean faster, longer, more stable generation. 🍞 Anchor: A 2-minute song that once needed ~6,000 steps might drop to ~1,500, yet still sounds full and clear.
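As a quick check on those numbers, here is the frame-count arithmetic behind the anchor above (a minimal sketch; 50 Hz is assumed for the higher-rate case):

```python
# Frames needed to represent a 2-minute song at different tokenizer rates.
duration_s = 120
print(duration_s * 50)    # 6000 frames at an assumed 50 Hz tokenizer
print(duration_s * 12.5)  # 1500.0 frames at HeartCodec's 12.5 Hz
```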
🍞 Hook: Like zipping a giant music file into a tiny folder you can still turn back into music. 🥬 Concept: HeartCodec is the compressor–decompressor for music tokens. How it works: 1) Combine features from music and speech encoders; 2) Downsample to 12.5 Hz with learnable queries; 3) Quantize via residual vector quantization; 4) Decode with a flow-matching model into continuous latents and then into waveform. Why it matters: Without HeartCodec, the generator would be slow or lose detail. 🍞 Anchor: You feed HeartCodec tokens to the decoder and get back a high-fidelity stereo track.
🍞 Hook: Matching a song to the right mood words is like pairing the right caption to a photo. 🥬 Concept: HeartCLAP aligns music and text in the same space so the model knows which sounds fit which words. How it works: 1) Encode music and text; 2) Pull true pairs close, push mismatches apart (contrastive learning); 3) Learn a shared embedding. Why it matters: Without it, prompts wouldn’t reliably steer sound. 🍞 Anchor: Search “warm, acoustic, cozy” and HeartCLAP helps find matching tracks.
🍞 Hook: Typing “find this picture’s description” feels like magic; doing it for songs is similar. 🥬 Concept: Cross-modal retrieval finds matches between different kinds of data (text and audio). How it works: 1) Put both in one embedding space; 2) Compare by similarity; 3) Return the closest match. Why it matters: Helps both evaluate and condition generation. 🍞 Anchor: Type “sad piano ballad” and get songs that fit.
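To make the retrieval step concrete, here is a minimal sketch assuming precomputed embeddings from a shared music-text encoder; the names and shapes are illustrative, not HeartCLAP's actual API:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_text_emb, library_audio_embs, k=5):
    """Rank library tracks by cosine similarity to the text query in the
    shared embedding space and return the k closest matches."""
    q = F.normalize(query_text_emb, dim=-1)        # (D,)
    lib = F.normalize(library_audio_embs, dim=-1)  # (N, D)
    scores = lib @ q                               # (N,) cosine similarities
    return scores.topk(k)

# Example: 1,000 tracks with 512-dim embeddings, one text query.
values, indices = retrieve_top_k(torch.randn(512), torch.randn(1000, 512), k=5)
```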
🍞 Hook: When you write a story, you predict the next sentence from what you wrote before. 🥬 Concept: Autoregressive modeling predicts the next audio token from previous ones. How it works: 1) Read context tokens; 2) Predict the next code; 3) Append and repeat; 4) Build frames, then full audio. Why it matters: This is how the generator builds long music step by step. 🍞 Anchor: The model writes the chorus note by note, guided by prior bars and your lyrics.
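A minimal sketch of that next-token loop, assuming a hypothetical `model` that returns logits of shape (batch, sequence, vocabulary); the sampling details are illustrative:

```python
import torch

def generate_tokens(model, prompt_tokens, num_new, temperature=1.0):
    """Toy autoregressive loop: repeatedly sample the next token from the
    model's distribution over the tokens generated so far, then append it."""
    tokens = prompt_tokens.clone()            # shape: (1, T)
    for _ in range(num_new):
        logits = model(tokens)[:, -1, :]      # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens
```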
🍞 Hook: Learning differences is like sorting cats vs. dogs by seeing many pairs. 🥬 Concept: Contrastive learning teaches models by bringing true pairs closer and pushing false pairs apart. How it works: 1) Create batches of correct and incorrect pairs; 2) Optimize a loss that increases the true-pair similarity; 3) Repeat across data. Why it matters: Stronger text–music alignment boosts control. 🍞 Anchor: “Rock, electric guitar” pairs with crunchy riffs, not flutes.
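Here is a minimal sketch of that contrastive objective in the usual CLIP/CLAP style; the temperature value and tensor names are assumptions, not HeartCLAP's exact recipe:

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (audio, text) pairs sit on the diagonal
    of the similarity matrix; every off-diagonal entry is a negative."""
    audio_emb = F.normalize(audio_emb, dim=-1)        # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = audio_emb @ text_emb.t() / temperature   # (B, B) cosine sims
    targets = torch.arange(audio_emb.size(0))         # true pair = same index
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```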
🍞 Hook: Counting misheard sounds in singing is like counting mispronounced words. 🥬 Concept: Phoneme Error Rate (PER) measures how clearly the model sings the intended sounds. How it works: 1) Separate vocals; 2) Transcribe with HeartTranscriptor; 3) Compare phonemes to the reference lyrics; 4) Lower is better. Why it matters: High PER means mushy or wrong lyrics. 🍞 Anchor: English PER 0.09 beats 0.13, meaning clearer words in the chorus.
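The comparison step reduces to an edit distance over phoneme sequences; a minimal sketch, using characters as stand-in phonemes:

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = (substitutions + insertions + deletions) / len(reference),
    computed with a standard Levenshtein distance over phoneme sequences."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

# One wrong phoneme out of ten gives PER = 0.10.
ref = list("heartmulaa")  # stand-in for a reference phoneme sequence
hyp = list("heartmulab")  # transcription with one error
print(phoneme_error_rate(ref, hyp))  # 0.1
```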
03 Methodology
At a high level: Input (lyrics + tags + optional reference audio) → HeartCodec turns training audio into tokens → HeartMuLa Global model plans structure (layer-0 tokens) → HeartMuLa Local model paints details (residual tokens) → HeartCodec decodes tokens back to waveforms → Output music.
Step 1. Tokenize music with HeartCodec
- What happens: HeartCodec compresses 48 kHz stereo into discrete tokens at 12.5 Hz using multi-encoder features (Whisper, WavLM, and a fine-tuned music encoder), learnable query downsampling, and residual vector quantization (8 codebooks Ă— 8192). A flow-matching decoder maps tokens to continuous latents (from a 25 Hz SQ-Codec) and then to waveform. ReFlow distillation cuts sampling steps from 50 to 10, and decoder fine-tuning boosts fidelity.
- Why this step exists: It makes sequences short and expressive, so generation is efficient and detailed. Without it, the generator would be too slow or lose rich timbre.
- Example: A 30-second pop clip becomes a short sequence of RVQ tokens. Decoding from those tokens reconstructs the original feel, vocals, and instruments with high VISQOL and low FAD/FD.
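A minimal sketch of the residual vector quantization part of this step, with toy random codebooks at the stated scale (8 codebooks × 8192 entries); the multi-encoder features, learnable-query downsampling, and flow-matching decoder are not shown:

```python
import torch

def residual_vector_quantize(latents, codebooks):
    """Minimal residual VQ: each codebook quantizes whatever the previous
    levels left over, so every frame becomes a short list of code indices."""
    residual = latents                        # (frames, dim)
    codes = []
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)     # (frames, codebook_size)
        idx = dists.argmin(dim=-1)            # nearest code per frame
        codes.append(idx)
        residual = residual - cb[idx]         # pass the leftover onward
    return torch.stack(codes, dim=-1)         # (frames, num_codebooks)

# Toy setup mirroring the stated scale: 8 codebooks, 8192 entries each;
# about one second of audio at 12.5 Hz is 12-13 frames.
codebooks = [torch.randn(8192, 64) for _ in range(8)]
tokens = residual_vector_quantize(torch.randn(12, 64), codebooks)
print(tokens.shape)  # torch.Size([12, 8])
```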
Step 2. Prepare conditions (lyrics, tags, optional reference)
- What happens: Lyrics include structure tags like [intro], [verse], [chorus] and are tokenized (Llama tokenizer). Tags list genre, mood, instruments, etc., with category sampling that favors impactful tags (e.g., genre). If provided, a 10 s reference is embedded by MuQ-MuLan (no timbre cloning) as a global style cue. These are concatenated into a condition sequence C.
- Why this step exists: It anchors the music plan to your intent. Without it, the model might drift away from your style or structure.
- Example: C = [“genre: pop, mood: joyful, instruments: piano, strings”, [intro]/[chorus] lyrics, reference style embedding].
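A hypothetical sketch of how such a condition sequence could be assembled; the toy tokenizer, field names, and packing order are illustrative rather than the project's exact format:

```python
class ToyTokenizer:
    """Stand-in for a real subword tokenizer (e.g., a Llama tokenizer);
    here it simply maps characters to integer ids."""
    def encode(self, text):
        return [ord(c) for c in text]

def build_condition_sequence(tags, lyrics_sections, style_embedding, tokenizer):
    """Flatten style tags and structure-labelled lyrics into one token
    sequence; the reference-style embedding stays a separate global
    conditioning vector (no timbre information)."""
    tag_text = ", ".join(f"{k}: {v}" for k, v in tags.items())
    lyric_text = "\n".join(f"[{sec}]\n{text}" for sec, text in lyrics_sections)
    condition_tokens = tokenizer.encode(tag_text + "\n" + lyric_text)
    return condition_tokens, style_embedding

tags = {"genre": "pop", "mood": "joyful", "instruments": "piano, strings"}
lyrics = [("intro", "la la la"), ("chorus", "hold on to the light")]
style_emb = [0.0] * 512  # placeholder for a MuQ-MuLan embedding of a 10 s clip
tokens, style = build_condition_sequence(tags, lyrics, style_emb, ToyTokenizer())
```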
Step 3. Global–local hierarchical generation
- What happens: Using tokens A with K codebooks per frame, the Global transformer predicts the base token a_{l,0} for each frame l, conditioned on past frames and the condition sequence C. The Local transformer then predicts the residual tokens a_{l,1}, ..., a_{l,K-1} for that same frame, conditioned on local context plus the Global hidden state. Training minimizes a weighted sum of per-codebook cross-entropy losses: equal weights early on; later, layer 0 is upweighted to emphasize structure.
- Why this step exists: It splits planning (structure/semantics) and painting (acoustics), which is easier to learn and faster to run. Without it, a single model would struggle to juggle global arcs and micro-details.
- Example: For frame t in the chorus, Global picks a layer-0 token that sets “big pop hook here,” and Local adds crisp vocals and bright synth texture via residual codes.
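A minimal sketch of the weighted cross-entropy objective described in this step; the exact layer-0 weight and schedule are not given here, so the value below is a placeholder:

```python
import torch
import torch.nn.functional as F

def hierarchical_token_loss(logits_per_layer, targets, layer0_weight=1.0):
    """Weighted sum of per-codebook cross-entropy losses.
    logits_per_layer: list of (batch, frames, vocab) tensors, one per layer.
    targets: (batch, frames, K) integer codes. Upweighting layer 0 later in
    training emphasizes the structural/semantic base tokens."""
    total = 0.0
    for k, logits in enumerate(logits_per_layer):
        weight = layer0_weight if k == 0 else 1.0
        total = total + weight * F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets[..., k].reshape(-1),
        )
    return total / len(logits_per_layer)

# Toy check: batch of 2 songs, 6 frames, K = 8 codebooks, vocab 8192.
K, vocab = 8, 8192
logits = [torch.randn(2, 6, vocab) for _ in range(K)]
targets = torch.randint(0, vocab, (2, 6, K))
print(hierarchical_token_loss(logits, targets, layer0_weight=2.0))
```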
Step 4. Progressive training (4 stages)
- Warmup: Short 30 s clips, lyrics + reference only. Purpose: quick convergence and learning local acoustics.
- Pretraining: Full songs, lyrics + tags + reference. Purpose: learn long-range structure and multi-condition control.
- Supervised fine-tuning (SFT): High-quality subset filtered by AudioBox and SongEval; increase weight on layer-0 loss. Purpose: polish audio quality and structure.
- Direct Preference Optimization (DPO): Use preference pairs focused on PER, tag similarity, and AudioBox/SongEval to steer the model without explicit reward models. Purpose: improve clarity, style match, and overall quality in a stable, sample-efficient way.
- Why this design: Each stage targets a different skill. Without it, the model might be good locally but weak globally, or vice versa.
- Example: After SFT, songs sound nicer and more coherent; after DPO, lyrics get noticeably clearer and style match tightens.
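For reference, the standard DPO objective applied to such preference pairs looks roughly like this; beta is an assumed hyperparameter, and the log-probabilities would be summed over a song's generated tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization: push the policy's log-probability
    margin between preferred and dispreferred songs beyond the frozen
    reference (SFT) model's margin, with no separate reward model."""
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()

# Toy check with a batch of 4 preference pairs.
print(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)))
```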
Step 5. Decode and play
- What happens: The predicted RVQ tokens go to HeartCodec’s decoder (flow-matching + SQ-Codec) to produce a 48 kHz stereo waveform. Guidance scale is tuned (e.g., 1.25) for natural sound.
- Why this step exists: It turns discrete decisions back into high-fidelity audio. Without a strong decoder, tokens would not become enjoyable music.
- Example: A 3-minute song with smooth transitions and clear vocals comes out, ready for listening.
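A minimal Euler-integration sketch of flow-matching decoding under the distilled 10-step setting; the velocity network, latent shape, and conditioning interface are placeholders rather than HeartCodec's actual components:

```python
import torch

def flow_matching_decode(velocity_model, tokens, num_steps=10, latent_shape=(1, 64, 750)):
    """Start from Gaussian noise and integrate the predicted velocity field
    toward the continuous latents, conditioned on the RVQ tokens."""
    x = torch.randn(latent_shape)             # noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt)
        v = velocity_model(x, t, tokens)      # predicted velocity at time t
        x = x + dt * v                        # one Euler step toward the data
    return x                                  # latents then go to the waveform decoder

# Toy check with a stand-in velocity network that ignores time and tokens.
toy_velocity = lambda x, t, tokens: -x
print(flow_matching_decode(toy_velocity, tokens=None).shape)  # torch.Size([1, 64, 750])
```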
Inference acceleration (the practical magic):
- KV-cache alignment: Ensure keys/values and positions stay perfectly synced across steps by using tensor-only ops instead of Python-side scalar reads; this is what makes real cache reuse possible.
- FlashAttention: Use memory-efficient attention kernels over just the valid prefix to cut kernel launches and keep speedups steady as sequences grow.
- CUDA Graph: Capture static parts of the forward pass to replay with minimal Python overhead; keep sampling and dynamic parts outside the graph.
- Why it matters: Long songs mean many steps; shaving per-step cost enables interactive and streaming use. Without these, wait times balloon and demos feel sluggish.
- Example: End-to-end latency drops from ~398 s to ~73 s (5.4Ă— faster) without hurting quality; streaming mode hits ~68 s and best PER.
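To illustrate the tensor-only idea behind the KV-cache and CUDA Graph points above, here is a minimal sketch of a preallocated cache updated without any host-side scalar reads; shapes and names are assumptions, not the project's code:

```python
import torch

def append_to_kv_cache(cache_k, cache_v, new_k, new_v, cache_len):
    """Write the new key/value at position cache_len and advance the counter
    using only tensor ops (no .item() calls), so the decoding step stays
    capture-friendly for CUDA Graphs."""
    idx = cache_len.view(1)                   # 1-element int64 index tensor
    cache_k.index_copy_(2, idx, new_k)        # caches: (B, H, T_max, D)
    cache_v.index_copy_(2, idx, new_v)
    return cache_len + 1                      # still a tensor, no host sync

# Preallocate once, reuse every decoding step.
B, H, T_max, D = 1, 16, 4096, 128
cache_k = torch.zeros(B, H, T_max, D)
cache_v = torch.zeros(B, H, T_max, D)
cache_len = torch.zeros((), dtype=torch.long)
cache_len = append_to_kv_cache(cache_k, cache_v,
                               torch.randn(B, H, 1, D), torch.randn(B, H, 1, D),
                               cache_len)
```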
Secret sauce summary:
- Ultra-low 12.5 Hz tokens that still carry semantics.
- Hierarchical global–local modeling that mirrors how musicians plan and perform.
- Preference alignment (DPO) aimed at clarity, style, and quality.
- System co-design that turns research ideas into practical, fast generation.
04 Experiments & Results
What they tested and why:
- Music quality and structure: AudioBox (CE, CU, PQ) and SongEval (Coherence, Musicality, Memorability, Clarity, Naturalness). These judge if songs sound good and are well organized.
- Style adherence: Tag Similarity (cosine between generated audio embedding and target tags). This checks if the music matches the requested vibe.
- Lyric intelligibility: Phoneme Error Rate (PER). This measures how clearly the AI sings the intended sounds.
- Codec reconstruction: VISQOL (perceptual quality), FAD/FD (distribution match), STOI/PESQ (intelligibility), SPK_SIM (speaker similarity), WER (word error rate via HeartTranscriptor). These show token quality and decoding fidelity.
- Efficiency: Real-time factor (RTF), latency, and kernel launches to show practicality.
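As a concrete reading of the Tag Similarity metric listed above, a minimal cosine-similarity sketch; the embeddings would come from a shared music-text encoder, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def tag_similarity(audio_embedding, tag_text_embedding):
    """Cosine similarity between the generated audio's embedding and the
    embedding of the target tag text, averaged over a batch of songs."""
    return F.cosine_similarity(audio_embedding, tag_text_embedding, dim=-1).mean()

# A score around 0.26 means the audio sits reasonably close to the
# requested tags in the shared embedding space.
print(tag_similarity(torch.randn(4, 512), torch.randn(4, 512)))
```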
Who they compared against:
- Closed-source generators: Suno-v5, Suno-v4.5, Mureka-V7.6, Udio-v1.5, MiniMax-Music-2.0.
- Open-source generators: LeVo, YuE, DiffRhythm 2, ACE-Step.
- Codecs: SemantiCodec, XCodec, MuCodec, LeVo.
Scoreboard highlights with context:
- HeartMuLa achieved lowest PER across five languages. English PER = 0.09 (compared to 0.13 for Suno-v5 and 0.13 for MiniMax-2.0). Think of PER like reading lyrics out loud without stumbling; 0.09 is like a strong A when others are closer to B.
- Musical quality was consistently high: On the English HeartBeats benchmark, SongEval averages around 4.48 out of 5, near top-tier systems, with especially strong structure and naturalness. Across Chinese, Japanese, Korean, and Spanish, performance stayed steady, unlike many open-source baselines that dip outside English.
- Style adherence (Tag-Sim) was competitive (around 0.26 on English), and DPO variants could push it higher (Muq-DPO peak ~0.284), showing prompts reliably steer the vibe.
Codec results:
- HeartCodec set new marks: top VISQOL and lowest FAD/FD among compared codecs, while keeping STOI/PESQ in the top tier. After ReFlow and SQ-decoder fine-tuning, metrics improved further, confirming both speed and fidelity gains.
- Latent choice matters: Using SQ-Codec latents outperformed 1D VAE and Mel VAE in subjective tests for Vocal Similarity and Melody Naturalness, balancing quality and speed. Mel VAE had a higher real-time factor (~0.397), making it impractical for streaming.
Ablations worth noting:
- ReFlow distillation and SQ-decoder fine-tuning not only improved codec metrics but also lifted downstream generation scores (AudioBox, SongEval, Tag-Sim), proving that better detokenization translates into better music.
- Guidance scale: While a scale of 1.5 improved some objective metrics, 1.25 sounded more natural to human listeners (smoother vocals and mix), so the default favors listening comfort.
- Training stages: Pretrain → SFT → DPO steadily improved results. For English, going from SFT to DPO further dropped PER (from 0.1005 to 0.0687) and maintained high quality.
Inference efficiency:
- With KV-cache, FlashAttention, and CUDA Graph, end-to-end latency fell ~5.4× (398 s → 73 s), and streaming achieved ~68 s with the best PER (0.0778). Quality metrics stayed within expected variance, confirming no trade-off in fidelity.
Surprises and takeaways:
- A 3B global + 300M local model, trained on academic-scale resources, can approach commercial-grade performance when paired with the right tokenizer and training recipe.
- Dropping to 12.5 Hz did not wreck detail; with multi-level features and strong decoding, quality held up and sometimes surpassed baselines.
- DPO with dimension-specific preference sets (PER, style, quality) works: each target improves without building a heavy reward model.
Bottom line: Clearer lyrics, solid musicality and structure, reliable style following, strong codec fidelity, and practical speed—all in an open-stack package.
05 Discussion & Limitations
Limitations:
- Exact voice timbre cloning is intentionally avoided (by using MuQ-MuLan style embeddings without timbre). If your goal is precise singer imitation, this system won’t do that.
- Extreme polyphony or unusual production effects can still trip up lyric clarity or balance. While PER is low on average, difficult mixes may blur some syllables.
- Style tags guide global vibe; fine control over micro-arrangement (e.g., precise drum fills, chord substitutions) still requires iteration or future controls.
- Real-time on very small devices remains challenging; the system is optimized for GPUs, not edge hardware.
- Data coverage: While multilingual, performance outside the five evaluated languages is less certain and may need targeted fine-tuning.
Required resources:
- Training used tens of A100 GPUs across stages; reproducing full training is substantial. Inference is much lighter but still benefits from a modern GPU for long-form generation.
- Datasets: high-quality, well-aligned lyrics–audio pairs and style annotations matter; HeartTranscriptor and filtering pipelines help but are part of the needed stack.
When not to use:
- If you must clone a specific singer’s timbre (not supported by design).
- If you need deterministic, measure-by-measure music theory adherence (e.g., strict counterpoint) without post-editing.
- If deployment must run on low-power mobile CPUs in real time.
Open questions:
- Safety and watermarking: How to standardize robust, lossless provenance marks across platforms?
- Finer control: Can we expose tempo curves, chord progressions, and arrangement tracks as editable controls while staying simple for users?
- Co-creation loops: What’s the best way to let humans and the model trade ideas live (DAW integration, section-by-section refinement)?
- Fairness and coverage: How to broaden cultural styles and languages further while keeping alignment and lyric clarity strong?
- Theory-aware modeling: Can explicit harmony/rhythm constraints or symbolic guides push structure even higher without killing creativity?
06 Conclusion & Future Work
Three-sentence summary: HeartMuLa is an open family of music foundation models—HeartCLAP, HeartTranscriptor, HeartCodec, and a hierarchical generator—that together understand prompts, compress music into smart tokens, and generate long, controllable, high-fidelity songs. The key is a 12.5 Hz, semantic-rich tokenizer plus a global–local language model trained with progressive stages and preference alignment, delivering clear lyrics, coherent structure, and practical speed. Extensive benchmarks show state-of-the-art codec quality and competitive song generation across five languages, all in an open, reproducible stack.
Main achievement: Proving that Suno-level, commercial-grade capabilities can be approached with an open, academic-scale pipeline by co-designing an ultra-low-frame-rate tokenizer and a hierarchical, multi-condition generator, then engineering it for speed without sacrificing quality.
Future directions:
- Scale the global backbone (e.g., 7B) and explore mixture-of-experts for efficiency.
- Add musician-friendly controls (chords, tempo maps, stems) and tighter DAW integration.
- Expand multilingual coverage and style catalogs with richer evaluations.
- Strengthen safety and watermarking standards and provide creator tools for attribution.
Why remember this: It’s a blueprint for practical, controllable song generation—compress smartly, plan globally, paint locally, align with human preferences, and engineer for speed—delivered openly so the community can learn, build, and make more music together.
Practical Applications
- Songwriting assistant: Paste structured lyrics, set section styles, and generate a multi-minute demo to refine.
- Video background music: Use the short, engaging mode to quickly produce catchy, loopable tracks.
- Genre exploration: Try the same lyrics with different tag sets (e.g., lo-fi, rock, EDM) to find the best fit.
- Multilingual demos: Draft songs in English, Chinese, Japanese, Korean, or Spanish with clear lyrics.
- Music education: Show how structure tags ([intro], [verse], [chorus]) affect musical form in practice.
- A/B style testing: Compare generations using different tag combinations and pick the one with higher Tag-Sim.
- Data cleanup: Use HeartTranscriptor to filter misaligned lyric–audio pairs when building training sets.
- Library search: Use HeartCLAP embeddings to organize and retrieve music by natural-language queries.
- DAW prototyping: Generate sections separately (intro/verse/chorus) and stitch them into production sessions.
- Rapid iteration: Adjust tags or a short reference and regenerate to fine-tune the vibe without starting over.