SAM Audio: Segment Anything in Audio
Key Summary
- SAM Audio is a new AI that can pull out exactly the sound you want from a noisy mix using text, clicks on a video, and time ranges, together or separately.
- It uses a generative engine (a diffusion transformer with flow matching) that builds the cleaned sound step by step instead of just erasing parts.
- You can say 'dog barking,' click on the dog in the video, or mark 1–2 seconds on the timeline; doing more than one makes the result even cleaner.
- The model works in a compact 'latent' audio space (DAC-VAE) so it keeps sound quality high while running fast.
- A helper model predicts time spans from text, which boosts text-only prompting without any extra human labels.
- It separates both the target track and the 'everything else' track at the same time, so you can either isolate or remove sounds.
- On many tests across speech, music, instruments, and everyday sounds, it beats both general-purpose and specialized systems.
- A new benchmark (SAM Audio-Bench) measures text, visual, and time prompts on real, in-the-wild audio and video.
- A new judge model (SAJ) scores separation quality without reference tracks and matches human opinions closely.
- It can handle long recordings smoothly using a multi-diffusion stitching trick that avoids audible seams.
Why This Research Matters
Audio is everywhere (classrooms, meetings, music, videos), and it's often messy. SAM Audio lets anyone say what they want to hear, point to where it is, and mark when it happens, then instantly get that sound cleanly. This helps students understand lessons better, creators remix music and videos, and people with hearing challenges focus on the right voice. It saves time for editors who used to struggle with complex software or expensive studio tools. It also gives researchers and developers a fair, realistic way to test audio tools, even when no clean reference tracks exist. In short, it puts studio-grade control into everyday hands.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how in a classroom, lots of kids talk at once and it's hard to hear your friend? Now imagine your computer trying to pick out just your friend's voice from all that noise.
🥬 The World Before: For years, AI learned to split mixed audio into pieces, like separating vocals from drums, or one speaker from another. These systems were great when the categories were fixed: 'always give me vocals, drums, bass' or 'always give me two speakers.' But real life isn't that simple. Sounds overlap, rules change, and people want to pull out very specific things, like 'the squeak of a chair' or 'the girl on the left speaking.' Before this paper, many models either worked only for a few fixed categories (like music stems) or only took one kind of instruction (often just text), and they didn't generalize well to tricky cases like professionally mixed music or crowded conversations.
Failed Attempts: People first tried 'promptless' methods: the model splits into a fixed set of parts every time. That worked for narrow jobs (e.g., speech enhancement), but fell apart when users asked for something outside the fixed list. Then came 'text-prompted' systems that listen for 'dog barking' or 'woman speaking.' That opened the door to any sound you can describe, but text alone struggled in some cases: How do you separate 'the boy's voice on the right' if there are two boys? Or separate two nearly identical whooshes in an action scene? Visual-prompted models helped some (click the dog), but those were often tested on small or synthetic datasets, and real videos have off-screen sounds and messy scenes. Also, we lacked good, fair tests: music benchmarks used studio stems with fixed categories; text-prompted models used unrelated test sets; and common metrics like SDR didn't match what people actually heard.
The Problem: We needed one flexible system that: (1) accepts different types of instructions (text for what, visual masks for where, and time spans for when); (2) works across many domains (speech, music, instruments, and everyday sounds); (3) stays faithful to the original sound quality; and (4) can be judged fairly, even when no clean reference tracks exist.
The Gap: No existing model unified text, visual, and time prompts in a single framework at scale, while matching or beating the best domain specialists. Also, the field lacked a realistic, multimodal benchmark and a trustworthy, reference-free scoring method that aligned with human ears.
Real Stakes: Why care? Because this is everywhere in life: making Zoom calls clearer, helping hearing aids focus on the right voice, isolating instruments for music education and remixing, removing background music from a vlog, highlighting the squeal of a car part for diagnostics, or cleaning archival audio. In classrooms, studios, and phones, people want to say 'that one, right there, at this time' and get it, reliably.
🍞 Anchor: Imagine editing a school talent show video. You want only the piano, not the crowd. You type 'piano,' click the pianist, and select the 0:10–0:30 part. The system returns a clean piano track, plus another track with everything else, ready for your edit.
02 Core Idea
🍞 Hook: Imagine you're wearing smart headphones during recess. You can whisper three clues to them: what to hear (text), where it is (point to it), and when it happens (time). They lock onto that sound perfectly.
🥬 The 'Aha!' Moment (one sentence): Combine three simple kinds of prompts (text, visual masks, and time spans) inside one generative model so it can separate any target sound, in any domain, with precision.
Multiple Analogies:
- Treasure hunt: Text is the map label ('gold coins'), the visual mask is the X on the map (where), and the time span is 'only search at sunset' (when). Using all three makes finding treasure easy.
- Cooking show: Text says 'chop the onions' (what), the camera highlights the onions on the cutting board (where), and the timer says 'do it in minutes 2–3' (when). The meal turns out right because the instructions are complete.
- Librarian: You ask for 'books about whales' (what); you point to the shelf (where); and say 'from the 1990–2000 section' (when). The librarian finds the exact book fast.
Before vs After:
- Before: You usually picked just text or had fixed categories. It was easy to confuse lookalike or soundalike things and hard to handle real-world messiness.
- After: You can combine what-where-when. The model separates subtle and overlapping sounds, even in professional music and noisy scenes, and also returns a clean 'everything else' track.
Why It Works (intuition, no equations):
- Generative separation treats the target as something to 'rebuild' carefully, not just to 'erase' with a blunt mask. That preserves tone and texture.
- Flow matching with a diffusion transformer learns a smooth path from noisy guesses to clean audio, guided by your prompts.
- Working in a compact latent space (DAC-VAE) carries rich detail with fewer steps, so quality stays high and speed stays reasonable.
- Time spans steer attention to the right moments; visual masks ground the right object; text sets the meaning. Together, they remove ambiguity.
Building Blocks (in simple pieces, each introduced with a mini 'sandwich'):
🍞 You know how giving written directions helps friends find your house? 🥬 Text Prompts: What it is: Short phrases like 'dog barking' that tell the model what sound to pull out. How it works: (1) Read the text, (2) turn it into features, (3) guide the model to favor audio matching those features. Why it matters: Without it, the model guesses blindly about 'what' to find. 🍞 Anchor: Type 'female speech' to extract her voice in a noisy café.
🍞 Imagine you circle the exact toy you want in a photo to show your parent. 🥬 Visual Prompts: What it is: A mask over the video that shows where the sound source is. How it works: (1) You click to select the object, (2) the model reads features from those pixels, (3) it links that region to the matching sound. Why it matters: Without 'where,' two similar sounds (two guitars) get mixed up. 🍞 Anchor: Click the drummer so you get just the drums.
🍞 Think of a referee's whistle that only matters when it blows, not all the time. 🥬 Temporal Span: What it is: Time intervals that mark when the target sound is active. How it works: (1) Turn spans into a frame-by-frame on/off signal, (2) feed it alongside the audio, (3) the model focuses on those moments. Why it matters: Without 'when,' overlapping sounds are harder to tease apart. 🍞 Anchor: Mark 1–2 seconds to grab only the door slam.
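For the curious, here is a tiny sketch of that frame-by-frame on/off signal, assuming the ~25 frames-per-second latent rate mentioned later; the function name and conventions are illustrative, not SAM Audio's actual code.

```python
import numpy as np

def spans_to_frame_mask(spans, duration_s, frame_rate=25.0):
    """Rasterize (start, end) intervals in seconds into a per-frame on/off vector."""
    num_frames = int(round(duration_s * frame_rate))
    mask = np.zeros(num_frames, dtype=np.float32)
    for start_s, end_s in spans:
        lo = max(0, int(np.floor(start_s * frame_rate)))
        hi = min(num_frames, int(np.ceil(end_s * frame_rate)))
        mask[lo:hi] = 1.0  # frames where the target sound is marked active
    return mask

# Mark 1-2 seconds of a 10-second clip: 25 of the 250 frames switch on.
print(int(spans_to_frame_mask([(1.0, 2.0)], duration_s=10.0).sum()))  # 25
```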
🍞 Picture a secret shorthand note that stores a big story in a tiny space. 🥬 Latent Representation (for audio): What it is: A compact way to store the important parts of a sound. How it works: (1) Encode the waveform into smaller feature tokens, (2) do the hard work there, (3) decode back to audio. Why it matters: Without latents, the model is slower and can lose detail. 🍞 Anchor: Zip a giant song file to make it easier to send, then unzip it later.
🍞 Imagine packing a suitcase smartly so everything fits without wrinkles. 🥬 DAC-VAE: What it is: The special audio suitcase (encoder–decoder) used here. How it works: (1) Encode audio into smooth, Gaussian-like features, (2) let the separation model work in that space, (3) decode clean results. Why it matters: Without DAC-VAE, the model would be bulkier and less precise. 🍞 Anchor: It's like neatly rolling clothes so you can pack more and find things faster.
🍞 Picture a teacher who listens to your words, watches your pointing, and checks the time you say, then helps exactly with that part. 🥬 Multimodal Prompting: What it is: Using text (what), visual masks (where), and spans (when) together. How it works: (1) Read all prompts, (2) align them with the audio, (3) separate the matching source. Why it matters: Without combining modes, tricky scenes stay ambiguous. 🍞 Anchor: Say 'piano,' click the piano, and select 10–20 s; you get a perfect piano stem.
🍞 Think of a sculptor who starts with a rough block and slowly chisels it into a statue. 🥬 Diffusion Transformer: What it is: A model that improves a rough guess step by step. How it works: (1) Start from a noisy latent, (2) predict how to nudge it toward the right sound, (3) repeat for a few steps. Why it matters: Without gradual refinement, details and realism suffer. 🍞 Anchor: Each step polishes the piano tone a bit more until it sounds right.
🍞 Imagine charting a smooth path down a mountain instead of bouncing randomly. 🥬 Flow Matching: What it is: A way to learn the best gentle path from noise to clean sound. How it works: (1) Learn tiny velocity steps, (2) integrate them over time, (3) arrive at the separated audio. Why it matters: Without a smooth path, results can be slow or gritty. 🍞 Anchor: It's like planning switchbacks that safely guide you downhill.
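A minimal sketch of the flow-matching training idea described above: pick a random point on a straight path between noise and the clean latents and teach a network to predict the velocity of that path. The tiny stand-in model and shapes are assumptions for illustration, not the paper's architecture.

```python
import torch

def flow_matching_loss(model, clean_latents, cond=None):
    """Conditional flow matching with a straight noise-to-data path (illustrative)."""
    x1 = clean_latents                 # clean target latents
    x0 = torch.randn_like(x1)          # pure noise
    t = torch.rand(x1.shape[0], 1, 1)  # random time in [0, 1] per example
    xt = (1 - t) * x0 + t * x1         # point on the straight path at time t
    v_target = x1 - x0                 # the path's constant velocity
    v_pred = model(xt, t, cond)        # network predicts that velocity
    return torch.mean((v_pred - v_target) ** 2)

# Toy check with a stand-in model that ignores its inputs.
toy_model = lambda x, t, c: torch.zeros_like(x)
print(flow_matching_loss(toy_model, torch.randn(4, 64, 250)))
```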
🍞 Suppose you underline only the lines in a script when your character speaks. 🥬 Span Prompting: What it is: Feeding the model those underlined time ranges. How it works: (1) Turn spans into on/off tokens, (2) add them to each frame, (3) guide the model's focus. Why it matters: Without it, you might also capture others' lines. 🍞 Anchor: Grab only the 'meow' moments from a busy street.
🍞 Think of a librarian who checks if a book matches a topic card. 🥬 CLAP (text–audio similarity): What it is: A tool that checks if audio and text belong together. How it works: (1) Embed text and audio, (2) compare them, (3) higher means a better match. Why it matters: Without it, auto-captioned prompts might be off-topic. 🍞 Anchor: 'Dog barking' should match a bark, not a bell.
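The matching check itself is simple in spirit: embed both sides and measure cosine similarity. The embeddings below are random stand-ins; in practice they would come from a pretrained CLAP checkpoint.

```python
import torch
import torch.nn.functional as F

def text_audio_similarity(text_emb, audio_emb):
    """Cosine similarity between text and audio embeddings; higher means a better match."""
    return F.cosine_similarity(text_emb, audio_emb, dim=-1)

# Stand-in embeddings (a real CLAP model would produce these from 'dog barking' and the clip).
text_emb = F.normalize(torch.randn(1, 512), dim=-1)
audio_emb = F.normalize(torch.randn(1, 512), dim=-1)
print(text_audio_similarity(text_emb, audio_emb).item())
```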
🍞 Picture a bridge that lets pictures and sounds meet in the middle. 🥬 ImageBind: What it is: A model that aligns images (or masked image regions) and audio. How it works: (1) Embed both, (2) measure how close they are, (3) higher means a better audio–visual match. Why it matters: Without it, visual cues might not match the sound you extracted. 🍞 Anchor: A highlighted violin region should match a violin sound.
🍞 Anchor: Put it all together: type 'violin,' click the violinist, mark 5–12 seconds, and get a clean violin track, even in a busy orchestra.
03 Methodology
🍞 Hook: Imagine making a smoothie. You add fruit (the input audio), say what flavor you want (text), point at the exact fruit (visual), and choose the blending time (span). Then the blender makes exactly the taste you asked for, and it also saves a separate jar with 'everything else.'
🥬 Overview (high level): Input mixture → encode to latent audio → read prompts (text, visual, spans) → diffusion transformer with flow matching refines target and residual together → decode both tracks → output: target stem + residual stem.
Step-by-step (like a recipe):
- Encode the mixture into latents (DAC-VAE).
- What happens: The raw waveform becomes a compact sequence of features at ~25 frames per second. This keeps audio quality high while making computation efficient.
- Why it exists: Without compact latents, the model is slower and more likely to lose crisp details.
- Example: A 10-second concert clip turns into neat feature tokens ready for smart processing.
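A back-of-the-envelope sketch of why this step helps, assuming the ~25 frames-per-second rate above; the channel count is an illustrative placeholder, not the paper's exact value.

```python
def latent_size(duration_s, frame_rate=25.0, latent_dim=128, sample_rate=48_000):
    """Compare raw audio samples with the compact latent sequence the model works on."""
    raw_samples = int(duration_s * sample_rate)
    latent_frames = int(round(duration_s * frame_rate))
    return raw_samples, (latent_frames, latent_dim)

raw, latent = latent_size(10.0)
print(raw, latent)  # 480000 raw samples vs. a (250, 128) latent grid
```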
- Read the prompts.
- Text (what): A T5 encoder turns 'woman speaking' into features. These feed cross-attention inside the transformer.
- Visual (where): A visual encoder (PE) reads only the masked region you selected (from SAM2), frame-aligned to the audio.
- Span (when): Time intervals become a frame-by-frame sequence of on/off tokens (active vs silent).
- Why it exists: Each prompt adds a different clue. Without any one of them, ambiguity rises.
- Example: Text says 'piano,' the mask highlights the pianist's hands, and the span marks 8–15 s.
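One way this prompt bundle could be packaged, shown as a sketch: the text branch uses the standard Hugging Face T5 encoder interface, while the visual features and span mask are placeholders for the PE/SAM2 and span pipelines described above ('t5-base' is just an example checkpoint).

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers import AutoTokenizer, T5EncoderModel

@dataclass
class Prompts:
    text_feats: Optional[torch.Tensor]    # (1, text_len, d) tokens for cross-attention
    visual_feats: Optional[torch.Tensor]  # (1, num_frames, d_vis) from the masked region
    span_mask: Optional[torch.Tensor]     # (1, num_frames) frame-level on/off signal

tokenizer = AutoTokenizer.from_pretrained("t5-base")
text_encoder = T5EncoderModel.from_pretrained("t5-base")
tokens = tokenizer("woman speaking", return_tensors="pt")

prompts = Prompts(
    text_feats=text_encoder(**tokens).last_hidden_state,
    visual_feats=None,  # would come from a PE-style encoder over the SAM2 mask
    span_mask=None,     # would come from labeled or predicted spans
)
```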
- Join target and residual in one pass.
- What happens: The model represents the target and the 'everything else' together by concatenating their latents. It predicts both at the same time.
- Why it exists: Joint prediction helps keep perfect bookkeeping (target + residual ≈ mixture) and improves consistency.
- Example: You get both 'piano-only' and 'non-piano' tracks that add back to the original.
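A sketch of that joint bookkeeping: carry the two stems as one tensor so the model predicts them together, and check that the decoded stems roughly add back to the mixture. Shapes are illustrative.

```python
import torch

# Target and 'everything else' latents, predicted side by side (batch, channels, frames).
target_lat = torch.randn(1, 128, 250)
residual_lat = torch.randn(1, 128, 250)
joint_lat = torch.cat([target_lat, residual_lat], dim=1)  # one tensor, both stems

def consistency_error(target_wav, residual_wav, mixture_wav):
    """After decoding, target + residual should approximately reconstruct the mixture."""
    return torch.mean((target_wav + residual_wav - mixture_wav) ** 2)
```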
- Diffusion transformer with flow matching refines audio.
- What happens: Start from a noisy guess in latent space, then repeatedly nudge it toward the clean target/residual using a learned 'velocity' at each small step.
- Why it exists: Gentle, learned steps preserve timbre and detail better than blunt masking.
- Example: After a handful of steps, the piano tone becomes bright and clear, without wiping out natural room sound.
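The refinement loop can be pictured as a few Euler steps along the learned velocity field; `model` and the conditioning bundle here are placeholders for the actual diffusion transformer and prompts.

```python
import torch

@torch.no_grad()
def refine_latents(model, cond, shape, num_steps=16):
    """Integrate the learned velocity from noise (t=0) toward clean latents (t=1)."""
    x = torch.randn(shape)                           # noisy starting guess in latent space
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((shape[0], 1, 1), step * dt)  # current time for each example
        x = x + dt * model(x, t, cond)               # nudge along the predicted velocity
    return x                                         # refined target/residual latents
```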
- Auxiliary alignment to 'think about the right thing at the right time.'
- What happens: An extra head nudges internal features to match an Audio Event Detection (AED) model's embeddings of the true target over time.
- Why it exists: It teaches the model 'what and when' the event is, improving prompt-following and timing.
- Example: When the dog barks at 1.2 s, the model's hidden state lines up with 'bark' features right then.
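One plausible way to implement such an alignment head, shown as a hedged sketch: project the transformer's hidden states into the AED embedding space and penalize frame-wise disagreement. The projection and cosine loss are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AEDAlignmentHead(nn.Module):
    """Scores how well hidden states track time-aligned AED embeddings of the target."""
    def __init__(self, hidden_dim=512, aed_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, aed_dim)

    def forward(self, hidden_states, aed_embeddings):
        # hidden_states: (B, T, hidden_dim); aed_embeddings: (B, T, aed_dim), frame-aligned.
        projected = self.proj(hidden_states)
        # Low loss when the model 'thinks about' the right event at the right frames.
        return (1 - F.cosine_similarity(projected, aed_embeddings, dim=-1)).mean()
```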
- Span prediction that boosts text-only prompting.
- What happens: A helper model (PE-A-Frame) reads the mixture + your text (e.g., 'violin') and predicts which frames likely contain that sound. The system turns these into a span sequence and feeds it alongside the text.
- Why it exists: Humans rarely label exact start/stop times, but spans make separation cleaner. Prediction gives the benefit without extra labeling cost.
- Example: Text 'snare drum' + predicted spikes around 5–7 s = crisp snare extraction.
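Turning a frame-level prediction into spans is straightforward; here is a sketch with an illustrative threshold (the real PE-A-Frame detector and any smoothing are not shown).

```python
import numpy as np

def probs_to_spans(frame_probs, frame_rate=25.0, threshold=0.5):
    """Convert per-frame activity probabilities into (start_s, end_s) spans."""
    active = frame_probs >= threshold
    spans, start = [], None
    for i, is_on in enumerate(active):
        if is_on and start is None:
            start = i
        elif not is_on and start is not None:
            spans.append((start / frame_rate, i / frame_rate))
            start = None
    if start is not None:  # close a span that runs to the end of the clip
        spans.append((start / frame_rate, len(active) / frame_rate))
    return spans

probs = np.zeros(250)
probs[125:175] = 0.9            # detector fires around 5-7 s of a 10 s clip
print(probs_to_spans(probs))    # [(5.0, 7.0)]
```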
- Long recordings with multi-diffusion.
- What happens: For long audio, the model processes overlapping windows in parallel and blends them at every refinement step using soft masks.
- Why it exists: Chunking alone causes seams; one-shot runs out of memory. Multi-diffusion keeps things smooth and scalable.
- Example: A 1-minute band rehearsal comes out coherent, with no clicks between segments.
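A sketch of the overlap-and-blend merge, with a triangular weight as an assumed choice of soft mask; in multi-diffusion this blend runs after every refinement step, not just once at the end.

```python
import numpy as np

def blend_windows(window_outputs, window_starts, total_frames):
    """Merge overlapping per-window estimates with soft weights so no seam dominates.

    window_outputs: list of arrays of shape (channels, window_len)
    window_starts:  frame index where each window begins in the full timeline
    """
    channels, window_len = window_outputs[0].shape
    acc = np.zeros((channels, total_frames))
    norm = np.full(total_frames, 1e-8)
    weight = np.bartlett(window_len) + 1e-3   # ramps up, peaks mid-window, ramps down
    for out, start in zip(window_outputs, window_starts):
        acc[:, start:start + window_len] += out * weight
        norm[start:start + window_len] += weight
    return acc / norm
```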
- Candidate reranking for extra polish.
- What happens: The system can generate multiple candidates and pick the best using a combo of the SAM Audio Judge (SAJ) and CLAP/ImageBind scores.
- Why it exists: It squeezes extra quality out by preferring outputs that sound right and match prompts well.
- Example: Among 8 candidates, the one that best matches 'saxophone' and sounds cleanest wins.
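Reranking reduces to best-of-N selection under a combined score; the scoring callables and weights below are placeholders for SAJ and the CLAP/ImageBind matchers.

```python
def rerank(candidates, judge_score, prompt_score, w_judge=0.7, w_prompt=0.3):
    """Return the candidate with the best weighted mix of perceived quality and prompt match."""
    return max(candidates, key=lambda c: w_judge * judge_score(c) + w_prompt * prompt_score(c))

# e.g. best = rerank(samples, judge_score=saj.score, prompt_score=clap_similarity)
```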
Secret Sauce (what's clever):
- Unified prompting: text (what), visual (where), and span (when) in one backbone means fewer confusions and better control.
- Joint target+residual generation: Guarantees clean add-up and enables both isolation and removal without retraining.
- Flow matching in DAC-VAE space: Keeps detail and speed balanced; smooth paths lead to natural-sounding results.
- Span prediction from text: Upgrades text-only use without costly human timing labels.
- Multi-diffusion for long audio: Prevents boundary artifacts while staying memory-friendly.
Mini 'sandwich' explainers for the extra tools:
🍞 Ever listen to a nature CD and try to name each bird call as it happens? 🥬 Audio Event Detection (AED): What it is: A system that tags which sound events happen when. How it works: (1) Hear audio, (2) produce time-aligned feature tags, (3) help other models learn timing. Why it matters: Without AED-like guidance, timing can drift. 🍞 Anchor: It flags 'bark at 1.2 s,' guiding the separator.
🍞 Picture a fast-forward button that jumps through a movie in smooth steps. 🥬 Multi-diffusion: What it is: Overlap-and-blend windows at every refinement step for long audio. How it works: (1) Split with overlaps, (2) process all windows in sync, (3) softly merge at each step. Why it matters: Without it, you hear seams between chunks. 🍞 Anchor: A 60-second song stays seamless end-to-end.
🍞 Think of a fair judge at a talent show who doesn't need the original sheet music to tell if a performance was faithful. 🥬 SAM Audio Judge (SAJ): What it is: A model that scores how well the separated audio fits the prompt, with no reference needed. How it works: (1) Read the input, the output, and the text, (2) predict quality on axes like recall, precision, faithfulness, and overall, (3) correlate strongly with human ratings. Why it matters: Without SAJ, many real audios can't be fairly scored. 🍞 Anchor: It can tell if the 'violin' output really sounds like the violin from the mix.
🍞 Anchor: End-to-end, it's like telling a super-smart audio blender: 'Extract the violin here, at this time,' and it hands you two bottles, pure violin and everything else, ready for your project.
04 Experiments & Results
🍞 Hook: Imagine a science fair where each project is tested by real people, not just rulers and calculators, and the winning project works on everything from quiet libraries to rock concerts.
🥬 The Test: The team built SAM Audio-Bench, a big, realistic test with 10-second clips from real videos and audios (e.g., AudioSet, VGGSound, MUSIC, AVSpeech, CondensedMovies). Each clip comes with three prompt types: human-written text, human-drawn visual masklets (when the source is on screen), and human-labeled time spans. They measured performance in three ways: (1) human listening scores (overall quality and preferences), (2) SAM Audio Judge (reference-free, human-aligned scoring), and (3) alignment metrics like CLAP (text–audio) and ImageBind (audio–visual).
The Competition: SAM Audio faced strong general-purpose systems (AudioSep, FlowSep, SoloAudio, CLAPSep) and specialized models (for instruments: Demucs, Spleeter; for speech/speakers: MossFormer2, Tiger, FastGeCo; for commercial tools: AudioShake, MoisesAI, LalalAI, etc.). Visual baselines included DAVIS-Flow and CLIPSep, plus speaker-specific visual baselines.
Scoreboard with context:
- Text-prompted separation (general sounds, speech, speakers, music, instruments): SAM Audio's human scores were much higher, like getting an A+ where others got B's or C's. On instrument separation with professional music (MUSDB-style), it even edged out strong commercial tools.
- Visual-prompted separation: SAM Audio topped general visual baselines and matched or beat speaker/instrument-focused visual systems, especially when you needed instance-level control (e.g., two similar instruments on screen).
- Span prompting: Using spans alone helped for short, spiky sounds (like impacts) but not for long, continuous sounds (like speech/music). However, combining text + spans consistently boosted scores across domains. Predicted spans (from text) nearly matched ground-truth spans, delivering gains without extra labeling.
- Removal mode (residual): On music removal, SAM Audio outperformed existing commercial baselines and mirrored its strong extraction results, because it generates target and residual together.
Numbers made meaningful:
- Human Overall scores (OVR) for SAM Audio were around the 4+ range (out of 5) across many tasks, while several baselines sat closer to the 2–3 range for tough categories: think 'solid A' versus 'C+/B-.'
- Net Win Rate (NWR) comparisons showed SAM Audio beating others broadly, even challenging specialized domain leaders (e.g., instrument separation) and doing substantially better in mixed real-world scenes.
- SAJ (the judge model) correlated strongly with human scores (PCC often >0.8), while older metrics like SDR proxies lagged, proving SAJ is a useful stand-in when references don't exist.
Surprising Findings:
- Visual prompting works, but text prompting often scored higher because good, large-scale text–audio data is easier to gather than perfect audio–visual pairs. Still, visual masks were crucial when text was ambiguous (e.g., two male speakers talking at once).
- Predicted spans (from text) delivered most of the benefit of manually labeled spans, so everyday users can get cleaner results without doing extra timing work.
- The model stayed strong even with fewer refinement steps (fast mode), especially for short, sparse sounds, showing the mixture itself is a powerful guide.
- Long recordings stayed smooth with multi-diffusion, avoiding the 'seams' you often hear when stitching chunked audio.
🍞 Anchor: Think of a band rehearsal video: You type 'violin,' draw a quick mask over the violinist, and select 10–20 seconds. Listeners prefer SAM Audio's output, the judge model agrees, and even the alignment scores say 'yes, that's the violin.'
05 Discussion & Limitations
🍞 Hook: If a magic magnifying glass helps you read tiny text, you'll soon notice it still struggles on smudged pages. Even great tools have limits, and knowing them helps you use them wisely.
🥬 Honest Assessment: Limitations:
- Visual prompting is weaker than text prompting today because real videos often have off-screen sounds, noisy masks, or mismatched timing. More robust audio–visual grounding is needed.
- General sound effects are harder than focused domains (like speech) because they're diverse, overlap often, and can sound very similar (e.g., different whooshes). Subtle distinctions remain tricky.
- Highly overlapped, same-type sources (two identical instruments playing together) can still be challenging, even with visual help, especially if the object is partially visible or off-screen.
- Extreme studio tricks (heavy reverb, aggressive mastering) can blur source boundaries, making perfect separation difficult.
Required Resources:
- A GPU for fast inference (especially for large models) and the visual/text encoders. Long-form multi-diffusion benefits from memory and parallel compute.
- Optional reranking (SAJ + CLAP) adds a small compute overhead but improves quality.
When NOT to Use:
- If you only need simple noise reduction for a single speaker and have a tiny device budget, a lightweight speech enhancer may be sufficient.
- If the sound of interest never appears in text or vision (ambiguous) and has no clear time signature, prompting might not add value.
- If legal or ethical rules restrict modifying certain recordings (privacy or rights issues), avoid separation.
Open Questions:
- How can we build stronger, more reliable audio–visual grounding for off-screen or quiet sources?
- Can the model self-correct when prompts are wrong (e.g., the user typed 'guitar' but it's a ukulele)?
- What's the best way to separate many similar instruments at once, e.g., 'violin 1' vs. 'violin 2'?
- How can we teach fine-grained style control (keep natural room reverb vs remove it) consistently across domains?
- Can on-device, low-latency versions approach large-model quality for real-time use (hearing aids, AR/VR)?
🍞 Anchor: Think of using SAM Audio for a school podcast. It's fantastic at pulling out your host's voice and music beds, but if two identical singers overlap off-screen, you may still need a careful visual or time prompt (and good mic placement) to get perfect results.
06 Conclusion & Future Work
🍞 Hook: Imagine telling a super-helper exactly what sound you want, where it is in the video, and when it happens, and getting it back clean and ready to use.
🥬 3-Sentence Summary: SAM Audio is a foundation model that unifies text (what), visual masks (where), and time spans (when) to separate any target sound, while also producing the residual. Built on a diffusion transformer with flow matching in a DAC-VAE audio space, it preserves sound quality and works across speech, music, instruments, and everyday noises. It comes with a realistic benchmark (SAM Audio-Bench) and a strong, reference-free judge (SAJ) that aligns with human listeners.
Main Achievement: Turning audio separation into a flexible, multimodal, generative task, so users can guide the model with words, clicks, and timing, and outperforming both general-purpose and specialized systems in challenging, real-world conditions.
Future Directions: Strengthen audio–visual grounding for off-screen or quiet sources, refine control over style (reverb, room tone), improve multi-source disentanglement among very similar sounds, and bring high quality to lighter, real-time models for wearables and AR/VR.
Why Remember This: It's a leap from 'fixed, one-size-fits-all' separation to 'separate anything, your way,' making audio editing, accessibility, education, research, and creative work more precise and far more approachable for everyone.
🍞 Anchor: Next time you edit a class video, you can simply say 'piano,' click the pianist, mark 10–20 seconds, and get a beautiful piano stem plus the rest, no studio magic required.
Practical Applications
- Podcast cleanup: Extract the host's voice from a noisy café while keeping a natural tone.
- Lecture focus: Isolate a teacher's speech from classroom chatter to make study clips.
- Video editing: Remove background music from a vlog or isolate a sound effect for emphasis.
- Music education: Solo an instrument (e.g., violin) from a performance for practice or analysis.
- Film post-production: Separate specific foley (e.g., footsteps) or dialogue regions with visual clicks and spans.
- Accessibility: Help hearing aids focus on a chosen speaker in multi-talker environments.
- Customer support: Isolate machine noises (e.g., squeaks) from recordings for faster diagnostics.
- Forensics and archival work: Extract target voices or sounds from historical or field recordings without altering authenticity.
- Live event enhancement: Feed instrument-specific stems to mixers for clearer sound reinforcement.
- Content moderation and search: Pull out and analyze specific sound events (e.g., alarms) across large media libraries.