TV9

Meta’s SAM Audio can isolate voices and noise from videos in one click

Meta has introduced SAM Audio, a new AI model that lets users isolate specific sounds from complex audio using text, visual cues, or time markers. The tool expands Meta's Segment Anything work beyond images and video into audio. SAM Audio is available through the Segment Anything Playground starting today.

Meta launches SAM Audio to separate sound using text, video, and time prompts
Updated on: Dec 17, 2025 | 12:40 PM

New Delhi: Meta has added a new chapter to its Segment Anything story, this time focused on sound. The company has introduced SAM Audio, a new artificial intelligence model designed to separate and isolate sounds from complex audio. The announcement positions audio as the next frontier after images and video, an area that has long remained messy and tool-heavy for creators and researchers.

For anyone who has struggled to clean background noise from a video or podcast, the idea feels familiar. Audio editing often means jumping between tools, filters, and guesswork. Meta says SAM Audio aims to simplify that process by letting people interact with sound in ways that feel natural, using text, visuals, or time markers, according to details shared in its official blog post.


What Meta’s SAM Audio actually does

Meta describes SAM Audio as “a first-of-its-kind model for segmenting sound.” In its words, “We’re introducing SAM Audio, a state-of-the-art unified model that transforms audio processing by making it easy to isolate any sound from complex audio mixtures using natural, multimodal prompts.”

At a basic level, SAM Audio lets users pull out a specific sound from audio that has many things happening at once. A guitar from a band performance. A voice from traffic noise. A barking dog from an entire podcast recording.

The model supports three types of prompts, which can work alone or together.

  • Text prompts, such as typing “dog barking” or “singing voice”
  • Visual prompts, made by clicking on a person or object in a video
  • Span prompts, where users mark the time segments in which a sound appears
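To make the three prompt types concrete, here is a small hypothetical sketch. This is not Meta's actual SAM Audio API (the source article does not document one); the `SeparationPrompt` class and its fields are invented purely to illustrate how text, visual, and span prompts could be combined into a single request.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical illustration only -- NOT Meta's real SAM Audio interface.
# It models the three prompt types the article describes, which Meta says
# can be used alone or together.

@dataclass
class SeparationPrompt:
    text: Optional[str] = None                  # e.g. "dog barking"
    click_xy: Optional[Tuple[int, int]] = None  # pixel clicked in a video frame
    span: Optional[Tuple[float, float]] = None  # (start_sec, end_sec) of the sound

    def describe(self) -> str:
        """Summarize which prompt types this request combines."""
        parts = []
        if self.text:
            parts.append(f"text='{self.text}'")
        if self.click_xy:
            parts.append(f"visual=click at {self.click_xy}")
        if self.span:
            parts.append(f"span={self.span[0]}s-{self.span[1]}s")
        return " + ".join(parts) or "no prompt"

# A combined text + span request, as the article says the model supports:
p = SeparationPrompt(text="singing voice", span=(12.0, 34.5))
print(p.describe())  # text='singing voice' + span=12.0s-34.5s
```

The point of the sketch is the combination: a user could type what they want, click where it is on screen, and mark when it happens, all in one request.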

Meta calls span prompting an industry first, saying it helps fix issues across an entire clip instead of frame by frame.

The engine behind the model

Under the hood, SAM Audio runs on something Meta calls Perception Encoder Audiovisual, or PE-AV. Meta explains it using a simple analogy, calling PE-AV “the ears” and SAM Audio “the brain.”

PE-AV builds on the open source Perception Encoder model Meta released earlier this year. It aligns video frames with audio at precise moments in time, allowing the system to understand what is being seen and heard together. This matters when the sound source is on screen, like a speaker or instrument, and even when it is off screen but hinted at by the scene.

Meta says PE-AV was trained on over 100 million videos using large-scale multimodal learning, pulling from open datasets and synthetic captioning pipelines.

Meta’s new AI model lets users pick and separate any sound | Source: Meta AI

Benchmarks and judging sound quality

Alongside the model, Meta also released tools to measure performance. One is SAM Audio-Bench, described as the first in-the-wild audio separation benchmark. Unlike older datasets that rely on synthetic mixes, this benchmark uses real audio and video with text, visual, and time-based prompts.

Another is SAM Audio Judge, an automatic judge model designed to assess audio quality without needing reference tracks. Meta says this approach mirrors how humans actually judge sound, instead of relying only on technical comparisons.
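The distinction between reference-based and reference-free judging can be illustrated with a toy example. Nothing below reflects how SAM Audio Judge actually works; both functions are invented stand-ins. The first needs the clean ground-truth track, which rarely exists for real-world audio; the second scores the output alone, using a crude cue (clipping) that a listener could notice without ever hearing the original.

```python
import math

# Toy contrast between the two evaluation styles the article mentions.
# These metrics are illustrative assumptions, not Meta's.

def snr_db(reference, estimate):
    """Reference-based: signal-to-noise ratio against a known clean track."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10(signal / noise) if noise else float("inf")

def reference_free_score(estimate):
    """Reference-free toy heuristic: fraction of samples that are not clipped."""
    clipped = sum(1 for x in estimate if abs(x) >= 1.0)
    return 1.0 - clipped / len(estimate)

clean = [0.1, 0.5, -0.3, 0.2]       # ground truth (often unavailable in the wild)
separated = [0.1, 0.5, -0.3, 0.25]  # a model's separated output
print(round(snr_db(clean, separated), 1))  # needs the clean track
print(reference_free_score(separated))     # judges the output by itself
```

A real reference-free judge would of course use learned perceptual cues rather than a clipping count, but the interface difference is the point: one function takes two tracks, the other takes one.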

As someone who once spent an entire evening adjusting audio levels for a short interview clip, the idea of reference-free judging feels oddly relatable. You usually know when something sounds off, even if you cannot explain why.

Where Meta sees this being used

Meta says SAM Audio could shape tools across music, podcasting, television, film, and accessibility. The company also confirmed partnerships with Starkey, a hearing aid manufacturer, and 2gether-International, a startup accelerator for disabled founders, to explore accessibility use cases.

The model runs faster than real time and comes in large parameter sizes, though Meta admits there are limits. Audio cannot be used as a prompt yet, and separating very similar sounds, like one singer in a choir, remains difficult.

SAM Audio is available starting today through the Segment Anything Playground, where users can upload their own audio or video to try it out. Meta has also made the model available for download.
