TV9

Meta’s SAM Audio can isolate voices and noise from videos in one click

Meta has introduced SAM Audio, a new AI model that lets users isolate specific sounds from complex audio using text, visual cues, or time markers. The tool expands Meta's Segment Anything work beyond images and video into audio. SAM Audio is available through the Segment Anything Playground starting today.

Meta launches SAM Audio to separate sound using text, video, and time prompts
Updated on: Dec 17, 2025 | 12:40 PM

New Delhi: Meta has added a new chapter to its Segment Anything story, this time focused on sound. The company has introduced SAM Audio, a new artificial intelligence model designed to separate and isolate sounds from complex audio. The announcement positions audio as the next frontier after images and video, an area that has long remained messy and tool-heavy for creators and researchers.

For anyone who has struggled to clean background noise from a video or podcast, the idea feels familiar. Audio editing often means jumping between tools, filters, and guesswork. Meta says SAM Audio aims to simplify that process by letting people interact with sound in ways that feel natural, using text, visuals, or time markers, according to details shared in its official blog post.


What Meta’s SAM Audio actually does

Meta describes SAM Audio as “a first-of-its-kind model for segmenting sound.” In its words, “We’re introducing SAM Audio, a state-of-the-art unified model that transforms audio processing by making it easy to isolate any sound from complex audio mixtures using natural, multimodal prompts.”

At a basic level, SAM Audio lets users pull out a specific sound from audio that has many things happening at once. A guitar from a band performance. A voice from traffic noise. A barking dog from an entire podcast recording.

The model supports three types of prompts, which can work alone or together.

  • Text prompts, such as typing “dog barking” or “singing voice”
  • Visual prompts, made by clicking on a person or object in a video
  • Span prompts, where users mark the time segments in which a sound appears
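To make the three prompt types concrete, here is a small hypothetical sketch. This is not Meta's actual SAM Audio API (the source article does not document one); the `SeparationPrompt` class and its fields are invented purely to illustrate how text, visual, and span prompts could be combined into a single request.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical illustration only -- NOT Meta's real SAM Audio interface.
# It models the three prompt types the article describes, which Meta says
# can be used alone or together.

@dataclass
class SeparationPrompt:
    text: Optional[str] = None                  # e.g. "dog barking"
    click_xy: Optional[Tuple[int, int]] = None  # pixel clicked in a video frame
    span: Optional[Tuple[float, float]] = None  # (start_sec, end_sec) of the sound

    def describe(self) -> str:
        """Summarize which prompt types this request combines."""
        parts = []
        if self.text:
            parts.append(f"text='{self.text}'")
        if self.click_xy:
            parts.append(f"visual=click at {self.click_xy}")
        if self.span:
            parts.append(f"span={self.span[0]}s-{self.span[1]}s")
        return " + ".join(parts) or "no prompt"

# A combined text + span request, as the article says the model supports:
p = SeparationPrompt(text="singing voice", span=(12.0, 34.5))
print(p.describe())  # text='singing voice' + span=12.0s-34.5s
```

The point of the sketch is the combination: a user could type what they want, click where it is on screen, and mark when it happens, all in one request.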

Meta calls span prompting an industry first, saying it helps fix issues across an entire clip instead of frame by frame.

The engine behind the model

Under the hood, SAM Audio runs on something Meta calls Perception Encoder Audiovisual, or PE-AV. Meta explains it using a simple analogy, calling PE-AV “the ears” and SAM Audio “the brain.”

PE-AV builds on the open source Perception Encoder model Meta released earlier this year. It aligns video frames with audio at precise moments in time, allowing the system to understand what is being seen and heard together. This matters when the sound source is on screen, like a speaker or instrument, and even when it is off screen but hinted at by the scene.

Meta says PE-AV was trained on over 100 million videos using large-scale multimodal learning, pulling from open datasets and synthetic captioning pipelines.

Meta’s new AI model lets users pick and separate any sound | Source: Meta AI

Benchmarks and judging sound quality

Alongside the model, Meta also released tools to measure performance. One is SAM Audio-Bench, described as the first in-the-wild audio separation benchmark. Unlike older datasets that rely on synthetic mixes, this benchmark uses real audio and video with text, visual, and time-based prompts.

Another is SAM Audio Judge, an automatic judge model designed to assess audio quality without needing reference tracks. Meta says this approach mirrors how humans actually judge sound, instead of relying only on technical comparisons.
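The distinction between reference-based and reference-free judging can be illustrated with a toy example. Nothing below reflects how SAM Audio Judge actually works; both functions are invented stand-ins. The first needs the clean ground-truth track, which rarely exists for real-world audio; the second scores the output alone, using a crude cue (clipping) that a listener could notice without ever hearing the original.

```python
import math

# Toy contrast between the two evaluation styles the article mentions.
# These metrics are illustrative assumptions, not Meta's.

def snr_db(reference, estimate):
    """Reference-based: signal-to-noise ratio against a known clean track."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10(signal / noise) if noise else float("inf")

def reference_free_score(estimate):
    """Reference-free toy heuristic: fraction of samples that are not clipped."""
    clipped = sum(1 for x in estimate if abs(x) >= 1.0)
    return 1.0 - clipped / len(estimate)

clean = [0.1, 0.5, -0.3, 0.2]       # ground truth (often unavailable in the wild)
separated = [0.1, 0.5, -0.3, 0.25]  # a model's separated output
print(round(snr_db(clean, separated), 1))  # needs the clean track
print(reference_free_score(separated))     # judges the output by itself
```

A real reference-free judge would of course use learned perceptual cues rather than a clipping count, but the interface difference is the point: one function takes two tracks, the other takes one.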

As someone who once spent an entire evening adjusting audio levels for a short interview clip, the idea of reference-free judging feels oddly relatable. You usually know when something sounds off, even if you cannot explain why.

Where Meta sees this being used

Meta says SAM Audio could shape tools across music, podcasting, television, film, and accessibility. The company also confirmed partnerships with Starkey, a hearing aid manufacturer, and 2gether-International, a startup accelerator for disabled founders, to explore accessibility use cases.

The model runs faster than real time and comes in large parameter sizes, though Meta admits there are limits. Audio cannot be used as a prompt yet, and separating very similar sounds, like one singer in a choir, remains difficult.

SAM Audio is available starting today through the Segment Anything Playground, where users can upload their own audio or video to try it out. Meta has also made the model available for download.
