New Delhi: Meta has added a new chapter to its Segment Anything story, this time focused on sound. The company has introduced SAM Audio, a new artificial intelligence model designed to separate and isolate sounds from complex audio. The announcement positions audio as the next frontier after images and video, an area that has long remained messy and tool-heavy for creators and researchers.
For anyone who has struggled to clean background noise from a video or podcast, the idea feels familiar. Audio editing often means jumping between tools, filters, and guesswork. Meta says SAM Audio aims to simplify that process by letting people interact with sound in ways that feel natural, using text, visuals, or time markers, according to details shared in its official blog post.
Meta describes SAM Audio as “a first-of-its-kind model for segmenting sound.” In its words, “We’re introducing SAM Audio, a state-of-the-art unified model that transforms audio processing by making it easy to isolate any sound from complex audio mixtures using natural, multimodal prompts.”
At a basic level, SAM Audio lets users pull out a specific sound from audio that has many things happening at once. A guitar from a band performance. A voice from traffic noise. A barking dog from an entire podcast recording.
The model supports three types of prompts: text, visual and time-based span prompts. They can work alone or together.
Meta calls span prompting an industry first, saying it helps fix issues across an entire clip instead of frame by frame.
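To make the prompting idea concrete, here is a rough sketch of how a request combining those prompt types might be structured in code. The class, field and function names are illustrative assumptions for this article, not Meta's published interface.

```python
# Hypothetical sketch of multimodal prompting for sound separation.
# Names and fields below are illustrative assumptions, not Meta's API.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SeparationPrompt:
    # Free-text description of the target sound, e.g. "acoustic guitar".
    text: Optional[str] = None
    # Pixel coordinates of the sound source in a video frame (visual prompt).
    visual_point: Optional[Tuple[int, int]] = None
    # Start/end times in seconds where the target sound occurs (span prompt).
    span: Optional[Tuple[float, float]] = None


def build_request(audio_path: str, prompt: SeparationPrompt) -> dict:
    """Bundle the audio mixture with whichever prompt types were supplied."""
    request = {"audio": audio_path}
    if prompt.text is not None:
        request["text_prompt"] = prompt.text
    if prompt.visual_point is not None:
        request["visual_prompt"] = prompt.visual_point
    if prompt.span is not None:
        request["span_prompt"] = prompt.span
    return request


# The prompt types can be combined in a single request.
example = build_request(
    "band_performance.wav",
    SeparationPrompt(text="electric guitar", span=(12.0, 45.5)),
)
print(example)
```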
Under the hood, SAM Audio runs on something Meta calls Perception Encoder Audiovisual, or PE-AV. Meta explains it using a simple analogy, calling PE-AV “the ears” and SAM Audio “the brain.”
PE-AV builds on the open source Perception Encoder model Meta released earlier this year. It aligns video frames with audio at precise moments in time, allowing the system to understand what is being seen and heard together. This matters when the sound source is on screen, like a speaker or instrument, and even when it is off screen but hinted at by the scene.
Meta says PE-AV was trained on over 100 million videos using large-scale multimodal learning, pulling from open datasets and synthetic captioning pipelines.
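A loose way to picture that split is a two-stage pipeline, sketched below with stand-in shapes and placeholder functions. It mirrors the "ears and brain" analogy rather than Meta's actual architecture; every name and number here is an assumption for illustration.

```python
# Conceptual sketch of the "ears and brain" split: an audiovisual encoder
# produces time-aligned embeddings, and a separation model uses them plus a
# prompt to pull out the target sound. Shapes and functions are placeholders.
import numpy as np


def encode_audiovisual(audio: np.ndarray, frames: np.ndarray) -> np.ndarray:
    """'The ears': fuse audio and video into per-timestep embeddings.
    Simulated here with random features purely for illustration."""
    t = min(len(audio) // 1600, len(frames))   # align both streams to a shared clock
    audio_emb = np.random.randn(t, 256)        # stand-in audio features
    video_emb = np.random.randn(t, 256)        # stand-in video features
    return np.concatenate([audio_emb, video_emb], axis=-1)


def separate(embeddings: np.ndarray, mixture: np.ndarray, prompt: str) -> np.ndarray:
    """'The brain': predict a mask for the prompted sound and apply it to the
    mixture. A real model would predict the mask; a placeholder of ones keeps
    this sketch runnable end to end."""
    mask = np.ones_like(mixture)
    return mixture * mask


mixture = np.random.randn(16000 * 10)                        # 10 s of audio at 16 kHz
frames = np.zeros((100, 224, 224, 3), dtype=np.uint8)        # 10 fps of video frames
emb = encode_audiovisual(mixture, frames)
isolated = separate(emb, mixture, prompt="barking dog")
print(emb.shape, isolated.shape)
```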
Alongside the model, Meta also released tools to measure performance. One is SAM Audio-Bench, described as the first in-the-wild audio separation benchmark. Unlike older datasets that rely on synthetic mixes, this benchmark uses real audio and video with text, visual, and time-based prompts.
Another is SAM Audio Judge, an automatic judge model designed to assess audio quality without needing reference tracks. Meta says this approach mirrors how humans actually judge sound, instead of relying only on technical comparisons.
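For context on what those technical comparisons usually look like, the snippet below computes scale-invariant SNR (SI-SNR), a common reference-based separation metric in the field. Meta's post does not name it; the point is simply that metrics like this need a clean reference track, which is exactly what a reference-free judge tries to do without.

```python
# SI-SNR: a standard reference-based metric for source separation quality.
# It requires a clean reference signal, unlike a reference-free judge model.
import numpy as np


def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (dB) between an estimated source
    and its clean reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return float(10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps)))


rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                      # pretend clean reference
noisy_estimate = clean + 0.1 * rng.standard_normal(16000)
print(f"SI-SNR: {si_snr(noisy_estimate, clean):.1f} dB")
```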
As someone who once spent an entire evening adjusting audio levels for a short interview clip, the idea of reference-free judging feels oddly relatable. You usually know when something sounds off, even if you cannot explain why.
Meta says SAM Audio could shape tools across music, podcasting, television, film, and accessibility. The company also confirmed partnerships with Starkey, a hearing aid manufacturer, and 2gether-International, a startup accelerator for disabled founders, to explore accessibility use cases.
The model runs faster than real time and comes in large parameter sizes, though Meta admits there are limits. Audio cannot be used as a prompt yet, and separating very similar sounds, like one singer in a choir, remains difficult.
SAM Audio is available starting today through the Segment Anything Playground, where users can upload their own audio or video to try it out. Meta has also made the model available for download.