ElevenLabs' Full Audio Stack, GPT-5 & More
Check out the voice AI startup that is automating the front desk for behavioral health clinics
Welcome to Voice AI Weekly!
If you are new here: every week we share the biggest news from the voice AI space that has hit our radar. We surf the web so you don’t have to.
In today’s weekly roundup we have:
How OpenAI’s GPT-5 affects latency
ElevenLabs’ new move in music generation
Attention Labs’ solution for isolating voices from noise
Cursor brings its coding agent to the command line
🤖OpenAI Launches GPT-5
What is it?
OpenAI shipped GPT-5. It uses a router system that unifies their previous models like GPT-4o and dynamically allocates resources based on query complexity. It’s available to all users now.
The details:
A unified model architecture routes tasks to the most appropriate internal model, from simple queries to complex reasoning.
It sets new state-of-the-art results on benchmarks for math (AIME 2025), coding (SWE-Bench), and multimodal understanding.
Expanded agentic capabilities allow it to execute multi-step tasks using tools like Gmail and Google Calendar.
The API context window is up to 272k tokens for input and 128k for output.
Factual errors are reportedly reduced by 45% compared to GPT-4o when using web search.
What it means for the voice AI industry:
A more capable reasoning engine is a welcome development for complex tasks. But for real-time voice, the core bottleneck remains latency.
We've been running tests on the API since it dropped. Early results show no improvement in latency for conversational use cases.
Our internal benchmarks show end-to-end latency for the base GPT-5 model is between 1500ms and 2700ms. The 'mini' and 'nano' versions are faster but still not a leap forward for voice.
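For readers who want to reproduce this kind of measurement themselves, here is a minimal sketch of the core metric we care about for voice: time-to-first-token. The helper is generic; the commented-out usage example assumes the OpenAI Python SDK with streaming enabled, and the `"gpt-5"` model name and prompt are placeholders for whatever you actually test.

```python
import time
from typing import Callable, Iterable

def time_to_first_token(stream_factory: Callable[[], Iterable[str]]) -> float:
    """Milliseconds from request start until the first token arrives.

    `stream_factory` should kick off the request and return an iterable
    of tokens; we time from just before the call to the first yield.
    """
    start = time.perf_counter()
    stream = stream_factory()
    for _token in stream:
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no tokens")

# Example against the OpenAI SDK (assumed usage; needs OPENAI_API_KEY):
#
# from openai import OpenAI
# client = OpenAI()
#
# def gpt5_stream():
#     resp = client.chat.completions.create(
#         model="gpt-5",
#         stream=True,
#         messages=[{"role": "user", "content": "Say hello."}],
#     )
#     return (c.choices[0].delta.content or "" for c in resp)
#
# print(f"TTFT: {time_to_first_token(gpt5_stream):.0f} ms")
```

Note that time-to-first-token is only one slice of end-to-end voice latency; a full pipeline also pays for speech-to-text before the LLM and text-to-speech after it.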
🎵ElevenLabs Enters AI-Generated Music
What is it?
ElevenLabs has launched Eleven Music. It’s a text-to-music generation platform that can generate studio-quality songs in under a minute. This expands the company’s scope from speech synthesis into full music composition, with controls for genre, instrumentation, multilingual vocals, and post-generation editing.
The details:
Eleven Music generates songs from text prompts, allowing users to specify genre, instrumentation, and vocals.
It builds on their core voice synthesis tech, producing high-fidelity vocals that early demos suggest compete with or exceed models from Suno and Udio.
An API is available for developers to integrate programmatic music generation into apps and games.
They are marketing it as commercially safe, claiming to have worked with industry partners on training data to mitigate copyright issues.
What it means for the voice AI industry:
ElevenLabs is building two distinct platforms on top of the same core research: one for creatives and one for conversational agents. Eleven Music strengthens the creative side today, letting creators, studios, and entertainment teams add fully licensed, studio-quality music to any project, from voice-overs to films to games.
For builders, this marks a shift in competitive edge. The future of “good” voice agents won’t just be judged by their conversational intelligence but by how well they can orchestrate the entire audio layer - voice, sound effects, and now music - into a coherent, emotionally engaging experience.
🗣️Attention Labs Launches SAA to Isolate Voices in Noise
What is it?
Attention Labs has launched its Selective Auditory Attention (SAA) technology, an on-device audio engine that can isolate specific voices in real time, even in noisy environments with multiple people talking. It reportedly achieves sub-100ms latency, works across 2–8 microphones, and requires no cloud processing, keeping all audio private.
The details:
SAA isolates a target speaker from background noise and overlapping speech, processing audio locally on the device.
It delivers sub-100ms latency with low power consumption, requiring no cloud connection and keeping user data private.
The tech is designed for hardware OEMs and integrates with devices using 2 to 8 microphones. Partners include Meta, Nvidia, and Samsung.
What it means for the voice AI industry:
The audio input layer is getting serious. For years, the focus has been on cloud-based transcription and LLMs, but the quality of the raw audio stream has been a major bottleneck.
On-device pre-processing to clean up audio before it ever hits a transcription model is becoming a critical component. For builders, this means the problem space is expanding.
Solum Health: The AI Agent Built for Behavioral Health Clinics.
The Problem:
Behavioral health clinics are buried in administrative work. Staff spend hours on hold with insurance companies and chasing intake forms instead of focusing on patient care. Every delayed intake is a lost patient.
The Build:
Solum Health built a platform that deploys AI agents to automate healthcare operations - intake, insurance verification, and scheduling - for behavioral health clinics.
The agents integrate with a clinic's EHR to handle the full administrative workflow, cutting admin time by 50% and increasing patient bookings by up to 30%.
OpenAI releases GPT-5 and open-weight gpt-oss for flagship reasoning and local agent builds. [Read more]
Anthropic ships Claude Opus 4.1, outperforming GPT-4o on coding benchmarks for complex agentic tasks. [Read more]
Cursor ships a CLI, bringing its coding agent to any terminal, server, or CI/CD pipeline. [Read more]
Google ships Storybook, a Gemini feature that generates illustrated, narrated stories from text or voice prompts. [Read more]
Want to get your company featured in the Built On Vapi section?
Fill out this form and we will pick one new startup from the submissions to feature here every week.
Also feel free to just reply to this email with suggestions (we read everything you send us)!
Would you also share details on the voice agent tech stack that Solum and other real-time voice agent companies are using, to help us learn from it?