Voice model pricing, limits, and session behavior change quickly. The pricing and feature details below were checked against official provider docs and pricing pages on April 24, 2026 UTC and should be rechecked before production rollout.
By Marcus Reed, AI infrastructure analyst at Deep Digital Ventures. Reviewed by Priya Shah, cloud cost engineer. Marcus tracks model pricing, realtime API architecture, and production AI unit economics for teams budgeting agent deployments.
Most teams still underestimate AI voice model costs in 2026 because they compare vendors as if this were just another chatbot pricing page. It is not. Production voice systems bill for listening time, speech output, tool calls, conversation history, interruption handling, and often a second text or reasoning model behind the live session.
List price alone rarely tells you what GPT-Realtime, Gemini Live, or Nova Sonic will cost to run. The useful question is how each provider meters speech versus text, how realtime architecture affects latency and barge-in waste, and what a realistic call costs after tool use and conversation state are included.
Key takeaways
- Voice AI economics are usually driven more by audio output, interruption waste, and backend routing than by a single headline model price.
- OpenAI’s current gpt-realtime-1.5 is the premium-priced option in this group; Gemini 2.5 Flash Native Audio has the lowest speech-token rates in this comparison; Amazon Nova Sonic is close on speech and much cheaper on text in US East.
- Speech tokens and text tokens are priced separately for all three providers, which means transcripts, tool calls, grounding, and memory can become their own line item.
- In the baseline three-minute example below, the modeled per-call cost is about $0.149 for OpenAI, $0.0249 for Gemini Live, and $0.0276 for Nova Sonic before telephony, storage, taxes, retries, discounts, or separate safety and reasoning models.
Current pricing snapshot
| Model | Realtime architecture | Verified 2026 pricing | What the bill is really counting |
|---|---|---|---|
| OpenAI gpt-realtime-1.5[1] | Realtime API over WebRTC, WebSocket, or SIP[2] | Text input: $4.00 per 1M; Text output: $16.00 per 1M; Audio input: $32.00 per 1M; Audio output: $64.00 per 1M | One realtime model handles voice interaction directly, but audio tokens carry a steep premium versus text. Good fit when voice quality, tooling, and telephony/browser flexibility matter more than raw cost. |
| Google Gemini 2.5 Flash Native Audio (Live API preview)[3] | Stateful Live API sessions over WebSockets with barge-in support[4] | Text input: $0.50 per 1M; Text output: $2.00 per 1M; Audio or video input: $3.00 per 1M; Audio output: $12.00 per 1M | Google’s native audio Live API is priced much closer to text than OpenAI’s realtime lane. It is compelling on price-performance, but it is still a preview product and should be treated that way operationally. |
| Amazon Nova Sonic on Bedrock[5] | Bedrock bidirectional streaming API[6] | US East (N. Virginia): Speech input: $3.40 per 1M; Speech output: $13.60 per 1M; Text input: $0.06 per 1M; Text output: $0.24 per 1M | AWS separates speech and text billing clearly. Text pricing applies to things like transcription, tool calls, grounding, and conversation history, not just visible replies. Region-specific pricing matters. |
That table is the starting point, not the full answer. Gemini has the lowest listed speech-token rates here, Nova Sonic has the lowest text-token rates, and OpenAI is the premium option. Production spend still depends on how much audio you stream, how long the assistant talks, and how often your system routes harder turns into another model.
Where voice AI cost actually comes from
| Cost driver | Why it matters | What disciplined teams do |
|---|---|---|
| Listening time | An always-on session can accumulate speech input tokens when the user is hesitant, noisy, or silent. | Use VAD, idle cutoffs, and clear session boundaries instead of leaving sessions open by default. |
| Response length | Audio output is often the most expensive part of a successful turn, especially on premium realtime models. | Optimize for short spoken answers first, then expand only when the user asks for more detail. |
| Interruptions and barge-in | Natural interruption handling improves UX, but partial generations that get canceled still consume time, tokens, or both. | Tune turn-taking carefully instead of assuming more aggressive interruption handling is always cheaper. |
| Tool calls and retrieval | Agents that look up orders, schedules, CRM records, or internal documents create extra text-token spend. | Track tool-heavy flows separately from pure conversation flows. |
| Conversation history | Realtime sessions can keep state, but long histories increase text-token usage and can raise latency. | Summarize or trim history instead of replaying every turn forever. |
| Fallback reasoning models | Many systems route hard turns into a second model for policy, planning, or post-call summarization. | Measure voice-front-end cost and backend escalation cost separately so routing economics stay visible. |
The practical formula is simple: speech input plus speech output plus text work plus any backend model invoked for difficult parts. A team can choose a cheap speech model and still end up with an expensive system if every call triggers long tool traces, summaries, or policy checks.
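That formula can be sketched as a quick expected-cost model. The escalation probability and backend per-call cost below are hypothetical placeholders for illustration, not provider numbers:

```python
def expected_call_cost(speech_in, speech_out, text_work, p_escalate, backend_cost):
    """Per-call dollars: live voice work plus probabilistic backend routing."""
    voice_cost = speech_in + speech_out + text_work
    return voice_cost + p_escalate * backend_cost

# Hypothetical illustration: a cheap speech front end where 60% of calls
# trigger a $0.05 backend reasoning or summarization pass, versus 10%.
cheap_but_leaky = expected_call_cost(0.003, 0.024, 0.001, 0.60, 0.05)
disciplined = expected_call_cost(0.003, 0.024, 0.001, 0.10, 0.05)
print(f"${cheap_but_leaky:.3f} vs ${disciplined:.3f} per call")
```

The point of the sketch is the second term: with these illustrative numbers, routing discipline alone moves the per-call cost from about $0.058 to about $0.033 without touching the speech model at all.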
Pricing math: speech tokens, text tokens, and a three-minute call
One of the biggest buyer mistakes is assuming that a voice system is basically a text model with a microphone attached. The pricing says otherwise. OpenAI charges $4 per 1M text input tokens but $32 per 1M audio input tokens, and $16 per 1M text output tokens but $64 per 1M audio output tokens. Gemini’s gap is smaller, and Nova Sonic’s text pricing is extremely low, but all three still separate speech from text.
For a transparent baseline, use an audio estimate of 25 tokens per second, then apply each provider’s per-token rates.[7] This is a modeling assumption for comparison, not a promise that every provider tokenizes speech identically. In the baseline call, assume a three-minute support interaction with 36 seconds of customer speech and 72 seconds of assistant speech: 36 seconds x 25 = 900 input audio tokens, and 72 seconds x 25 = 1,800 output audio tokens. Add 400 text input tokens and 200 text output tokens for tool calls and transcript memory.
The formula is (tokens / 1,000,000) x rate. Excluded costs: telephony carrier or SIP charges, WebRTC infrastructure, call recording and storage, observability, retries, test traffic, taxes, enterprise discounts, prompt caching, regional deltas outside US East, and separate moderation, safety, or reasoning models.
| Scenario | Assumption | OpenAI | Gemini Live | Nova Sonic |
|---|---|---|---|---|
| Baseline support call | 900 audio in, 1,800 audio out, 400 text in, 200 text out | $0.149 | $0.0249 | $0.0276 |
| Low-tool-use call | Same speech, but only 100 text in and 50 text out | $0.145 | $0.0245 | $0.0276 |
| High-interruption call | Same user speech, 40% more generated audio, 700 text in, 300 text out | $0.198 | $0.0339 | $0.0374 |
At 100,000 baseline calls per month, that becomes roughly $14,900 for OpenAI, $2,490 for Gemini Live, and $2,760 for Nova Sonic before the excluded items above. The lesson is not to always pick the cheapest lane; it is that audio-output tokens and interruption waste dominate the math, while text-heavy tool flows can change the ranking for real workloads.
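The worked example above can be reproduced in a few lines. The rates are the per-1M-token prices from the pricing snapshot table, and the 25-tokens-per-second audio figure is this article's modeling assumption, not a provider-documented tokenizer rate:

```python
TOKENS_PER_SECOND = 25  # modeling assumption for audio, per the text

# Per-1M-token rates from the pricing table: (audio_in, audio_out, text_in, text_out)
RATES = {
    "OpenAI gpt-realtime-1.5": (32.00, 64.00, 4.00, 16.00),
    "Gemini 2.5 Flash Native Audio": (3.00, 12.00, 0.50, 2.00),
    "Nova Sonic (US East)": (3.40, 13.60, 0.06, 0.24),
}

def call_cost(rates, user_s, assistant_s, text_in, text_out):
    """Dollar cost of one call: (tokens / 1,000,000) * rate, summed across meters."""
    a_in, a_out, t_in, t_out = rates
    audio_in = user_s * TOKENS_PER_SECOND
    audio_out = assistant_s * TOKENS_PER_SECOND
    return (audio_in * a_in + audio_out * a_out + text_in * t_in + text_out * t_out) / 1e6

# Baseline support call: 36 s user speech, 72 s assistant speech, 400/200 text tokens.
for name, rates in RATES.items():
    per_call = call_cost(rates, 36, 72, 400, 200)
    print(f"{name}: ${per_call:.4f}/call, ~${per_call * 100_000:,.0f}/month at 100k calls")
```

Running this reproduces the baseline row ($0.1488, $0.0249, and $0.0276 per call before rounding), and changing the text-token arguments reproduces the other two scenarios.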
Latency and interruption handling are product decisions, not just model features
Low latency is not only a user experience metric. It changes how long sessions stay open, how often users talk over the assistant, and how much wasted generation you pay for. That makes interruption design an economic decision.
OpenAI’s realtime stack is operationally flexible because the current docs support WebRTC, WebSocket, and SIP connection paths.[2] That is useful for browser-native apps, call flows, and telephony bridges. The tradeoff is cost discipline. On a premium audio-output model, long spoken responses and repeated interruptions become expensive quickly.
Google’s Live API is explicit about barge-in: users can interrupt the model at any time, and the API is built around low-latency voice interactions over a stateful WebSocket connection.[4] That is good for natural conversation, but the cheapest implementation is not always the most natural one. If the assistant starts speaking too early or too long, you create interruption waste.
AWS frames Nova Sonic from an enterprise voice-agent angle. Its documentation emphasizes bidirectional audio streaming, low-latency multi-turn conversations, function calling, RAG, and graceful handling of interruptions without dropping context.[6] That makes Nova Sonic relevant to call automation and task-oriented agents where turn efficiency matters as much as raw model quality.
The commercial rule is straightforward: if users interrupt often, shorter first responses usually improve both perceived quality and cost. If the assistant speaks in long paragraphs, you pay for verbosity and for the tokens generated right before the user cuts it off.
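To make the interruption-waste point concrete, here is a small sketch of how much of a long first answer is paid for but never heard when a user barges in early. It reuses the 25-tokens-per-second modeling assumption and OpenAI's audio-output rate from the table; the barge-in timings are hypothetical, and it assumes all generated-so-far audio is billed, which may vary by provider and implementation:

```python
TOKENS_PER_SECOND = 25        # modeling assumption for audio tokens
AUDIO_OUT_RATE = 64.00 / 1e6  # OpenAI audio-output rate per token, from the table

def wasted_audio_cost(planned_s, barge_in_s):
    """Cost of assistant audio generated but cut off by a user interruption."""
    wasted_s = max(planned_s - barge_in_s, 0)
    return wasted_s * TOKENS_PER_SECOND * AUDIO_OUT_RATE

# Hypothetical: a 30-second answer interrupted at 8 seconds, vs a 10-second answer.
long_answer = wasted_audio_cost(30, 8)   # 22 s of paid-for audio nobody hears
short_answer = wasted_audio_cost(10, 8)  # only 2 s wasted
print(f"${long_answer:.4f} vs ${short_answer:.4f} wasted per interrupted turn")
```

Under these assumptions, the long answer wastes about $0.035 per interrupted turn versus about $0.003 for the short one, which is why brevity-first response design compounds at scale.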
Why the voice model is rarely the whole stack
Most serious voice systems are really two systems. The first is the live conversation layer that listens, speaks, and manages turn-taking. The second is the intelligence layer that does retrieval, structured decision-making, escalation, summarization, QA, or policy-heavy logic.
If that second layer uses separate text or reasoning models, the AI Models app can help compare those non-voice choices by price, context window, benchmarks, compatibility, freshness, and change history. For production voice systems, that backend routing choice can move total unit economics as much as the live speech model itself.
When voice ROI is worth it
| Use case | When voice usually pays off | When it usually does not |
|---|---|---|
| Customer support triage | When the agent can deflect repetitive calls, gather structured details, and route the right cases fast. | When the conversation is mostly empathy-heavy edge cases that still end up with a human every time. |
| Scheduling and booking | When tool integration is strong and the task flow is narrow enough to automate reliably. | When downstream systems are fragmented and every booking still requires manual cleanup. |
| Status checks and account actions | When users need quick answers while driving, walking, or calling in. | When the same task is already faster in a clean self-serve web flow. |
| Premium concierge or sales assistant | When higher conversion, larger average order value, or lower handle time clearly justifies the model spend. | When voice is added because it sounds modern rather than because it changes revenue or labor economics. |
Voice ROI is strongest when speed and convenience are part of the product value, not just an interface preference. If users are happier clicking than talking, voice can become an expensive layer that adds latency, engineering complexity, and monitoring burden without raising revenue or reducing labor enough to matter.
How to evaluate these models more honestly
A better evaluation framework is to compare each provider on four dimensions at once: speech-token price, interruption behavior, tool or telephony integration fit, and the cost of the text work behind the session. That is a much better proxy for production economics than asking which vendor has the lowest headline rate.
- Use OpenAI when premium realtime quality, telephony flexibility, and tooling are worth paying for.
- Use Gemini Live when you want the lowest speech-token pricing in this comparison and are comfortable with a preview-stage native-audio stack.
- Use Nova Sonic when Bedrock fits your architecture, low text-token pricing matters, and enterprise voice-agent features improve the business case.
- Separate voice-front-end cost from backend model cost before committing to any provider.
FAQ
Which of these voice models is cheapest right now?
On the official rates checked on April 24, 2026, Gemini 2.5 Flash Native Audio has the lowest listed speech-token pricing in this comparison, Nova Sonic is close on speech and cheaper on text in US East, and OpenAI gpt-realtime-1.5 is the premium-priced option. That is a list-price statement, not a guarantee of lowest total task cost.
What is the biggest production voice AI cost mistake?
Letting the assistant talk too much and hiding backend escalation cost. Audio output and premium fallback routing are usually where margins disappear first.
Should one model handle everything?
Usually no. Many production stacks use one realtime model for the interaction layer and a separate text or reasoning model for tool planning, policy-sensitive decisions, summaries, or QA.
Does interruption handling really affect the bill?
Yes. Barge-in can make conversations feel much better, but partial responses, repeated restarts, and long first answers can all raise output-token usage. The cheapest voice agent is often the one that answers briefly, waits well, and escalates only when needed.
The most useful way to think about voice AI in 2026 is as a routing and margin problem. Once you separate speech cost, text cost, interruption waste, and backend model cost, GPT-Realtime, Gemini Live, and Nova Sonic stop looking like abstract brand choices and start looking like operating decisions.
Sources
- [1] OpenAI gpt-realtime-1.5 model and pricing page: https://developers.openai.com/api/docs/models/gpt-realtime-1.5
- [2] OpenAI Realtime API connection and implementation docs: https://developers.openai.com/api/docs/guides/realtime
- [3] Google Gemini API pricing page for Gemini 2.5 Flash Native Audio Live API: https://ai.google.dev/gemini-api/docs/pricing
- [4] Google Gemini Live API overview, including WebSocket protocol and barge-in behavior: https://ai.google.dev/gemini-api/docs/live-api
- [5] Amazon Bedrock pricing page for on-demand model rates by Region: https://aws.amazon.com/bedrock/pricing/
- [6] Amazon Nova Sonic speech-to-speech documentation: https://docs.aws.amazon.com/nova/latest/userguide/speech.html
- [7] Google Cloud pricing modality notes used as the public audio-token calibration point for the worked example: https://cloud.google.com/gemini-enterprise-agent-platform/generative-ai/pricing