Gemini Live and GPT-4o Realtime API

Section 39.3

"The protocols are 60% the same. The 40% difference is where production engineering hours go."

EchoEcho, Realtime-Wrangling AI Agent
Big Picture

Both GPT-4o Realtime and Gemini Live expose a WebSocket-based, event-driven protocol for streaming audio and tool calls. The shapes are similar: open a session with configuration, push audio frames over the socket, receive incremental text and audio deltas. The differences are in turn-detection semantics, function-calling shape, session state model, and the audio-codec format. This section walks through both protocols side-by-side with executable code, then covers the production patterns (reconnection, partial-state recovery, tool dispatch) that distinguish a demo from a shipping product.

Prerequisites

This section assumes the speech-recognition and TTS pipelines from Section 20.1 and Section 20.5, the frontier-API patterns from Section 14.1, and the streaming-API patterns from Section 11.4.

Side-by-side sequence diagrams of GPT-4o Realtime and Gemini Live sessions, showing event types and message ordering for a single turn
Figure 39.3.1: Sequence diagrams for one conversational turn on each API. Both protocols are event-streamed, but event names, the exact set of session-update fields, and the function-calling event shape differ.

39.3.1 Protocol Overview

Fun Fact

GPT-4o Realtime and Gemini Live both ship over WebSockets, but they disagree on almost everything else: frame size, codec, event names, and whether the model can interrupt the user. The disagreement is not architectural; it is brand. Each vendor wants their SDK to be the one developers learn first.

Both APIs use a long-lived WebSocket connection. Once authenticated, the client sends a session-update event with configuration (voice, modalities, tools, system prompt), then streams audio chunks. The server emits incremental events as the model processes input and generates output.

The minimal event types are:

ConceptGPT-4o RealtimeGemini Live
Session configsession.updateBidiGenerateContentSetup
Send audio chunkinput_audio_buffer.appendrealtimeInput.audio
Send text chunkconversation.item.createclientContent.turns
Commit audio turninput_audio_buffer.commit + response.createImplicit (server VAD)
Server emits textresponse.text.deltaserverContent.modelTurn.parts.text
Server emits audioresponse.audio.deltaserverContent.modelTurn.parts.inlineData
Function callresponse.function_call_arguments.deltatoolCall.functionCalls
Function resultconversation.item.create (function_call_output)toolResponse.functionResponses
Interruption (user spoke)input_audio_buffer.speech_startedserverContent.interrupted
Table 39.3.2: Side-by-side event taxonomy. The conceptual primitives match; the names and the message envelopes do not. Production code that targets both APIs typically wraps them in a thin adapter.

39.3.2 GPT-4o Realtime: Full Walkthrough

The GPT-4o Realtime endpoint is wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview. Authentication is via a standard bearer token plus the OpenAI-Beta: realtime=v1 header. The minimal full lifecycle:

import asyncio, json, base64, os
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def run():
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # 1. Configure session: enable both modalities, set voice and tools.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "shimmer",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad", "threshold": 0.5},
                "instructions": "You are a helpful assistant.",
                "tools": [{
                    "type": "function",
                    "name": "lookup_weather",
                    "description": "Get weather for a city",
                    "parameters": {"type": "object",
                        "properties": {"city": {"type": "string"}}},
                }],
            },
        }))

        # 2. Stream microphone audio in 100ms chunks (16kHz PCM-16).
        mic_task = asyncio.create_task(stream_mic(ws))

        # 3. Receive server events and play audio output.
        async for raw in ws:
            ev = json.loads(raw)
            if ev["type"] == "response.audio.delta":
                pcm = base64.b64decode(ev["delta"])
                speaker.write(pcm)
            elif ev["type"] == "response.function_call_arguments.done":
                result = await handle_tool(ev["name"],
                                            json.loads(ev["arguments"]))
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": ev["call_id"],
                        "output": json.dumps(result),
                    },
                }))
                await ws.send(json.dumps({"type": "response.create"}))
            elif ev["type"] == "input_audio_buffer.speech_started":
                speaker.flush()  # interruption
Code Fragment 39.3.1a: Minimal but complete GPT-4o Realtime client. With server_vad turn detection, you do not have to commit audio buffers manually; the server detects end-of-speech and emits a response. Function calls round-trip via conversation.item.create with the function_call_output item type.

39.3.3 Gemini Live: Full Walkthrough

Gemini Live uses a similar WebSocket pattern but with Google's bidi-streaming envelope. The endpoint is wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent. Authentication is via API key as a query parameter.

from google import genai
import asyncio

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
MODEL = "models/gemini-2.5-flash-live-preview"

async def run():
    config = {
        "response_modalities": ["AUDIO"],
        "speech_config": {"voice_config":
            {"prebuilt_voice_config": {"voice_name": "Aoede"}}},
        "system_instruction": {"parts":
            [{"text": "You are a helpful assistant."}]},
        "tools": [{"function_declarations": [
            {"name": "lookup_weather",
             "description": "Get weather for a city",
             "parameters": {"type": "OBJECT",
                "properties": {"city": {"type": "STRING"}}}}
        ]}],
    }

    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # 1. Mic streaming task
        async def send_audio():
            async for chunk in mic_chunks():
                await session.send_realtime_input(
                    audio={"data": chunk, "mime_type": "audio/pcm"}
                )

        asyncio.create_task(send_audio())

        # 2. Receive loop
        async for resp in session.receive():
            if resp.data:
                speaker.write(resp.data)  # PCM audio chunk
            if resp.tool_call:
                for call in resp.tool_call.function_calls:
                    result = await handle_tool(call.name, call.args)
                    await session.send_tool_response(
                        function_responses=[{"id": call.id,
                            "name": call.name, "response": result}]
                    )
            if resp.server_content and resp.server_content.interrupted:
                speaker.flush()
Code Fragment 39.3.2a: Gemini Live client using the official google-genai SDK. The SDK wraps the raw bidi-streaming gRPC; you can also drop down to the WebSocket bridge if you prefer. Function declarations follow JSON Schema with the all-caps Google convention (STRING, OBJECT).

39.3.4 Session Management and Reconnection

Production realtime clients must survive transient WebSocket disconnects. Both APIs let you continue a session by replaying the conversation history on a new connection:

Neither API persists session state server-side beyond the WebSocket lifetime. The client owns the conversation history, which simplifies the data model (no session IDs to manage) but means the client must implement reconnection logic carefully. Production patterns typically save the running transcript to a database every few turns so a refresh or device handoff can resume cleanly.

Note: Both APIs cap sessions at 15 to 30 minutes

Each provider enforces a maximum session duration (currently 30 minutes for GPT-4o Realtime, 15 minutes for Gemini Live, both subject to change). Long-running voice agents must implement transparent session rotation: open a new socket before the old one expires, transfer the conversation context, and switch the audio stream over with minimal perceived gap. The pattern looks like a TCP handoff; expect to spend an afternoon getting it right.

39.3.5 Function Calling in Realtime

Function calling in a streaming audio context introduces a unique problem: the model wants to call a function mid-speech, but the function call blocks the audio response. Both APIs handle this by emitting a function-call event mid-stream, expecting the client to respond with a function output, and then resuming audio generation.

Best practices that emerged in 2025:

  1. Acknowledge before calling: prompt the model to say a brief "let me check" before invoking a tool, so the user perceives responsiveness while the tool runs.
  2. Parallel function dispatch: if multiple tools can be called, dispatch them in parallel rather than serially.
  3. Streaming tool results: for long-running tools, send incremental partial results so the model can begin speaking before the tool fully resolves.
  4. Timeouts: enforce a max latency on tool calls. If the weather API takes more than 1 second, return a "weather lookup pending" placeholder so the conversation does not stall.
Key Insight: The acknowledgment pattern

Without a verbal acknowledgment, the user perceives a silent gap during tool execution. With "let me check that for you" emitted by the model before the tool call, the gap is filled with natural speech and the perceived latency drops dramatically even if the tool is slow. The trick is in the system prompt: instruct the model to always preface tool calls with a brief verbal acknowledgment. Both APIs honor this pattern reliably.

39.3.6 Debugging and Observability

Realtime APIs are noisier than text APIs and harder to debug. Recommended observability:

# Minimal observability wrapper for either realtime API.
# Logs every event with monotonic-clock timestamps to a JSONL file.
import time, json

class EventLogger:
    def __init__(self, path):
        self.f = open(path, "a")
        self.t0 = time.monotonic()

    def log(self, direction, event):
        elapsed_ms = (time.monotonic() - self.t0) * 1000
        self.f.write(json.dumps({
            "t_ms": round(elapsed_ms, 1),
            "dir": direction,
            "type": event.get("type") or type(event).__name__,
            "summary": str(event)[:200],
        }) + "\n")
        self.f.flush()
Code Fragment 39.3.3: A 15-line event logger. Wrap every send and receive with logger.log("out", payload) or logger.log("in", event). The resulting JSONL is invaluable for diagnosing latency spikes and protocol-level mismatches.
Warning: Audio format mismatches are the #1 silent failure

Both APIs accept PCM-16 mono at specific sample rates (24 kHz for GPT-4o output, 16 kHz for Gemini Live input). Sending audio at the wrong rate gets you no error: the server processes the bytes as though they were the expected rate, and the model produces garbage output. Always log the sample rate from your audio capture chain and add an assertion at the WebSocket boundary.

Real-World Scenario: A Multi-Provider Realtime Adapter

Who: A 2025 customer-support startup running a voice agent in production on a single realtime provider.

Situation: Leadership wanted to A/B test GPT-4o Realtime versus Gemini Live in production to find the cheaper provider for their actual workload mix.

Problem: The two APIs used different event names, schemas, voice IDs, and audio formats, so the application code was tightly bound to whichever provider was wired in first.

Dilemma: Either fork the codebase and run two parallel deployments (high engineering cost, hard to A/B test), or build a thin abstraction and risk the leaky-abstraction tax for the gain in flexibility.

Decision: They built a thin adapter layer rather than forking the application.

How: The adapter was about 600 lines of Python exposing a unified RealtimeSession interface with methods send_audio, send_tool_response, and an async iterator of AudioChunk, ToolCall, and InterruptionEvent objects, normalizing voice IDs, sample rates, and function-call schemas.

Result: The A/B test showed Gemini Live was 18% cheaper for their workload but had higher TTFB; they shipped a mixed-provider stack with intelligent routing based on conversational urgency.

Lesson: A thin adapter over realtime providers is one of the highest-leverage engineering investments for any voice product, because it converts provider choice from an architectural commitment into a per-request routing decision.

Key Insight

GPT-4o Realtime and Gemini Live share the same conceptual primitives, WebSocket session, streaming audio, server-VAD turn detection, function calling, but differ in event names, schemas, and SDKs. Production code should wrap them in a thin adapter so you can switch or A/B test. Session management is client-owned; both APIs cap sessions at 15 to 30 minutes and require client-side reconnection. Function calling needs an acknowledgment pattern to avoid silent gaps. Instrument every event with timestamps; audio sample-rate mismatches are silent failures that observability catches before users do.

Self-Check
Q1: You need to switch your audio agent between GPT-4o Realtime and Gemini Live based on cost. What is the minimum adapter surface you need to expose, and which 3 events are most likely to behave differently?
Show Answer
The minimum surface is a unified RealtimeSession exposing send_audio, send_tool_response, and an async iterator yielding AudioChunk, ToolCall, and InterruptionEvent values. The three events most likely to behave differently across providers are turn commit (GPT-4o needs an explicit input_audio_buffer.commit plus response.create, Gemini Live commits implicitly via server VAD), function-call shape (response.function_call_arguments.delta versus toolCall.functionCalls, with different argument schemas), and interruption (input_audio_buffer.speech_started versus serverContent.interrupted). The adapter must normalize these three plus the sample-rate and voice-name differences.
Q2: A user complains that the agent goes silent for 4 seconds whenever it looks up an order. The order-lookup tool returns in 800 ms. What's the fix that doesn't require a faster tool?
Show Answer
The gap is unfilled audio while the tool runs and while the model spins back up after receiving the result. The fix is the acknowledgment pattern: instruct the model in its system prompt to emit a brief verbal acknowledgment such as "let me check that order" before invoking the tool, so the user hears natural speech filling the otherwise silent window. The model is already mid-response when it dispatches the tool, so the perceived gap collapses even though the tool latency is unchanged.
Q3: Sketch a reconnection protocol for a 25-minute voice conversation that needs to survive the 30-minute session cap and a transient 3-second network drop.
Show Answer
Maintain the running conversation transcript on the client and persist it every few turns. Two triggers cause a new socket: a scheduled rotation before the 30-minute cap fires, and a detected disconnect. On either trigger, open a new WebSocket, send the session.update or BidiGenerateContentSetup configuration, replay prior turns (conversation.item.create events for GPT-4o, clientContent.turns for Gemini Live), then resume audio streaming. For the 3-second network drop, buffer microphone audio locally during the gap so no user speech is lost, and overlap the new socket setup with the old socket teardown to minimize the perceived audio gap.
Q4: Why is "log every event with monotonic timestamps" more valuable in realtime APIs than in non-streaming LLM APIs?
Show Answer
Non-streaming APIs have one request and one response, so latency is just request-time minus response-time. Realtime APIs emit dozens of fine-grained events per turn (audio deltas, partial transcripts, function-call deltas, interruptions), and the user-perceived latency depends on the timing relationships between them. Monotonic timestamps on every event let you reconstruct exactly when first audio arrived, how long the gap between speech_started and audio flush was, and which stage is dropping frames. Wall-clock timestamps can drift; monotonic clocks preserve ordering across the whole session for accurate latency forensics.

What Comes Next

Section 39.4: Audio Token Budget and Latency Engineering goes one layer deeper into the audio token economics: codec choice, token rate, voice-activity tuning, and the math behind sub-300-ms TTFT.

Further Reading

API Documentation

OpenAI. (2024-2025). "Realtime API Reference." platform.openai.com/docs/api-reference/realtime
Google. (2024-2025). "Gemini Live API Reference." ai.google.dev/api/generate-content

Protocol Design Discussions

OpenAI. (2024). "Introducing the Realtime API." openai.com/index/introducing-the-realtime-api
Pipecat. (2025). "Adapter patterns for multi-provider voice agents." docs.pipecat.ai

WebRTC and Transport

LiveKit. (2024-2025). "LiveKit Agents documentation." docs.livekit.io/agents

Production Best Practices

OpenAI. (2025). "Best practices for the Realtime API." (Blog post.) openai.com/index