Section 39.5: Open-Source Realtime: Moshi, Pipecat, LiveKit Agents

"The proprietary realtime APIs are great. They are also a single vendor with API price increases and a 30-minute session limit. Open source is the hedge."
Echo, Open-Realtime AI Agent

Big Picture

The open-source realtime stack in 2026 has three layers. At the model layer, Kyutai's Moshi is the foundational native audio-text model with full-duplex (simultaneous listen-and-speak) capability. At the orchestration layer, Pipecat and LiveKit Agents provide pipeline-style frameworks for composing ASR (Automatic Speech Recognition, the speech-to-text step), LLM, and TTS (Text-To-Speech, the speech-out step) into a streaming voice agent. At the transport layer, LiveKit and Daily handle WebRTC media routing (WebRTC is the browser-native low-latency audio/video protocol; "media routing" means relaying packets between participants with as little buffering as possible). This section walks through each piece, explains how they fit together, and provides a concrete recipe for an end-to-end open-source voice agent that hits sub-second TTFAT (Time To First Audio Token, the perceptual-latency metric that Section 39.4 defines: how long from end-of-user-speech until the model's first speech sample starts playing).

Prerequisites

This section assumes the closed-API realtime architecture from Section 39.3, the latency-budget vocabulary from Section 39.4, and the open-weights model zoo for speech and audio from Section 25.4.

Three-layer stack diagram: Moshi or pipeline-LLM at the model layer, Pipecat or LiveKit Agents at orchestration, LiveKit or Daily at transport, all running on commodity hardware — **Figure 39.5.1**: The open-source realtime stack. Each layer has multiple interchangeable options; the figure shows the most production-ready 2026 choices.

39.5.1 Moshi: A Native Audio-Text Foundation Model

Fun Fact

Kyutai released Moshi at a 2024 Paris press event by booting it live, asking the model to discuss French wine, and then deliberately interrupting it mid-sentence to see how it handled barge-in. The model paused, acknowledged the interruption, and continued. The lab's stated mission was "to open-source what OpenAI demos behind a paywall," and the entire weights, codec, and training code shipped publicly within weeks. Moshi remains the only frontier voice model in the open-weights tier whose name nobody can pronounce identically twice.

Kyutai's Moshi (Defossez et al., 2024) is the first widely-used open-weights native audio-text model. The technical novelty is full-duplex: the model has two parallel audio streams (one for each side of the conversation) and can speak while listening. This is a step beyond GPT-4o Realtime, which alternates turns.

Architecture highlights:

Mimi codec: a 12.5 Hz neural codec producing 8 codebook tokens per frame, for ~100 audio tokens/sec/stream. Lower token rate than GPT-4o means cheaper inference.
Hierarchical transformer: a "temporal" transformer over frames and a "depth" transformer over codebooks. This factorization keeps the per-step cost manageable.
Inner monologue: the model emits a text stream alongside the audio stream, so you get a real-time transcript for free.
Open weights: released under permissive license. The 7B variant runs at real-time on a single L4 GPU.

# Moshi inference via the kyutai-labs/moshi package.
# Full-duplex audio: user and model audio are concurrent streams.
from moshi import load_moshi
import torch, sounddevice as sd, numpy as np

model, mimi, tokenizer = load_moshi(
    checkpoint="kyutai/moshiko-pytorch-bf16",
    device="cuda",
)
FRAME = 1920   # 24kHz audio, 12.5Hz codec = 80ms frames

def run():
    with sd.Stream(samplerate=24000, channels=1,
                   blocksize=FRAME) as stream:
        user_tokens = []
        model_tokens = []
        while True:
            # Read 80ms of microphone audio
            audio, _ = stream.read(FRAME)
            # Encode to user codec tokens
            u_tok = mimi.encode(torch.from_numpy(audio).cuda())
            user_tokens.append(u_tok)
            # Model emits one frame of (audio tokens, text token) in response.
            # Both streams advance concurrently; full-duplex.
            m_tok, m_text = model.step(u_tok)
            model_tokens.append(m_tok)
            # Decode and play model audio
            m_audio = mimi.decode(m_tok)
            stream.write(m_audio.cpu().numpy())
            # Real-time text transcript
            if m_text:
                print(tokenizer.decode(m_text), end="", flush=True)

Code Fragment 39.5.1a: Moshi's inference loop. The user and model audio streams are processed in lockstep, 80 ms at a time. The model emits both audio codec tokens and text tokens; the text gives you a real-time transcript with zero additional latency.

Key Insight: Why full-duplex matters

Half-duplex models (the default) treat conversation as alternating turns: the user speaks, then the model speaks. Full-duplex models can do both at once: backchanneling ("uh-huh", "mhm"), overlapping with the user's "yes!" before they finish, gracefully handling rapid-fire dialog. This is closer to how humans actually converse and dramatically improves the perceived naturalness. Moshi was the first open model to demonstrate this at production quality; expect commercial APIs to follow.

39.5.2 Pipecat: A Pipeline Framework for Voice Agents

Pipecat (Daily.co, 2024) is the most widely-used open-source framework for composing voice agent pipelines. The abstractions: a pipeline is a directed graph of processors that pass frames (audio chunks, transcripts, LLM tokens, TTS audio) through. The framework handles backpressure, error recovery, and observability.

# Pipecat pipeline: Daily transport, Deepgram STT,
# OpenAI GPT-4o-mini, Cartesia TTS. Voice agent in ~40 lines.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.services.daily import DailyTransport

transport = DailyTransport(room_url=URL, token=TOKEN, bot_name="Agent")
stt = DeepgramSTTService(api_key=DEEPGRAM_KEY)
llm = OpenAILLMService(api_key=OPENAI_KEY, model="gpt-4o-mini")
tts = CartesiaTTSService(api_key=CARTESIA_KEY, voice_id="79a125e8-...")

pipeline = Pipeline([
    transport.input(),          # Mic in from Daily room
    stt,                         # Audio -> partial transcripts
    llm,                         # Transcripts -> LLM tokens
    tts,                         # LLM tokens -> audio
    transport.output(),         # Audio -> Daily room
])

task = PipelineTask(pipeline, params={"enable_metrics": True})
asyncio.run(task.run())

Code Fragment 39.5.2: Pipecat voice agent in 40 lines. The framework handles backpressure (slow TTS causes upstream to pause), interruption (configurable via the transport), and metrics emission. Add an LLM tool-call adapter or a custom processor to extend.

Pipecat ships connectors for most production services: Deepgram, AssemblyAI, OpenAI, Anthropic, Cartesia, ElevenLabs, Replicate, Whisper, plus transport options (Daily, LiveKit, Twilio, WebRTC, WebSocket). Replacing any one piece is a one-line change. This makes Pipecat the right starting point for most open-source voice agent deployments in 2026.

39.5.3 LiveKit Agents

LiveKit Agents (LiveKit, 2024) is a similar framework with deeper WebRTC integration. LiveKit's heritage is real-time communication infrastructure (Zoom-style video rooms); the Agents framework adds AI participants that can join those rooms.

The conceptual model: a LiveKit room is a meeting; an agent is a participant in that room with the same audio I/O affordances as a human participant. The agent uses LiveKit's media servers for low-latency WebRTC transport across geographic regions.

Strengths: best-in-class WebRTC performance, geographic scale (LiveKit Cloud has global media servers), native support for video as well as audio.
Weaknesses: more opinionated about transport (LiveKit only); newer, smaller community than Pipecat.

The choice between Pipecat and LiveKit Agents is largely about transport: if you are deploying on a phone call (Twilio) or a custom WebSocket, use Pipecat. If you are building a video-call experience (e.g., a virtual interviewer or a tutoring widget), use LiveKit Agents.

39.5.4 Transport Layer: LiveKit, Daily, Twilio

The transport layer carries audio between the user device and the agent. Three production-grade choices:

Provider	Protocol	Best Use Case	Pricing (early 2026)
LiveKit Cloud	WebRTC + LiveKit signaling	Video calls, multi-participant agents	~$0.004 per participant-minute
Daily	WebRTC + Daily SDK	Browser-based agents, simple integration	~$0.004 per participant-minute
Twilio Voice + Media Streams	SIP + WebSocket media	Phone call integration	~$0.015 per minute
Plain WebSocket	WebSocket binary frames	Custom mobile apps, server-to-server	Self-hosted
Vapi	Managed voice infrastructure	Quick start, no own infra	~$0.05 per minute (bundled)

Figure 39.5.2a: Transport options for voice agents. For most consumer products, LiveKit or Daily; for telephony, Twilio; for managed simplicity at higher cost, Vapi.

39.5.5 Deployment and Scaling

An open-source voice agent typically runs as a Python service in a container, one process per concurrent conversation. Scaling involves three considerations:

GPU residency for self-hosted models: if you run Moshi or a self-hosted Whisper, you need GPUs warm and waiting. A single A10G can serve 2 to 4 concurrent Moshi conversations; a single H100 can serve 8 to 12. Use Modal (Section 9.5) or RunPod for elastic GPU pools.
Network proximity: deploy near your users. LiveKit Cloud's global media servers handle this transparently; with self-hosted media, you need multi-region deployment.
Concurrency limits: each conversation holds open sockets to ASR, LLM, TTS providers. Stay under the per-vendor rate limits or roll out across multiple API keys.

Note: Self-hosted vs API trade-off in voice

The cost crossover for self-hosted vs API depends on your concurrency profile. At low utilization (a few concurrent conversations per hour), API-based pipelines are radically cheaper because you pay only for actual usage. At high steady-state load (50+ concurrent conversations 24x7), self-hosted models on rented GPUs become cheaper per minute. The breakeven is typically around 10 to 20 concurrent conversations; below that, stay on APIs.

Research Frontier: The Emerging Open Frontier

Beyond Moshi, the 2025-2026 open-source landscape includes:

Hertz-dev: an 8B open audio-text model from Standard Intelligence with low latency on consumer GPUs.
Llama-4-Omni: covered in Section 37.4; the open-weights GPT-4o competitor with realtime audio path.
Mini-Omni 2: open audio-text-vision model with full-duplex capability.
RT-Voice (formerly Whisper-Large-v3-Turbo): distilled Whisper variants that run at near-real-time on CPU.

The pattern: open models are catching up to proprietary ones with 6 to 12 months of lag. For applications where ownership, on-premises deployment, or fine-tuning matters more than absolute capability, the open stack is increasingly viable.

Real-World Scenario: A Privacy-First Voice Therapy Bot

Who: A 2025 mental-health startup building a voice-based therapy companion subject to HIPAA-style data-handling rules.

Situation: The product needed a fully voice-driven agent, but the startup had committed to its clinician advisors that no user audio would leave its own infrastructure.

Problem: Every off-the-shelf low-latency voice stack (GPT-4o Realtime, Gemini Live) required shipping user audio to a major cloud provider, which the startup could not do.

Dilemma: Accept the proprietary realtime APIs' latency advantage and breach the privacy commitment, or stand up a self-hosted stack and absorb the engineering cost plus higher per-minute latency.

Decision: The team chose a fully self-hosted open stack, accepting higher latency in exchange for keeping all audio inside its own VPC.

How: The stack used Pipecat for orchestration, faster-whisper-large-v3 for ASR on a self-hosted A10G, an internally fine-tuned Llama-3.1-70B for the LLM, Kokoro for on-prem TTS, and LiveKit for WebRTC transport.

Result: Total per-conversation cost was around $0.18 per minute (mostly GPU rental) with about 1.1 s TTFAT, versus a GPT-4o Realtime baseline of around $0.50 per minute at 0.4 s TTFAT; the open stack was 3x cheaper, and the latency was acceptable for therapy-style turn-taking.

Lesson: When regulatory or privacy requirements forbid sending audio to a hyperscaler, the open self-hosted voice stack is the only option, and the latency penalty is usually tolerable for non-instant-response use cases.

Warning: Open-stack engineering cost is real

The "free" in open-source means free of license fees, not free of engineering effort. A self-hosted voice agent typically requires 1 to 3 engineers to bring to production and 0.5 to 1 engineer to maintain. If you do not have those engineering resources, the proprietary realtime APIs are not just simpler, they are likely cheaper in total cost of ownership. Match the build-vs-buy decision to your team capacity, not to the per-minute API price.

Key Insight

The 2026 open-source realtime stack is production-ready but engineering-intensive. Moshi is the native audio-text foundation model with full-duplex capability; Pipecat and LiveKit Agents are the orchestration frameworks; LiveKit and Daily handle WebRTC transport. The open stack wins on privacy, ownership, and high steady-state cost; the proprietary realtime APIs win on simplicity, latency floor, and low-volume cost. Match the stack to your concurrency, compliance, and team-capacity profile.

Self-Check

Q1: Moshi is described as "full-duplex". What does that enable that GPT-4o Realtime cannot do, and why is it harder to train?

Show Answer

Full-duplex means the model has two parallel audio streams (one per speaker) and can listen and speak simultaneously, enabling natural backchanneling such as "uh-huh" or "mhm", graceful overlap with the user's "yes!" before they finish, and rapid-fire dialog where both sides talk over each other. GPT-4o Realtime alternates turns, so it cannot generate audio while the user is still speaking. Training full-duplex is harder because you need paired audio data where two speakers overlap, the model must learn when to interject versus stay silent, and the hierarchical transformer architecture must factorize the temporal and codebook dimensions to keep per-step cost tractable.

Q2: Sketch how you would replace the OpenAI LLM service in the Pipecat example with a self-hosted Llama-4-Omni-8B. Which other components would you adjust?

Show Answer

Swap OpenAILLMService for a Pipecat LLM service that points at your self-hosted inference endpoint (vLLM or TGI serving Llama-4-Omni-8B over an OpenAI-compatible API), keeping the same pipeline.Pipeline call signature. Because Llama-4-Omni is multimodal, you can optionally remove the separate STT and TTS stages and feed audio frames directly through the LLM, collapsing the pipeline. If you stay with the pipeline shape, you must provision a GPU (A10G can host one or two concurrent conversations; H100 handles 8 to 12), warm the model on container startup so cold-start does not break TTFAT, and add concurrency caps so each conversation gets its own KV cache slot.

Q3: For a 20-conversations-per-day customer support bot, which stack (open vs API) is cheaper and why? At 2000 conversations per day?

Show Answer

At 20 conversations per day the API stack is cheaper. GPU rental for self-hosting bills 24x7 regardless of utilization, so a single L4 sits mostly idle and still costs a few dollars per hour; APIs bill only per-minute, so at low concurrency the marginal cost is tiny. The break-even is typically 10 to 20 concurrent conversations of sustained load. At 2000 conversations per day, especially if those are concentrated in business hours, you reach steady-state concurrency where self-hosted GPUs amortize favorably and the open stack becomes cheaper per minute. The textbook's privacy-therapy example shows roughly $0.18/min self-hosted versus $0.50/min for GPT-4o Realtime once utilization is high.

Q4: The transport-layer table lists Twilio at 3x the price of LiveKit and Daily. What does that premium buy you?

Show Answer

Twilio's premium buys SIP-based telephony integration: the ability to receive an actual phone call on a Twilio-provisioned number, route it through SIP, and stream the call audio as a WebSocket media stream to the agent. LiveKit and Daily are WebRTC providers optimized for browser-to-server audio, which is cheaper because WebRTC media servers carry less regulatory and PSTN-interconnect overhead. If your product is a phone-based bot (healthcare scheduling, restaurant reservations, call-center automation), Twilio's price covers the cost of plugging into the traditional telephone network. If your product is browser- or app-based, LiveKit or Daily are dramatically cheaper at equivalent quality.

What Comes Next

Chapter 39 closes here. The next chapters in Part VII (39, 40, 41) cover Vision-Language-Action models, LLM robotics, and world models, then we return to Chapter 33 for cross-modal retrieval-augmented generation.

Further Reading

Moshi

Defossez, A., Mazare, L., Orsini, M., et al. (2024). "Moshi: A speech-text foundation model for real-time dialogue." Kyutai Labs. arXiv:2410.00037

Kyutai. (2024). "kyutai-labs/moshi: Open-source full-duplex audio language model." github.com/kyutai-labs/moshi

Orchestration Frameworks

Daily.co. (2024). "Pipecat: An open-source Python framework for building voice and multimodal conversational agents." docs.pipecat.ai

LiveKit. (2024-2025). "LiveKit Agents framework." docs.livekit.io/agents

Open-Source ASR and TTS

Gandhi, S., von Platen, P., & Rush, A. M. (2023). "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling." Hugging Face. arXiv:2311.00430

Hexgrad. (2025). "Kokoro: 82M parameter open TTS model." huggingface.co/hexgrad/Kokoro-82M

WebRTC and Transport

LiveKit. (2024). "Why WebRTC matters for AI applications." blog.livekit.io