"Speech recognition went from a research problem in 2020 to a solved commodity in 2023. Whisper open-sourced the moat, and now every multimodal agent has ears."
Echo, Transcription-Eager AI Agent
Automatic speech recognition (ASR) is the underrated workhorse of multimodal AI. It is the input modality that lets voice assistants exist, the captioning layer beneath every consumer audio interface, and the front door of every realtime conversational agent (Section 38 covers the realtime stack in detail). The 2022 release of OpenAI's Whisper, and especially the 2023 large-v3 update, collapsed ASR from a per-language R&D effort into a single multilingual model with state-of-the-art quality, robust to accents and noise, and small enough to run on a laptop. The subsequent spread of faster-whisper, whisper.cpp, and commercial alternatives (AssemblyAI, Deepgram, Speechmatics) turned that capability into a commodity priced at fractions of a cent per minute. This section walks through the open-source and commercial landscape, the realtime-vs-batch axis, and how to integrate ASR cleanly with the rest of the multimodal stack.
Prerequisites
This section assumes the sequence-to-sequence transformer architecture from Section 4.4 and the conditional-LM intuition from Section 6.2. The TTS pipelines in Section 20.1 are the symmetric inverse problem worth comparing against.
20.5.1 Whisper and the End of Per-Language ASR
Before Whisper (Radford et al., OpenAI, 2022, arXiv:2212.04356), state-of-the-art ASR was a per-language, per-domain, per-acoustic-condition effort. You trained Kaldi or a wav2vec 2.0 model on hundreds of hours of in-domain transcribed speech, evaluated word error rate (WER) on a held-out set, and shipped the model with a long list of caveats about which microphones, accents, and noise conditions it could handle. Switching languages meant a new training run.
Whisper changed all of this with one structural decision: train an encoder-decoder transformer on 680,000 hours of weakly supervised multilingual speech-and-transcript pairs scraped from the open web. The model is a relatively standard Transformer (the encoder consumes 80-bin log-mel spectrograms, 128 bins in large-v3; the decoder emits BPE tokens), but the scale of the data plus the diversity of sources (broadcast, podcasts, lectures, casual recordings) produces a model that generalizes across 99 languages with one weight file. Whisper-large-v3 (November 2023) further improved English WER on hard benchmarks to within 1% of human transcription on clean speech and significantly closed the gap on noisy and accented speech.
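The input shapes behind that description can be worked out with a little arithmetic. The sketch below assumes the preprocessing reported in the Whisper paper (16 kHz audio, a 10 ms hop between spectrogram frames, 30-second windows, and a stride-2 convolution in front of the encoder); none of these constants are stated in the text above, and the function name is purely illustrative.

```python
# Back-of-the-envelope shapes for one Whisper input window.
# Constants follow the Whisper paper's preprocessing; treat them as assumptions here.
SAMPLE_RATE_HZ = 16_000   # audio is resampled to 16 kHz
WINDOW_SECONDS = 30       # fixed-length input window
HOP_MS = 10               # stride between successive spectrogram frames
N_MELS = 80               # mel bins per frame (original checkpoints)
ENCODER_STRIDE = 2        # the second conv layer halves the time axis


def whisper_window_shapes(sample_rate=SAMPLE_RATE_HZ, window_seconds=WINDOW_SECONDS,
                          hop_ms=HOP_MS, n_mels=N_MELS, encoder_stride=ENCODER_STRIDE):
    """Return sample, frame, and encoder-position counts for one window."""
    samples = sample_rate * window_seconds
    hop_samples = sample_rate * hop_ms // 1000
    mel_frames = samples // hop_samples
    encoder_positions = mel_frames // encoder_stride
    return {
        "samples": samples,
        "mel_frames": mel_frames,
        "spectrogram_shape": (n_mels, mel_frames),
        "encoder_positions": encoder_positions,
    }


shapes = whisper_window_shapes()
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Under these assumptions the encoder attends over a fixed number of positions per window regardless of how much speech the window contains, which is one reason silent windows still cost a full forward pass.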
The single most important lesson from Whisper is that 680k hours of weakly supervised (sometimes mistranscribed, sometimes captioned by a different speaker) data outperforms 10k hours of professionally transcribed data, by enough to compensate for the noisier labels. This is the speech-recognition version of the scaling-laws lesson from Section 7.1 of Part II: the data quantity-quality tradeoff is heavily quantity-biased once you cross a few hundred thousand hours. Whisper's encoder-decoder transformer is not architecturally novel; the novelty is in the data curation, and the resulting model is the proof that the scaling-laws thesis transfers from text to audio.
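Word error rate, the metric quoted throughout this section, is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The following is a minimal sketch; the lowercase-and-split normalization is a simplifying assumption, and real evaluations usually apply a more careful text normalizer before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {example_wer:.3f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is exactly what a hallucinating model produces on silent audio.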
20.5.2 faster-whisper, whisper.cpp, and Laptop-Scale ASR
Whisper's reference PyTorch implementation runs at roughly 1x real-time on a CPU and 5-10x real-time on a consumer GPU. For most production use cases this is too slow and too expensive. Two open-source projects dramatically improved the cost picture. faster-whisper (Klein, 2023) re-implements Whisper inference using the CTranslate2 runtime, with int8 quantization and aggressive operator fusion; it runs Whisper-large-v3 roughly 4x faster than reference PyTorch on the same GPU and roughly 2x faster on CPU. whisper.cpp (Gerganov, 2022) ports the model to pure C++ with GGML quantization, enabling inference on M1/M2 MacBooks, Raspberry Pi 5, and even smartphones. A typical iPhone 15 Pro can transcribe an hour of audio offline in roughly 8 minutes using whisper.cpp's medium-q5 quantized model.
The architectural trick in both projects is the same: aggressive quantization (int8 or 5-bit), fused attention kernels (FlashAttention or hand-tuned equivalents), and a streaming KV cache implementation that processes 30-second audio windows in parallel. faster-whisper is the default choice for a server-side Python pipeline; whisper.cpp is the default for edge and mobile deployments.
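To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It illustrates the general technique only; it is not the scheme CTranslate2 or GGML actually implement, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max error={max_error:.5f}")
```

Each weight now occupies one byte instead of two or four, and the rounding error is bounded by half the scale factor, which is why the accuracy cost is usually small relative to the memory and bandwidth savings.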
```python
# Transcribing an audio file with faster-whisper
# pip install faster-whisper
from faster_whisper import WhisperModel

# Large-v3 in int8 fits in 4 GB of VRAM; medium fits in 2 GB.
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="int8_float16",  # int8 weights, fp16 activations
)

segments, info = model.transcribe(
    "interview.wav",
    beam_size=5,                       # beam search for better quality on hard utterances
    language=None,                     # None = auto-detect from the first 30s
    vad_filter=True,                   # Silero VAD removes long silences (huge speedup)
    word_timestamps=True,              # per-word alignment via cross-attention
    condition_on_previous_text=False,  # avoid hallucination cascades
)

print(f"Language {info.language} "
      f"(prob {info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
    for word in seg.words or []:
        print(f"    {word.start:7.2f} {word.word!r}")
```
The condition_on_previous_text=False setting is critical for long-form audio: leaving it at True means each window sees the previous window's prediction as a prompt, which causes hallucination cascades when a window is silent or noisy. This is one of the most common production gotchas.
20.5.3 The 2026 Commercial ASR Landscape
Three commercial vendors dominate production ASR alongside the open Whisper ecosystem: AssemblyAI, Deepgram, and Speechmatics. Each occupies a slightly different niche.
Deepgram ships the lowest-latency streaming ASR on the market. Its Nova-3 model (2024) achieves sub-200 ms time-to-first-word and sub-100 ms incremental updates on streaming endpoints, with WER competitive with Whisper-large-v3 on English. Deepgram is the default choice for real-time voice agents, IVR replacement, and live captioning where latency matters more than the absolute best WER.
AssemblyAI ships the richest analytical layer on top of transcription. Its Universal-2 model (2024) bundles diarization (who-said-what), sentiment analysis, content moderation, named-entity detection, summary generation, and topic detection in one API call. For meeting transcription, podcast post-production, and call-center analytics, AssemblyAI's all-in-one pricing typically beats stitching three or four separate models together.
Diarization answers 'who spoke when' without knowing the speakers in advance. The audio is first split into short homogeneous segments at candidate speaker-change points, often using a neural voice-activity and change detector. Each segment is then mapped to a fixed-length speaker embedding (an x-vector or d-vector) by a network trained on speaker-verification so that same-speaker segments land close together regardless of words. Finally a clustering step, typically agglomerative or spectral clustering over the cosine similarities, groups the embeddings into speaker identities and labels every segment. End-to-end neural diarizers fold all three steps into one model and handle overlapping speech better, which classical clustering pipelines struggle with.
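The clustering stage of that pipeline can be sketched in a few lines. The example below assumes the segmentation and embedding steps have already run and uses hypothetical 2-dimensional vectors in place of real x-vectors (which typically have hundreds of dimensions); it performs average-linkage agglomerative clustering over cosine similarity and stops when no pair of clusters is similar enough. The 0.8 threshold is an illustrative choice, not a recommended value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cluster_speakers(embeddings, threshold=0.8):
    """Merge the most similar clusters until the best pair falls below threshold."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise similarity between members of the two clusters.
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = linkage(clusters[a], clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break  # remaining clusters are treated as distinct speakers
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for speaker_id, members in enumerate(clusters):
        for index in members:
            labels[index] = speaker_id
    return labels


# Hypothetical segments (start, end in seconds) and their toy embeddings.
segments = [(0.0, 2.1), (2.1, 4.0), (4.0, 6.5), (6.5, 8.0)]
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels = cluster_speakers(embeddings)
for (start, end), speaker in zip(segments, labels):
    print(f"{start:4.1f}-{end:4.1f}s  speaker {speaker}")
```

Note that the number of speakers is never passed in: it falls out of the stopping threshold, which is what lets diarization work without knowing the speakers in advance.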
Speechmatics targets enterprise and multilingual use cases. Its Ursa model (2024) supports more accents per language than competitors (e.g., 14 distinct English accent variants), ships with stronger language support for languages like Arabic, Bengali, and Mandarin, and offers on-premise deployments for regulated industries (banking, healthcare, defense) that cannot send audio to a cloud vendor.
The pricing has converged to roughly $0.005 to $0.012 per minute of audio for batch transcription and $0.012 to $0.025 per minute for streaming. Self-hosting faster-whisper on a single A10G drops this to about $0.0005 per minute amortized; the break-even point versus a commercial API is around 50 hours of audio per month.
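The break-even claim is simple arithmetic once a fixed monthly overhead for self-hosting is assumed. The sketch below is illustrative only: the text does not state that overhead, and the $16.50 figure is a hypothetical value chosen because it reproduces the roughly 50-hour break-even quoted above against a $0.006/min API. Substitute your own instance, storage, and engineering costs.

```python
def break_even_minutes(api_price_per_min, self_host_price_per_min, fixed_monthly_overhead):
    """Minutes per month at which self-hosting and the API cost the same.

    API cost:       minutes * api_price_per_min
    Self-host cost: fixed_monthly_overhead + minutes * self_host_price_per_min
    """
    saving_per_min = api_price_per_min - self_host_price_per_min
    if saving_per_min <= 0:
        raise ValueError("self-hosting must be cheaper per minute to break even")
    return fixed_monthly_overhead / saving_per_min


API_PRICE = 0.006            # $/min, batch API price from the range above
SELF_HOST_PRICE = 0.0005     # $/min, amortized A10G figure from the text
FIXED_OVERHEAD = 16.50       # $/month, hypothetical (not from the text)

minutes = break_even_minutes(API_PRICE, SELF_HOST_PRICE, FIXED_OVERHEAD)
print(f"Break-even: {minutes:.0f} min/month ({minutes / 60:.1f} hours)")
```

The break-even volume scales linearly with the fixed overhead, so a team that counts on-call engineering time as part of that overhead will see the crossover point move well beyond the headline figure.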
| Provider and model | Architecture | Streaming time-to-first-word (TTFW) | Batch WER (English) | Languages | Price (batch) |
|---|---|---|---|---|---|
| OpenAI Whisper API | whisper-1 (large-v2 derivative) | N/A (batch only) | ~5.5% | 99 | $0.006/min |
| OpenAI gpt-4o-transcribe | GPT-4o audio encoder | ~300 ms | ~5.0% | 57 | $0.006/min |
| Deepgram Nova-3 | Custom transformer | ~180 ms | ~5.7% | 30+ | $0.0043/min |
| AssemblyAI Universal-2 | Custom conformer | ~600 ms | ~5.2% | 99 | $0.012/min (analyses included) |
| Speechmatics Ursa | Custom conformer | ~700 ms | ~5.3% | 50+ | $0.010/min |
| faster-whisper (self-host) | Whisper-large-v3 | ~1.2 s | ~5.5% | 99 | ~$0.0005/min on A10G |
| Google Cloud Speech-to-Text | Chirp 2 | ~500 ms | ~6.1% | 125 | $0.016/min |
The Conformer is the speech-encoder architecture behind most modern ASR systems because it captures both global and local acoustic structure. Each block interleaves a self-attention module, which models long-range dependencies across the whole utterance, with a depthwise convolution module, which captures the fine local patterns (formant transitions, plosive bursts) that attention alone smears. The block is wrapped in two half-step feed-forward layers in a sandwich, a design borrowed from Macaron nets. This convolution-attention pairing is why Conformers beat both pure-CNN and pure-transformer encoders on word error rate, and why they appear as the acoustic backbone in production transducer and attention ASR models alike.
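Written out, one Conformer block applied to an input sequence x follows the formulation in the original Conformer paper (Gulati et al., 2020); the symbols below are that paper's notation rather than anything defined earlier in this section. FFN is a feed-forward module, MHSA is multi-head self-attention, and Conv is the depthwise convolution module.

```latex
\begin{aligned}
\tilde{x} &= x + \tfrac{1}{2}\,\mathrm{FFN}(x) \\
x' &= \tilde{x} + \mathrm{MHSA}(\tilde{x}) \\
x'' &= x' + \mathrm{Conv}(x') \\
y &= \mathrm{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\mathrm{FFN}(x'')\right)
\end{aligned}
```

The two factors of one half are the "half-step" feed-forward layers of the sandwich, and every module sits on a residual connection, so the attention and convolution paths each refine the same running representation.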
20.5.4 Real-Time vs Batch: The Architectural Split
Batch ASR processes a complete audio file and emits a transcript. Real-time (streaming) ASR processes audio as it arrives, emits a partial transcript that updates as more audio is buffered, and ultimately commits each segment when the model is confident it will not revise. The two regimes have fundamentally different architectural constraints.
Batch models can use bidirectional attention (the encoder sees the full audio in both directions), beam search over multiple hypotheses, and post-processing that depends on the full transcript (sentence punctuation, capitalization, formatting). Whisper's reference architecture is a batch model: each 30-second window is processed independently with bidirectional encoder attention, and the decoder runs beam search with a beam width of 5 by default. WER on batch transcription is typically 0.3-0.7 percentage points better than the same model in streaming mode.
Streaming models must commit to predictions before seeing the future audio. This rules out batch beam search and full-utterance bidirectional attention. Production streaming ASR uses one of two patterns: (1) chunked attention with limited right-context (the model sees 100-500 ms of future audio before emitting a token), or (2) RNN-T (transducer) architectures that are designed for streaming from the ground up. Deepgram, Google Chirp 2, and most commercial streaming endpoints use RNN-T derivatives; OpenAI's gpt-4o-transcribe uses chunked attention with a custom audio encoder.
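The first pattern, limited right-context, comes down to an attention mask. The sketch below builds one in plain Python: each query frame may attend to all past frames and to a fixed number of future frames. The 40 ms frame duration is a hypothetical value for a subsampled encoder, and unlimited left context is a simplification; production systems usually also bound the left context and process frames in chunks.

```python
FRAME_MS = 40  # hypothetical duration of one encoder frame after subsampling


def limited_lookahead_mask(num_frames, right_context_frames):
    """mask[q][k] is True when query frame q is allowed to attend to key frame k."""
    return [
        [key <= query + right_context_frames for key in range(num_frames)]
        for query in range(num_frames)
    ]


def algorithmic_latency_ms(right_context_frames, frame_ms=FRAME_MS):
    """Minimum delay before a frame can be encoded: it must wait for its lookahead."""
    return right_context_frames * frame_ms


mask = limited_lookahead_mask(num_frames=5, right_context_frames=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(f"Lookahead latency: {algorithmic_latency_ms(5)} ms")
```

Setting the right context to zero gives a fully causal encoder, and setting it to the sequence length recovers the bidirectional attention of a batch model, so this one parameter spans the latency-accuracy tradeoff described above.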
The RNN-Transducer is the dominant streaming-ASR architecture because it emits text without waiting for the full utterance. It combines three small networks: an audio encoder over the incoming frames, a prediction network that acts like a language model over emitted tokens, and a joiner that merges the two and scores, for every (frame, token-history) pair, the next token plus a special blank symbol. Emitting blank advances the audio one frame; emitting a real token advances the text. Summing over all valid blank/token alignment paths gives the training loss, and at inference the model decodes monotonically frame by frame, which is why it streams where an attention decoder that re-reads the whole input cannot.
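The blank-versus-token decoding rule can be sketched as a greedy loop. In the toy example below, `toy_joiner` is a hypothetical stand-in for the three trained networks: each frame carries hand-written per-token evidence (the encoder's role), and a repeat penalty on the last emitted token plays the role of the prediction network. Only the control flow is meant to reflect how a transducer decodes.

```python
BLANK = "<blank>"


def toy_joiner(frame, history):
    """Hypothetical joiner: score blank and every candidate token for this frame."""
    scores = {BLANK: 0.5}
    for token, evidence in frame.items():
        # Stand-in for the prediction network: discourage repeating the last token.
        penalty = 1.0 if history and history[-1] == token else 0.0
        scores[token] = evidence - penalty
    return scores


def greedy_transducer_decode(frames, joiner, max_symbols_per_frame=3):
    """Blank advances the audio by one frame; a real token advances the text."""
    history = []
    t = 0
    emitted_here = 0
    while t < len(frames):
        scores = joiner(frames[t], history)
        best = max(scores, key=scores.get)
        if best == BLANK or emitted_here >= max_symbols_per_frame:
            t += 1              # move to the next audio frame
            emitted_here = 0
        else:
            history.append(best)  # emit a token and stay on the same frame
            emitted_here += 1
    return history


# Toy acoustic evidence per frame (token -> score); values are made up.
frames = [
    {"h": 0.9},
    {"h": 0.8, "i": 0.2},
    {"i": 0.9},
    {},
]
decoded = greedy_transducer_decode(frames, toy_joiner)
print(decoded)
```

The loop never looks at a frame beyond `t`, which is the property that makes the architecture streamable; the `max_symbols_per_frame` cap is a common safeguard against a model that never emits blank.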
Whisper has a well-documented failure mode: when a 30-second window contains only silence or non-speech audio (music, traffic, applause), the model occasionally emits a plausible-sounding hallucination that has nothing to do with the audio. The most common hallucinations are repetitions of training-set phrases ("Thank you for watching", "Please like and subscribe", "Subtitles by..."). This is a serious problem for archive transcription, surveillance audio, and any pipeline where silence detection is part of the input. Always run a voice-activity detector (VAD) like Silero VAD before Whisper and skip windows that VAD scores as fully silent. faster-whisper's vad_filter=True does this automatically.
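The gating logic behind that advice is straightforward. The sketch below uses a crude RMS energy threshold as a stand-in for a real voice-activity detector such as Silero VAD; energy gating cannot tell speech from music or traffic, so treat it as an illustration of the control flow only. The function names, window length, and threshold are all hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_windows(samples, sample_rate, window_seconds=30.0, rms_threshold=0.01):
    """Yield (start_time_seconds, window) only for windows loud enough to transcribe."""
    size = int(window_seconds * sample_rate)
    kept = []
    for start in range(0, len(samples), size):
        window = samples[start:start + size]
        if rms(window) >= rms_threshold:
            kept.append((start / sample_rate, window))
    return kept


# Toy signal at 100 Hz: one second of silence, one of tone, one of silence.
RATE = 100
silence = [0.0] * RATE
tone = [0.5 * math.sin(2 * math.pi * 5 * n / RATE) for n in range(RATE)]
signal = silence + tone + silence

kept = speech_windows(signal, RATE, window_seconds=1.0)
print([start for start, _ in kept])
```

Only the windows returned by the gate would be passed to the transcription model; the skipped windows simply produce no text, which is the desired behavior for silence.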
20.5.5 ASR as a Multimodal Input
The integration pattern for ASR in a multimodal agent is converging on two architectures. The first is the cascade: audio -> ASR -> text -> LLM -> reply. The LLM sees the transcript and never sees the raw audio. The second is direct audio input: the LLM has a native audio encoder (GPT-4o's audio modality, Gemini 2 Pro's audio input) and consumes audio as one of its input modalities directly. Both have advantages.
The cascade is more flexible: the ASR layer is independently swappable, the transcript is auditable for compliance, and the LLM can use any text-only model. The direct audio approach captures prosodic information (sarcasm, emotion, hesitation, accent) that pure transcription strips out, and it cuts latency by one round trip. Production agents in 2026 typically run the cascade by default for compliance reasons (auditability matters) and offer direct audio for premium tiers where prosody preservation justifies the higher per-call cost.
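The cascade's two selling points, a swappable ASR stage and an auditable transcript, can be shown with a small sketch. Everything here is hypothetical: `CascadeAgent`, `fake_asr`, and `fake_llm` are illustrative names, and the two stub functions stand in for whichever ASR engine and text LLM a deployment actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CascadeAgent:
    """audio -> ASR -> text -> LLM -> reply, with every transcript logged."""
    transcribe: Callable[[bytes], str]   # any ASR backend with this signature
    generate: Callable[[str], str]       # any text-only LLM with this signature
    audit_log: List[Dict[str, str]] = field(default_factory=list)

    def handle(self, audio: bytes) -> str:
        transcript = self.transcribe(audio)
        reply = self.generate(transcript)
        # The LLM only ever saw `transcript`, so logging it captures the full input.
        self.audit_log.append({"transcript": transcript, "reply": reply})
        return reply


def fake_asr(audio: bytes) -> str:
    """Stub ASR stage; a real one would call an ASR model or API."""
    return "what is my balance"


def fake_llm(text: str) -> str:
    """Stub LLM stage; a real one would call a text model."""
    return f"You asked: {text}"


agent = CascadeAgent(transcribe=fake_asr, generate=fake_llm)
reply = agent.handle(b"\x00\x01")
print(reply)
print(agent.audit_log)
```

Swapping the ASR vendor means passing a different `transcribe` callable and nothing else, whereas a direct-audio model has no intermediate text to log, which is the auditability gap the paragraph above describes.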
A B2B SaaS company processing 50,000 customer-support calls per month routes every call through this pipeline: (1) Deepgram Nova-3 produces streaming transcripts for live agent assist (sub-200 ms), (2) AssemblyAI Universal-2 batch-transcribes the post-call recording with diarization, sentiment, and entity extraction, (3) a GPT-4o summarization step produces a 200-word call summary, action items, and risk flags, (4) the full bundle is indexed in a vector database for compliance review. At the batch list prices in the table above, the two transcription passes alone come to about $0.16 per 10-minute call (10 minutes at $0.0043 plus 10 minutes at $0.012), before any streaming surcharge and the LLM summarization step. Replacing the QA portion of this workflow with humans would cost $5-10 per call, so the unit economics still shift by more than an order of magnitude.
20.5.6 Multilingual and Code-Switching ASR
Whisper's biggest strength outside English is its handling of code-switching: a single utterance that mixes two or more languages ("OK so let me check my agenda, attendez une seconde, dans la salle de réunion at 3pm"). Earlier per-language ASR models routed every utterance through one language model; code-switched speech produced gibberish. Whisper, trained on weakly supervised multilingual data that occasionally contains code-switched transcripts, handles short code-switching natively without any special configuration. The 2024 commercial ASR providers caught up on this dimension; AssemblyAI Universal-2 and Speechmatics Ursa both explicitly support code-switched output.
Low-resource languages remain a challenge. Whisper-large-v3 has acceptable performance (WER under 25%) on about 80 of its 99 supported languages; the remaining 19 (including several African languages, regional dialects, and indigenous languages) have WER over 40% on out-of-domain audio. The main remedies so far have been: (1) Meta's SeamlessM4T-v2 (Seamless Communication et al., 2023), which targets 100 source and target languages for unified speech-to-text and speech-to-speech translation, and (2) community fine-tunes of Whisper for specific underserved languages (Catalan, Basque, Bengali) that match commercial English WER on in-domain data.
Whisper users in 2023 quickly noticed that on silent or noisy windows the model would sometimes emit phrases like "Thank you for watching!", "Please subscribe!", or "Subtitles by Amara.org". This is direct evidence that a meaningful fraction of Whisper's 680k-hour training corpus came from YouTube videos with auto-generated or amateur subtitles, and that the model learned to associate the audio statistics of silence-at-end-of-video with these closing phrases. It is a clean example of training-data biases bleeding into model behavior at deployment.
20.5.7 The Future: Direct-Audio LLMs
The 2024-2025 wave of multimodal LLMs (GPT-4o, Gemini 2 Pro) ships native audio encoders that bypass the ASR transcription step entirely. The model has a Whisper-style log-mel encoder fused with the LLM's transformer; audio tokens flow alongside text tokens in the same forward pass. This produces qualitatively different agent behavior: the model can detect sarcasm, identify a speaker's emotional state from voice tremor, distinguish a child's voice from an adult's, and reason about background sounds (a dog barking, an ambulance siren, a kitchen timer beeping). Pure ASR transcripts strip all of this away.
For most B2B deployments, the cascade architecture (Whisper plus a text LLM) will remain the right answer through 2026 because of auditability, cost predictability, and the maturity of the surrounding compliance tooling. For consumer voice products and real-time conversational agents, the direct-audio architecture will dominate. Section 38 on streaming multimodal covers the realtime stack in depth; here it suffices to note that ASR-as-a-separate-stage is a 2020-2024 pattern, and the direction of travel is toward audio-native LLMs that subsume it.
ASR is a commodity in 2026: Whisper-large-v3 plus faster-whisper gives you 5.5% WER on English at $0.0005 per minute self-hosted; commercial APIs (Deepgram for latency, AssemblyAI for analytics, Speechmatics for multilingual on-premise) layer on streaming, diarization, and analytics for a 10-30x markup justified by the surrounding integration value. The cascade architecture (ASR plus text LLM) remains the right default for compliance-heavy deployments; the direct-audio LLM (GPT-4o, Gemini 2 Pro) is the future for consumer voice. The single largest production gotcha is the silence-induced hallucination problem: always run a VAD before Whisper.
What Comes Next
The audio portion of Chapter 20 ends here, with audio fully covered from generation through editing through recognition. The next section, Section 20.6: Video Diffusion Transformers (DiTs), opens on the modality with the steepest 2025-2026 capability gain: video. The same flow-matching DiT machinery that powers F5-TTS (Section 20.1) and Stable Audio (Section 20.4) underpins Sora 2, Veo 3, and Runway Gen-4.
Further Reading
- Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356. The OpenAI Whisper paper; foundational for the open-source ASR ecosystem.
- Klein, G. (2023). faster-whisper: CTranslate2-Based Whisper Inference. github.com/SYSTRAN/faster-whisper. The high-performance inference reimplementation used in most production self-hosted ASR.
- Gerganov, G. (2022). whisper.cpp: Pure C++ Whisper Implementation. github.com/ggerganov/whisper.cpp. The edge and mobile deployment reference.
- Seamless Communication et al. (2023). SeamlessM4T: Massively Multilingual Multimodal Machine Translation. arXiv:2308.11596. Meta's 100-language speech-to-text and speech-to-speech model.
- Deepgram (2024). Nova-3 Technical Report. deepgram.com/learn/introducing-nova-3. The current commercial leader in low-latency streaming ASR.
- AssemblyAI (2024). Universal-2: Speech-To-Text with Built-In Analysis. assemblyai.com/blog/universal-2. The analytics-bundled commercial alternative.
- Speechmatics (2024). Ursa: Multilingual Speech Recognition for Enterprise. speechmatics.com/product/ursa. The accent-and-language-coverage leader for enterprise on-premise.
- OpenAI (2024). gpt-4o-transcribe and gpt-4o-mini-transcribe. platform.openai.com/docs/guides/speech-to-text. The 2024 direct-audio LLM transcription endpoint.