Section 20.1: Text-to-Speech: VITS, Bark, and F5-TTS

"A voice is a fingerprint with a heartbeat. Neural TTS in 2026 prints that fingerprint from a five-second sample and forges the heartbeat to match."
Echo, Pitch-Perfect AI Agent

Big Picture

Text-to-speech (TTS) has crossed the uncanny valley. The pipeline of 2018 (text -> phonemes -> spectrogram -> vocoder, four separate models) has collapsed into single end-to-end systems that map characters or tokens directly to waveform-quality audio. Three lineages dominate: VITS-style conditional VAEs with adversarial training, codec language models (Bark, VALL-E, Tortoise) that treat audio as a sequence of discrete tokens, and flow-matching models (F5-TTS, Voicebox) that denoise mel-spectrograms in the same way image diffusion models denoise pixels. All three families share one architectural lesson from the rest of this book: stack transformer blocks on top of a learned audio representation, and the same recipe that produces ChatGPT produces Cartesia Sonic. This section is the foundation for everything that follows in Chapter 20: cloning (Section 20.2), music (Section 20.3), editing (Section 20.4), and ASR (Section 20.5).

Prerequisites

This section assumes the transformer mechanics from Section 3.1, the tokenization and vocabulary discussion from Section 1.6, and the autoregressive next-token loss from Section 6.2. A brief detour through diffusion-model basics (the image variant covered in Section 19.7) helps with the flow-matching half of the section.

Three TTS lineages compared: VITS, codec LMs, flow matching — **Figure 20.1.1:** The three 2024 to 2026 TTS lineages. VITS (Kim et al. 2021) is a conditional VAE with adversarial training, text encoder, stochastic duration predictor, a normalizing flow on the prior, and HiFi-GAN decoder; the architectural ancestor of most open-source TTS, MOS within 0.1 of natural recordings on LJSpeech. Codec language models (Bark, VALL-E, Tortoise) treat audio as 75 tokens per second per book across 8 EnCodec codebooks and run a transformer stack to autoregress them; this is the recipe behind ElevenLabs v3, Cartesia Sonic, and PlayHT 2.0. Flow-matching models (F5-TTS at 333M parameters, Voicebox) abandon phoneme tokenizers, duration predictors, and alignment search in favor of a DiT that denoises mel-spectrograms conditioned on byte-level text, yielding 5-second zero-shot voice cloning and 0.15 real-time factor on an A100.

20.1.1 From Tacotron to End-to-End Neural TTS

Modern TTS has three jobs: convert text into a phonetic plan, predict an acoustic feature sequence (typically a mel-spectrogram or a stream of audio codec tokens), and turn that feature sequence into a waveform. Before 2021, each job was a separate neural network. Tacotron 2 (Shen et al., 2018) handled text-to-spectrogram with an attention-based seq2seq, and a WaveNet or HiFi-GAN vocoder turned the spectrogram into 16-bit PCM samples at 22 or 24 kHz. The cracks in this pipeline showed at the seams: training data mismatch between modules, exposure bias when the spectrogram predictor saw its own outputs at inference, and a long tail of failure modes where the attention would skip or repeat words.

The end-to-end shift fused these stages. VITS (Kim et al., 2021) trained the text encoder, the duration predictor, the latent z-sequence, and a HiFi-GAN-style decoder jointly using a variational lower bound plus an adversarial loss. The model never explicitly emits a spectrogram at inference; the latent z passes straight into the GAN-trained decoder, and you read waveform out the other end. This eliminated the mismatch problem and pushed the mean opinion score (MOS) within 0.1 of natural recordings on LJSpeech.

Key Insight: The audio codec is the audio tokenizer

In Section 22.1 a tokenizer maps text into discrete IDs that a transformer can consume. The 2023-2024 generation of TTS uses the same idea for audio: a neural codec (EnCodec, SoundStream, Mimi, DAC) compresses 24 kHz audio into 50 to 75 discrete tokens per second across 4 to 32 residual codebooks, and a transformer language model is trained to predict the next codec token given text. Bark, VALL-E, and Tortoise are all instances of this recipe; their differences are which codec, how many codebooks they autoregress, and how much data they trained on. Once you internalize that audio is just another vocabulary, every trick you know about LLM training (KV caching, FlashAttention, RoPE) transfers wholesale.

20.1.2 VITS: The Flow-Based End-to-End Baseline

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is the architectural ancestor of most open-source neural TTS in 2026. It treats speech generation as a conditional VAE problem with four core modules: a text encoder that produces a sequence of phoneme-level latents, a stochastic duration predictor that decides how long each phoneme should last, a normalizing flow that transforms the text-conditioned prior into a more expressive distribution over the latent z (so that the prior can match the richer posterior), and a HiFi-GAN decoder that maps z directly to waveform. The clever piece is the monotonic alignment search (MAS): a non-differentiable dynamic-programming routine inside the training loop that finds the best phoneme-to-frame alignment, eliminating the need for an external aligner like Montreal Forced Aligner.

The training objective combines a reconstruction loss on the waveform (mel-spectrogram L1 plus multi-period and multi-scale discriminators), a KL divergence between the prior and the posterior over z, and a duration loss over the phoneme-level durations. The result is a single network that takes characters or phonemes in and produces 24 kHz audio out, trainable on a single GPU in a few days on a dataset like LJSpeech.

VITS architecture diagram showing text encoder, posterior encoder, flow, monotonic alignment search, duration predictor, and HiFi-GAN decoder — **Figure 20.1.2**: The VITS architecture. The posterior encoder and the prior (text encoder plus flow) are aligned via monotonic alignment search during training. At inference, only the text encoder, flow, and HiFi-GAN decoder are active.

The VITS training objective is a conditional variational lower bound augmented with HiFi-GAN adversarial and feature-matching losses. Writing $x$ for the waveform, $c$ for the phoneme sequence, $z$ for the latent acoustic representation, $q(z \mid x)$ for the posterior encoder, and $p(z \mid c)$ for the text-conditional prior (text encoder followed by a normalizing flow), the loss is

\mathcal{L}_\mathrm{VITS} = \mathbb{E}_{q(z \mid x)}\bigl[-\log p_\theta(x \mid z)\bigr] + \mathrm{KL}\bigl(q(z \mid x) \,\|\, p(z \mid c)\bigr) + \mathcal{L}_\mathrm{dur}(c) + \mathcal{L}_\mathrm{adv} + \lambda_\mathrm{fm} \mathcal{L}_\mathrm{fm},

where the first term is the HiFi-GAN reconstruction term, the KL pulls the text-conditional prior toward the posterior given the speech, $\mathcal{L}_\mathrm{dur}$ trains the stochastic duration predictor (a normalizing flow on phoneme durations), and the last two are the adversarial and feature-matching losses on the synthesized waveform $\hat{x}$. The KL is computed with respect to the monotonic alignment search (MAS) result, which finds the discrete alignment between phonemes and latent frames that maximizes the prior likelihood given the posterior sample. Because MAS is non-differentiable, the discrete alignment is treated as a stop-gradient hard target and gradients flow only through $q$ and the flow.

Library Shortcut: VITS TTS Inference via Coqui TTS

from TTS.api import TTS
import torch

# Pretrained VITS on LJSpeech (single English speaker, 22 kHz).
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS(model_name="tts_models/en/ljspeech/vits", progress_bar=False).to(device)

# Direct waveform synthesis: text in, 22 kHz float waveform out.
wav = tts.tts(text="The quick brown fox jumps over the lazy dog.")
print(len(wav), "samples", f"({len(wav) / 22_050:.2f} s at 22 kHz)")
# 71680 samples (3.25 s at 22 kHz)

tts.tts_to_file(text="VITS skips spectrograms entirely.", file_path="out.wav")

# Multi-speaker variant: pick speaker by name from the speaker list.
vctk = TTS("tts_models/en/vctk/vits", progress_bar=False)
vctk.tts_to_file(text="Hello world.", speaker="p225", file_path="p225.wav")

Code Fragment 20.1.2a: VITS inference through the Coqui TTS wrapper. At inference time only the text encoder, normalizing flow, stochastic duration predictor, and HiFi-GAN decoder from Figure 20.1.2 are active; the posterior encoder and MAS routine exist solely for training. The LJSpeech checkpoint runs at roughly 0.05x real time on an RTX 4090; the VCTK multi-speaker checkpoint adds a one-hot speaker embedding to the duration predictor and decoder and supports 109 speakers from the original VCTK corpus.

Worked Example: VITS Latency and MOS on LJSpeech

Compare VITS to the Tacotron 2 + HiFi-GAN baseline on the LJSpeech benchmark. For a 5-second target utterance (about 25 phonemes, 215 mel frames at 22 kHz), Tacotron 2's autoregressive decoder emits one mel frame per step at roughly 700 frames/second on an RTX 4090, taking about 0.31 s, plus HiFi-GAN's roughly 0.05 s vocoder pass, for a total of about 0.36 s (real-time factor 0.072). VITS runs the entire stack in a single feed-forward pass (text encoder about 5 ms, flow about 8 ms, duration predictor about 2 ms, decoder about 35 ms) for roughly 50 ms total, a real-time factor of 0.010, around 7x faster than the Tacotron 2 pipeline. On naturalness, the published LJSpeech MOS is 4.43 for VITS versus 4.43 for ground-truth recordings (statistically indistinguishable within a 0.05 confidence interval), 4.07 for Tacotron 2 + HiFi-GAN, and 4.05 for FastSpeech 2 + MelGAN. An end-to-end model matching natural recordings on a controlled single-speaker corpus is what put VITS in the architectural foundation of XTTS, YourTTS, VITS-2, and most open-source TTS systems shipped between 2021 and 2026.

20.1.3 Bark and Codec Language Models

Suno AI's Bark (released April 2023) sits in a different lineage. Instead of a VAE plus flow, it is a stack of transformer decoders that autoregress over audio codec tokens. Bark uses EnCodec at 24 kHz with 8 residual codebooks, producing 75 tokens per second per book, or 600 tokens per second of audio total. The model decomposes this into three stages: a semantic transformer that predicts a coarse stream of semantic tokens (derived from a quantized HuBERT-like model), a coarse-acoustic transformer that predicts the first two EnCodec codebooks conditioned on the semantic stream, and a fine-acoustic transformer that fills in codebooks 3 through 8 conditioned on the first two. The fine model is non-autoregressive across codebooks but autoregressive across time.

This staged decomposition matters because audio is exquisitely high-dimensional. A naive flat autoregression over 600 tokens per second of 24 kHz speech would consume the entire context window of GPT-4 in 13 seconds of audio. Splitting semantics from acoustics, and coarse from fine, lets each stage operate at a manageable rate.

Note: Residual Vector Quantization Inside EnCodec and Mimi

The "8 codebooks at 75 Hz" recipe that powers Bark, VALL-E, and Moshi is a Residual Vector Quantizer (RVQ) bolted onto a convolutional encoder-decoder. The encoder is a stack of strided 1D convolutions: input 24 kHz waveform $\mathbf{w} \in \mathbb{R}^{T}$, output latent $\mathbf{z} \in \mathbb{R}^{T/320 \times d}$ at $d = 128$ channels. The 320x downsampling lands the frame rate at exactly 75 Hz for 24 kHz audio. The RVQ then quantizes each latent vector $\mathbf{z}_t$ with a chain of $K$ codebooks $\{\mathcal{C}_1, \ldots, \mathcal{C}_K\}$, each containing 1024 entries of dimension $d$. The chain quantizes the running residual:

r_0 = \mathbf{z}_t, \quad e_k = \arg\min_{c \in \mathcal{C}_k} \lVert r_{k-1} - c \rVert_2, \quad r_k = r_{k-1} - e_k, \quad \hat{\mathbf{z}}_t = \sum_{k=1}^{K} e_k.

The key property is that $K$ residual codebooks of size 1024 do not behave like a single flat codebook of size $1024^K$: each codebook captures a different scale of detail. Codebook 1 typically encodes the coarse spectral envelope (vowel quality, pitch register), codebook 2 captures finer prosodic and fricative detail, and later codebooks fill in the high-frequency noise components that affect perceived crispness. This is why low-bandwidth applications often drop the last few codebooks: dropping codebooks 7 and 8 cuts bitrate by 25% with a 0.1 MOS hit, while dropping codebook 1 destroys intelligibility.

Training combines a perceptual reconstruction loss (multi-scale STFT plus L1 on time-domain samples), an adversarial loss from multi-period and multi-scale discriminators (Defossez et al. use HiFi-GAN-style discriminators), and a commitment loss $\sum_k \lVert \mathrm{sg}(e_k) - r_{k-1} \rVert_2^2$ that pulls each residual toward its assigned codebook entry (the $\mathrm{sg}$ is the stop-gradient operator that prevents the codebook itself from chasing the encoder). To avoid dead codebook entries, EnCodec uses exponential-moving-average updates (each entry tracks the mean of its assigned latents with EMA decay 0.99) and a "random restart" for any entry that goes unused for too long. Mimi (Defossez et al., 2024) adds a critical twist: it freezes the first codebook of its RVQ and trains it with a distillation loss against a frozen WavLM/HuBERT encoder, so codebook 1 carries explicit semantic content. This is what makes Moshi a true speech-to-speech model: the first codebook stream is a clean semantic representation that the language model can reason over, and codebooks 2 through 8 carry the prosody and timbre needed for naturalistic synthesis. References: Defossez et al., "High Fidelity Neural Audio Compression" (EnCodec), arXiv:2210.13438 (2022); Zeghidour et al., "SoundStream," arXiv:2107.03312 (2021); Defossez et al., "Moshi," arXiv:2410.00037 (2024).

Figure 20.1.3 unrolls the residual chain: each codebook quantizes what the previous codebooks could not capture, so the running residual $r_k = r_{k-1} - e_k$ shrinks from coarse spectral envelope down to fine high-frequency detail across the $K$ stages.

Residual vector quantization chain. The latent vector z_t enters as the initial residual r_0. Codebook 1 finds its nearest entry e_1, which is subtracted to leave a smaller residual r_1. Codebook 2 quantizes r_1 to e_2, leaving r_2, and so on through codebook K. Each codebook captures a finer scale of detail: codebook 1 the coarse spectral envelope, middle codebooks prosody and fricatives, the last codebooks high-frequency crispness. The reconstructed latent is the sum of all the chosen entries e_1 through e_K. A note shows that dropping the last codebooks lowers bitrate with a small quality cost, while dropping the first destroys intelligibility. — **Figure 20.1.3**: Residual vector quantization (the tokenizer inside EnCodec and Mimi). The latent $\mathbf{z}_t$ becomes the initial residual $r_0$; each codebook $k$ snaps the running residual to its nearest entry $e_k$, then that entry is subtracted, so $r_k = r_{k-1} - e_k$ shrinks at every stage. The reconstruction $\hat{\mathbf{z}}_t = \sum_k e_k$ improves with more codebooks. Because codebook 1 captures the coarse spectral envelope and later codebooks fill in fine detail, low-bandwidth systems can drop the last few codebooks cheaply, but dropping codebook 1 destroys intelligibility, which is exactly why Mimi makes codebook 1 a distilled semantic stream.

Real-World Scenario: Bark, in production, today

The Bark recipe (or a close cousin) ships behind several popular voice products. ElevenLabs v3 (2024) is a closed-weights codec LM with proprietary improvements in expressivity and pacing. Cartesia Sonic (2024-2025) replaces the transformer backbone with a state-space model (Mamba-2 derivative) to achieve sub-100 ms time-to-first-byte on streaming TTS. PlayHT 2.0 ships a multilingual codec LM with voice cloning. Open-source Bark itself remains the cheapest way to prototype any codec-LM-based product without negotiating API keys.

20.1.4 F5-TTS: Flow Matching Meets TTS

F5-TTS (Chen et al., 2024, arXiv:2410.06885) is the 2024 state of the art in zero-shot open-weight TTS. It abandons both VAE and codec-LM lineages in favor of a flow-matching transformer that denoises mel-spectrograms conditioned on text. The "F5" in the title stands for "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching" (five F-words; the authors lean into the pun). It is a 336M-parameter DiT (Diffusion Transformer; see Section 33.1 for the video DiT variant) trained on the Emilia multilingual dataset (~95k hours).

The architectural simplification is striking. F5-TTS has no phoneme tokenizer, no duration predictor, no alignment search, and no vocoder beyond an off-the-shelf BigVGAN. Text is byte-level. Audio conditioning at inference (the reference voice clip) is concatenated with masked target frames, and the DiT learns to fill in the missing mel frames given the text. Inference is 16 to 32 flow-matching steps, similar in cost to Stable Diffusion 3 at the same parameter count. The model handles zero-shot voice cloning from a 5-second prompt out of the box, code-switches between languages, and runs at roughly 0.15 real-time factor on a single A100.

# Synthesizing speech with Bark (Suno-AI's open-source codec LM)
# pip install bark
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and cache the three transformer stages (semantic, coarse, fine).
# First call pulls ~5 GB; subsequent calls are local.
preload_models()

# Bark supports inline tags: [laughs], [sighs], [music], (whispering),
# as well as multilingual text and a small set of curated "history prompts".
prompt = (
    "Hello, my name is Suno. And, uh, I like pizza. [laughs] "
    "But I also like other things, like reading books and watching films."
)

audio_array = generate_audio(prompt, history_prompt="v2/en_speaker_6")
write_wav("bark_output.wav", SAMPLE_RATE, audio_array)
print(f"Generated {len(audio_array) / SAMPLE_RATE:.1f} seconds at {SAMPLE_RATE} Hz")

Output: Generated 11.3 seconds at 24000 Hz

Code Fragment 20.1.1a: Bark synthesis with the open-source bark library. The history_prompt selects from Suno's curated voice library; pass None to sample a random voice instead. Bark is unique among open TTS in supporting nonverbal sounds ([laughs], [sighs]) directly in the prompt because its semantic transformer was trained on transcripts that include them.

20.1.5 Prosody, Multi-Speaker, and Emotion Control

Naturalness is no longer the bottleneck; expressivity is. The 2024-2026 generation of TTS competes on prosody (pitch contour, stress, pacing), emotion (calm, angry, sad, excited), and speaker fidelity in multi-speaker contexts. There are four main control mechanisms in production.

The first is speaker embeddings, where a separate speaker-verification network (a wav2vec-XLSR or a similar self-supervised model) produces a 256-to-512-dimensional vector that conditions the TTS decoder. This is how VITS-2 and YourTTS extend single-speaker models to multi-speaker zero-shot synthesis. The second is reference-prompt conditioning, where a clip of target audio is concatenated with the input mel and the model learns to copy the style; F5-TTS, VALL-E, and NaturalSpeech 3 all use this trick. The third is natural-language style prompts: PromptTTS and Audiobox accept text like "an angry old man speaking quickly with a slight British accent" alongside the content text. The fourth is fine-grained controls exposed at inference: ElevenLabs v3 ships an "emotion" slider and a "stability" slider that act as classifier-free guidance weights on internal style channels.

Warning: Sliders are not language

The most common TTS production bug is treating the emotion slider as if it composes with itself. Setting "stability=0.1, similarity=0.9" gives a different output distribution than "stability=0.9, similarity=0.1", but neither is necessarily what a director means by "more angry, less crisp." Always A/B at least three random seeds per slider configuration when shipping a new voice; the variance across seeds at fixed sliders is often larger than the effect of moving a slider by a quarter.

20.1.6 The 2026 TTS Landscape: Open vs. Commercial

The commercial-open gap on raw audio quality is narrower than on most other modalities, but production reliability still favors the API vendors. Figure 20.1.4 summarizes the field as of late 2025.

Model	Year	License	Zero-Shot Clone	Languages	Streaming Latency
F5-TTS	2024	CC-BY-NC 4.0 (weights)	5 s reference	Multilingual (Emilia)	Batch only
Bark	2023	MIT	Limited (preset voices)	13 languages	~3 s (offline)
VITS-2 / XTTS-v2	2023	CPML (XTTS)	6 s reference	16 languages	~500 ms
ElevenLabs v3	2024	Closed API	30 s pro clone	32 languages	~250 ms first byte
Cartesia Sonic	2024	Closed API	3 s reference	15 languages	~75 ms first byte
PlayHT 3.0	2025	Closed API	10 s reference	40 languages	~200 ms first byte
OpenAI tts-1-hd	2024	Closed API	No (6 preset voices)	Multilingual	~400 ms first byte

Figure 20.1.4a: Representative TTS systems in late 2025. Latency numbers are author-measured for streaming endpoints on a single short utterance; production latency varies with prosody complexity, region, and concurrency.

# Comparison: F5-TTS zero-shot synthesis (open weights, mel + flow matching)
# pip install f5-tts
from f5_tts.api import F5TTS

f5 = F5TTS(
    model_type="F5-TTS",            # or "E2-TTS" for the smaller variant
    vocoder_name="vocos",           # 24 kHz, BigVGAN also supported
)

# Reference clip + its transcript anchor the voice and prosody.
# Five seconds of clean speech is enough; longer hurts more than it helps.
ref_audio = "alice_5sec.wav"
ref_text = "Hello, this is Alice and I am testing the synthesis system."

target_text = (
    "The flow matching denoiser inside F5-TTS predicts the velocity "
    "of the mel-spectrogram, not the spectrogram directly."
)

wav, sr, _ = f5.infer(
    ref_audio=ref_audio,
    ref_text=ref_text,
    gen_text=target_text,
    nfe_step=32,                     # number of flow-matching steps
    cfg_strength=2.0,                # classifier-free guidance scale
    speed=1.0,
)
import soundfile as sf
sf.write("f5_output.wav", wav, sr)
print(f"Generated {len(wav) / sr:.1f}s at sr={sr}")

Output: Generated 8.4s at sr=24000

Code Fragment 20.1.3a: Zero-shot voice cloning with F5-TTS. The reference audio plus its transcript condition the flow-matching DiT; nfe_step trades quality for latency in the same way num_inference_steps does in Stable Diffusion. The model handles cross-lingual transfer if gen_text is in a different language than ref_text.

20.1.7 Prosody from an LLM: The Emerging Pattern

A 2025 architectural trend that will dominate 2026 is using an LLM to generate the prosody plan and only delegating the acoustics to the TTS model. The pipeline looks like: (1) GPT-4o or Claude takes the text and an instruction ("read this as if you are calming a child after a nightmare"), and emits the same text annotated with SSML-like tags for pacing, emphasis, and emotion shifts; (2) a smaller TTS model (Cartesia or F5-TTS) executes that prosody plan. The factorization is appealing because the LLM already knows narrative pacing, sarcasm, and dialogue structure from its pretraining corpus, while the acoustic model has a much narrower job. Audiobox (Meta, 2023) and Voicebox (Meta, 2023) prototyped this with end-to-end joint training, but the production pattern of the API era is to keep the LLM and the TTS as separate services and rely on a structured intermediate representation between them.

Fun Fact: The voice that lives in 30 megabytes

F5-TTS at int8 quantization fits in 333 MB of weights and runs on a M2 MacBook. The same architecture that requires a data center to serve at ElevenLabs scale (50k concurrent streams, sub-100 ms TTFB) is, at the model-weights level, smaller than a single Premiere Pro render. The constraint is throughput, not capability.

Key Insight

Three TTS lineages cover the field: VITS-style VAE-flow for tight end-to-end training on bounded data, codec language models like Bark for prompt-driven expressivity at the cost of latency, and flow-matching DiTs like F5-TTS for zero-shot cloning with simple architectures. Commercial APIs (ElevenLabs, Cartesia, PlayHT) wrap one of these recipes behind sub-second TTFB streaming, multilingual coverage, and SLAs that open weights cannot match. The acoustic gap is small; the production gap is real.

Self-Check

Q1: Bark autoregresses over EnCodec tokens at 75 tokens per second per codebook with 8 codebooks. How many tokens does Bark generate per minute of audio, and what does that imply for the context window of the coarse-acoustic transformer if you want to generate 30-second clips in a single forward pass?

Show Answer

75 tokens per second times 8 codebooks times 60 seconds is 36,000 tokens per minute, or 18,000 tokens for a 30-second clip. Because Bark splits the work, the coarse-acoustic transformer only carries the first two codebooks (150 tokens per second), so a 30-second clip needs roughly 4,500 acoustic tokens plus the semantic-token conditioning. The transformer therefore needs a context window of at least 6k to 8k tokens to hold a full 30-second forward pass with margin for the semantic prefix; this is why Bark sets the coarse model's positional limit near 1024 frames and why longer clips have to be generated as overlapping chunks rather than one pass.

Q2: F5-TTS uses flow matching instead of variational inference. Sketch why flow matching avoids the posterior-collapse and KL-balancing problems that plague conditional VAE TTS, and connect this to the rectified flow discussion in Section 31.1.

Show Answer

Variational inference factorizes generation through a latent z and trains by maximizing a lower bound that contains a KL term against a prior. When the decoder is strong (a HiFi-GAN or large DiT), the KL term collapses to zero, the latent stops carrying information, and you have to hand-tune a KL weight to keep the posterior alive. Flow matching sidesteps this entirely: it learns the velocity field that transports a Gaussian prior to the data distribution, with no latent variable and no KL balance to maintain. The loss is a simple regression to the conditional optimal-transport velocity, the same rectified-flow recipe used for video in Section 31.1, so the only knob is the number of sampling steps. F5-TTS inherits this stability and reaches strong quality with a 333M-parameter DiT and 100k hours of speech.

Q3: Why does the LLM-prosody-planner-plus-small-TTS factorization in Section 7 work better in production than a single large end-to-end model, even though end-to-end models theoretically reach a lower joint loss?

Show Answer

A joint end-to-end model has to learn narrative pacing, sarcasm, dialogue turn-taking, and acoustic detail all from speech data, where the supervision for high-level pragmatics is extremely sparse. The factored pipeline uses an LLM that has already absorbed those pragmatics from the entire web at pretraining, then hands a structured prosody plan (SSML-like tags for pacing, emphasis, emotion) to a 300M-parameter acoustic model whose only job is to render that plan into 24 kHz audio. The factored loss is higher on a held-out joint task, but the LLM's prior on narrative structure dominates in deployment, latency is lower because the acoustic stage is small, and each component can be A/B-tested and swapped independently. End-to-end joint training (Audiobox, Voicebox) prototyped the idea, but production reliability favors keeping the LLM and the TTS as separate services with a structured intermediate representation.

In the next section, Section 20.2: Voice Cloning, Zero-Shot TTS, and Voice Conversion, we continue.

What Comes Next

Section 20.2 takes the zero-shot cloning capability of F5-TTS and asks the harder question: when is cloning a person's voice a consent violation, and how do we build pipelines that respect that boundary while still letting users dub their own audiobooks? After that, Section 20.3 pivots from speech to music: the same codec-LM and flow-matching ideas that produce ElevenLabs and F5-TTS also produce Suno and Udio.

Further Reading

Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML 2021. arXiv:2106.06103. The VITS paper that established end-to-end neural TTS.
Chen, Y. et al. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. arXiv:2410.06885. The state-of-the-art open-weight zero-shot TTS system at the time of writing.
Suno AI (2023). Bark. github.com/suno-ai/bark. The reference codec-LM TTS implementation; MIT-licensed weights and inference code.
Wang, C. et al. (2023). Neural Codec Language Models are Zero-Shot Text-to-Speech Synthesizers (VALL-E). arXiv:2301.02111. Microsoft's three-second prompt cloning system; the first prominent codec-LM TTS.
Le, M. et al. (2023). Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. NeurIPS 2023. arXiv:2306.15687. Meta's flow-matching TTS that pioneered the F5 lineage.
Cartesia AI (2024). Sonic: Mamba-Based State-Space TTS for Real-Time Conversational AI. cartesia.ai/blog/sonic. The first production codec-LM-style TTS built on state-space models instead of transformers.
ElevenLabs (2024). v3 Technical Notes and Voice Library. elevenlabs.io/docs. The commercial reference for expressive multilingual TTS in 2024-2025.
OpenAI (2024). Audio API: tts-1 and tts-1-hd. platform.openai.com/docs/guides/text-to-speech. The simplest production TTS endpoint, useful as a low-friction baseline.