Chapter 20: Audio, Music, and Video Generation

"Speech, music, and video are just more tokens; the trick is choosing the right vocabulary."
Echo, Pitch-Perfect AI Agent

Looking Back

Part IV trained text models. Part V opens the modalities: audio, vision, 3D, and the cross-modal reasoning that ties them together. This chapter starts with audio: Whisper, MusicGen, AudioLDM, the codec models, and the production pipelines that handle speech and music together.

Big Picture

This chapter covers generative models for time-based media. The first half covers audio: text-to-speech with VITS/Bark/F5-TTS, voice cloning, music generation with MusicLM/MusicGen/Suno/Udio, audio editing, and ASR. The second half covers video: Diffusion Transformers, the frontier video models Sora/Veo/Runway/Kling/Pika, camera and motion control, video editing, and long-form cinematic synthesis.

Chapter Overview

Audio, music, and video generation followed the same trajectory as image diffusion, compressed into roughly two years. This chapter walks the practical stack: VITS, Bark, and F5-TTS for text-to-speech; zero-shot voice cloning and conversion (the most consequential capability gain of 2022 to 2026); MusicLM, MusicGen, Suno, and Udio for music; stem-aware editing and remixing; ASR as the underrated workhorse; Diffusion Transformers for video; Sora, Veo, Runway, Kling, and Pika as the frontier; camera and motion control; editing pipelines; and the long-form, cinematic generation problem that is still genuinely open.

Audio and video AI are the modalities where capability and product-market fit moved fastest in the 2024 to 2026 window. By the end of this chapter you will know which model to reach for, where the failure modes hide, and what the cost-latency-quality envelope looks like in production.

Note: Learning Objectives

Choose between VITS, Bark, and F5-TTS for a given TTS use case and latency budget.
Apply zero-shot voice cloning ethically, including consent, watermarking, and detection considerations.
Compare MusicGen, Suno, and Udio across controllability, audio fidelity, and licensing posture.
Architect a Diffusion Transformer pipeline for spatiotemporal latent video generation.
Evaluate Sora, Veo, Runway, Kling, and Pika on the capability and accessibility axes that matter for product work.
Design camera-control and motion-control workflows using ControlNet-style conditioning for video.
Diagnose the long-form coherence problems that still bound cinematic video generation in 2026.

Prerequisites

Modern LLM landscape from Chapter 7
Transformer architecture from Chapter 3
Basic familiarity with audio formats (waveforms, spectrograms, sampling rate)

Lab 20: Build a Podcast Transcript-and-Summary Pipeline With Whisper Plus Diarization

Objective

Build a pipeline that takes a 30-minute podcast MP3 and emits a clean transcript with speaker labels, plus a structured summary (topics, quotes, action items). By the end you will have a tool you can run on your own audio library, plus first-hand intuition for Whisper's strengths and limits.

Steps

Step 1: Get the audio. Download a public-domain podcast episode (e.g., from archive.org or your favorite podcast's free RSS feed). Aim for 20 to 40 minutes, with at least 2 speakers.
Step 2: ASR with Whisper. Run openai/whisper-large-v3 via Hugging Face pipelines. Save segments with timestamps as transcript.json.
Step 3: Speaker diarization. Use pyannote/speaker-diarization-3.1 (free Hugging Face token required). Merge with Whisper segments by overlap to produce "Speaker A: ...", "Speaker B: ..." lines.
Step 4: Hand-validate. Listen to 3 minutes of the audio while reading the diarized transcript. Note the misattribution rate. Expect 5 to 15% errors on real podcasts; tune min_speakers/max_speakers if needed.
Step 5: LLM summary. Send the full transcript to GPT-4o-mini (or split it into chunks if it exceeds 128k tokens). Use a structured-output schema: {topics: list[str], key_quotes: list[Quote], action_items: list[str]}.
Step 6: Library shortcut. Try assemblyai (paid API) for the whole pipeline in 5 lines: transcript + diarization + summary in one call. Compare quality and total cost per hour of audio.
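
The "merge by overlap" in Step 3 can be sketched in plain Python. This is a minimal illustration, not the lab's reference solution: it assumes the Whisper segments and diarization turns have already been converted into simple (start, end, ...) records in seconds. The names Segment, Turn, assign_speakers, and render are hypothetical helpers introduced here, not part of Whisper or pyannote.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One ASR segment (hypothetical shape: times in seconds plus text)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarization turn (hypothetical shape: times in seconds plus label)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > best_overlap:
                best_speaker, best_overlap = turn.speaker, ov
        labeled.append((best_speaker, seg.text.strip()))
    return labeled


def render(labeled):
    """Join consecutive segments from the same speaker into 'Speaker: text' lines."""
    merged = []
    for speaker, text in labeled:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return [f"{speaker}: {text}" for speaker, text in merged]


if __name__ == "__main__":
    segments = [
        Segment(0.0, 4.0, " Welcome to the show."),
        Segment(4.0, 6.0, " Thanks."),
        Segment(6.0, 9.0, " Glad to be here."),
    ]
    turns = [Turn(0.0, 4.2, "Speaker A"), Turn(4.2, 9.5, "Speaker B")]
    for line in render(assign_speakers(segments, turns)):
        print(line)
```

Picking the turn with the largest overlap is the simplest policy. A segment that straddles a speaker change is attributed wholly to one speaker, which is one plausible source of the misattributions that Step 4 asks you to count.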

Expected Output

Expected time: 2 to 3 hours. Difficulty: beginner-to-intermediate. Artifact: a CLI that turns any podcast MP3 into a searchable transcript + summary.
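
Before the CLI prints a summary, it can help to validate the LLM's JSON reply against the Step 5 schema. The sketch below is one possible way to do that with the standard library only. The lab does not define the fields of Quote, so the speaker and text fields here are an assumption, and parse_summary is a hypothetical helper name.

```python
import json
from dataclasses import dataclass


@dataclass
class Quote:
    # Assumed fields: the lab names the Quote type but does not define it.
    speaker: str
    text: str


@dataclass
class Summary:
    topics: list[str]
    key_quotes: list[Quote]
    action_items: list[str]


def parse_summary(raw: str) -> Summary:
    """Parse a JSON string into a Summary, failing loudly on missing fields."""
    data = json.loads(raw)
    missing = {"topics", "key_quotes", "action_items"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Summary(
        topics=[str(t) for t in data["topics"]],
        key_quotes=[
            Quote(speaker=str(q["speaker"]), text=str(q["text"]))
            for q in data["key_quotes"]
        ],
        action_items=[str(a) for a in data["action_items"]],
    )


if __name__ == "__main__":
    reply = (
        '{"topics": ["ASR"], '
        '"key_quotes": [{"speaker": "Speaker A", "text": "Hello"}], '
        '"action_items": []}'
    )
    print(parse_summary(reply))
```

Rejecting a malformed reply at this boundary keeps the rest of the CLI free of defensive checks. A schema library would do the same job with less code, at the cost of an extra dependency.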

What's Next?

Next: Chapter 21: Document Understanding and OCR. Audio synthesis showed how to map text into one non-text modality; document understanding goes the other direction. Chapter 21 covers modern OCR (TrOCR, Donut), layout-aware models (LayoutLMv3), VLM-based document QA, and the production pipelines that turn scanned PDFs and invoices into structured records, often the highest-ROI use of multimodal models in the enterprise.