Part VII: AI Applications
Chapter 27: Multimodal Generation

Audio, Music & Video Generation

"Give me a text prompt and I will compose you a symphony. Give me thirty seconds of your voice and I will speak in your likeness forever."

Pixel Pixel, Musically Inclined AI Agent
Big Picture

Generative AI has expanded beyond text and images into audio, music, and video. Modern text-to-speech systems produce natural-sounding voices from seconds of reference audio. Music models compose original songs in specified genres and styles. Video generation models create cinematic clips from text descriptions. These modalities share core architectural ideas with image generation (diffusion, transformers, flow matching) but introduce unique challenges: temporal coherence, audio waveform synthesis, and the enormous computational demands of high-resolution video. Together with the image generation techniques from Section 27.1, they form the complete stack of multimodal generative AI. The voice pipeline components here connect directly to the conversational AI systems from Section 21.5.

Prerequisites

This section builds on the multimodal foundations from Section 27.1. You should also be familiar with LLM API usage from Section 10.2 and prompt engineering from Section 11.1, as the audio, music, and video models in this section are accessed through the same API patterns.

1. Text-to-Speech (TTS) Systems

Text-to-speech has undergone a revolution in the past few years. Traditional concatenative and parametric systems have been replaced by neural models that produce speech nearly indistinguishable from human recordings. The modern TTS pipeline typically involves a text encoder that converts input text to phonemes or tokens, an acoustic model that generates mel-spectrograms or audio tokens, and a vocoder that converts these representations into audible waveforms.

Fun Fact

Modern voice cloning models can replicate your voice from about 3 seconds of audio. That is roughly the time it takes to say "I do not consent to voice cloning."

VITS: End-to-End Speech Synthesis

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) combines a variational autoencoder, normalizing flows, and adversarial training into a single end-to-end model. Unlike earlier two-stage approaches (text to spectrogram, then spectrogram to waveform), VITS generates raw audio directly from text, producing high-quality speech with natural prosody. It remains one of the most efficient architectures for real-time TTS. Code Fragment 27.2.1 below puts this into practice.

# Using Coqui TTS (open-source VITS implementation)
from TTS.api import TTS

# List available models
print(TTS().list_models())

# Load a VITS model for English
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Generate speech from text
tts.tts_to_file(
 text="Neural text-to-speech has made enormous progress in recent years.",
 file_path="output_vits.wav",
)

# Multi-speaker model with voice cloning
tts_multi = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts_multi.tts_to_file(
 text="Voice cloning requires only a few seconds of reference audio.",
 file_path="cloned_output.wav",
 speaker_wav="reference_voice.wav", # 6+ seconds of target speaker
 language="en",
)
> tts_models/en/ljspeech/vits
> tts_models/en/ljspeech/tacotron2-DDC
> tts_models/multilingual/multi-dataset/xtts_v2
...
> vocoder_models/en/ljspeech/hifigan_v2
[TTS] Model: tts_models/en/ljspeech/vits loaded successfully.
> Saving output to: output_vits.wav
Code Fragment 27.2.1: Using Coqui TTS (open-source VITS implementation)
Library Shortcut: OpenAI TTS API in Practice

Production-quality TTS in a few lines with the OpenAI TTS API (pip install openai):


from openai import OpenAI
client = OpenAI()

response = client.audio.speech.create(
 model="tts-1-hd",
 voice="nova",
 input="Neural text-to-speech has made enormous progress in recent years.",
)
response.stream_to_file("output.mp3")
Code Fragment 27.2.2: Production TTS with the OpenAI TTS API

Bark: Generative Audio with Paralinguistics

Bark, developed by Suno, takes a different approach by modeling speech as a sequence of audio tokens using an autoregressive transformer (similar to how GPT models text). This token-based approach naturally handles not just speech but also laughter, music, background noise, and paralinguistic cues. Bark generates semantic tokens from text, converts them to coarse acoustic tokens, then refines them to fine acoustic tokens, with each stage handled by a separate transformer. Code Fragment 27.2.3 below puts this into practice.


# Bark: token-based speech generation with paralinguistic cues
# Supports laughter, emotion, and speaker presets through text markers
from transformers import AutoProcessor, BarkModel
import scipy.io.wavfile

# Load Bark model
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")
model = model.to("cuda")

# Generate speech with paralinguistic cues
text = "Hello! [laughs] This is an example of Bark generating speech with emotion."

inputs = processor(text, voice_preset="v2/en_speaker_6")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

# Save the generated audio
sample_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_output.wav", rate=sample_rate, data=audio_array)
Code Fragment 27.2.3: Generating speech with Bark, Suno's token-based audio model that supports paralinguistic cues like laughter and emotion. Notice the voice preset parameter that selects a speaker identity, and the text markers in brackets that trigger non-speech audio effects during generation.

F5-TTS and Zero-Shot Voice Cloning

F5-TTS represents the latest generation of TTS models built on flow matching (the same technique behind Flux for image generation). It uses a diffusion transformer (DiT) architecture to generate mel-spectrograms from text, conditioned on a reference speech sample. The flow matching approach enables fast, high-quality generation with natural prosody. F5-TTS achieves remarkable zero-shot voice cloning from as little as 3 seconds of reference audio, raising important safety and ethical questions (Chapter 32) about consent and misuse. It is one of the most accessible voice cloning systems available. Figure 27.2.1 shows the modern TTS architecture.

[Figure: Input Text → Text Encoder (phonemes); Reference Audio → Speaker Encoder; both condition DiT / Flow Matching → Vocoder (mel to wav). Modern TTS Pipeline (F5-TTS / XTTS)]
Figure 27.2.1: Modern TTS architecture. Text and speaker embeddings condition a diffusion or flow matching model that generates mel-spectrograms, which a vocoder converts to audio.

Real-Time Conversational Audio

GPT-4o introduced native audio input and output, meaning the model can listen, understand, and respond with natural speech in real time. Rather than the traditional pipeline of speech-to-text followed by LLM processing followed by text-to-speech, GPT-4o processes audio tokens directly within the transformer, preserving nuances like intonation, emotion, and speaking pace. This enables sub-200ms latency for conversational applications. Kyutai's Moshi follows a similar approach as an open-source alternative, using a multi-stream architecture that processes both the user's speech and its own generated speech simultaneously, enabling natural turn-taking and even interruption handling.

Real-World Scenario: Audio Token Budget

Who: A backend engineer at a voice-first customer service startup building a real-time conversational agent.

Situation: The team was evaluating whether to use a native audio model (GPT-4o style) or a traditional speech-to-text pipeline for processing customer calls averaging 5 minutes in length.

Problem: EnCodec at 24 kHz with 8 codebooks produces approximately 75 tokens per second per codebook, giving 75 × 8 = 600 tokens/sec total. A 5-minute call would consume 180,000 audio tokens, rapidly exhausting the model's context window and inflating API costs. For comparison, the same 5 minutes of speech contains roughly 750 spoken words, which tokenize to about 1,050 text tokens. Audio is roughly 170x more token-dense than text.

Decision: The team adopted a hybrid approach: native audio processing (GPT-4o) for the active conversational turn (under 30 seconds, roughly 18,000 tokens) to preserve intonation and emotion, and a speech-to-text pipeline (faster-whisper) for the full call history to keep context costs manageable.

Result: The hybrid pipeline reduced per-call token consumption by 94% compared to full native audio processing while preserving the nuance detection that made the conversational agent effective. API costs dropped from $0.54 per call to $0.04 per call.

Lesson: The 170x token density gap between audio and text means native audio models are practical only for short segments. For long-form audio understanding, a speech-to-text pipeline remains the cost-effective default, with native audio reserved for latency-sensitive, nuance-critical interactions.
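The budget arithmetic in this scenario can be reproduced in a few lines. A minimal sketch: the codec rates come from the scenario above, while the speaking rate (150 words per minute) and tokens-per-word ratio (1.4) are assumptions chosen to match its 750-word / 1,050-token figures.

```python
# Back-of-the-envelope audio vs. text token budget for a call.
# Codec rates are from the scenario; speech and tokenizer rates
# are illustrative assumptions consistent with its numbers.

TOKENS_PER_SEC_PER_CODEBOOK = 75   # EnCodec at 24 kHz
NUM_CODEBOOKS = 8
WORDS_PER_MIN = 150                # assumed speaking rate
TEXT_TOKENS_PER_WORD = 1.4         # assumed English average

def audio_tokens(seconds: float) -> int:
    return int(seconds * TOKENS_PER_SEC_PER_CODEBOOK * NUM_CODEBOOKS)

def text_tokens(seconds: float) -> int:
    words = seconds / 60 * WORDS_PER_MIN
    return int(words * TEXT_TOKENS_PER_WORD)

call = 5 * 60  # a 5-minute call, in seconds
print(audio_tokens(call))                       # 180000
print(text_tokens(call))                        # 1050
print(audio_tokens(call) / text_tokens(call))   # ~171x denser
```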

Note

Two paradigms dominate modern TTS. Spectrogram-based approaches (VITS, F5-TTS) generate mel-spectrograms that a vocoder converts to waveforms. They offer fine-grained control over prosody and are well-understood. Token-based approaches (Bark, VALL-E, GPT-4o) discretize audio into tokens using neural codecs like EnCodec, then model speech as a sequence prediction problem. Token-based systems naturally handle non-speech sounds and enable unified multimodal models, but may produce artifacts at token boundaries. The field is converging toward token-based representations as codec quality improves.

Production Tip

Deploying Whisper for production speech-to-text. OpenAI's Whisper remains the most widely deployed speech-to-text model in 2025. For production use, consider faster-whisper (CTranslate2 backend), which runs 4x faster than the original with lower memory usage. Use whisper-large-v3-turbo for the best accuracy/speed trade-off. Key deployment patterns: (1) run the VAD (Voice Activity Detection) filter to skip silence and reduce processing time by 30 to 50%; (2) for real-time transcription, use chunked streaming with 30-second segments and 5-second overlaps to maintain context across boundaries; (3) for multi-language deployments, let Whisper auto-detect the language in the first 30 seconds rather than specifying it, as misspecification degrades accuracy. For cloud deployments, the OpenAI Whisper API ($0.006/minute) is cost-effective below 10,000 hours/month; beyond that, self-hosting on GPU instances is more economical.


# Production-ready speech-to-text with faster-whisper
# Uses CTranslate2 backend for 4x speed improvement over original Whisper
from faster_whisper import WhisperModel

# Load the turbo variant for best speed/accuracy trade-off
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Transcribe with VAD filter to skip silence
segments, info = model.transcribe(
 "meeting_recording.mp3",
 vad_filter=True, # Skip silent segments
 vad_parameters={"min_silence_duration_ms": 500},
 beam_size=5,
 word_timestamps=True, # Get per-word timing
)

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
 print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")
Code Fragment 27.2.4: Production-ready speech-to-text with faster-whisper
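The chunked-streaming pattern from the production tip (30-second windows with 5-second overlaps) reduces to computing overlapping segment boundaries before transcribing each one. A minimal sketch; the helper name and defaults are illustrative, with the window sizes taken from the tip above:

```python
# Overlapping chunk boundaries for streaming transcription:
# 30 s windows with 5 s overlap keep context across chunk edges.

def chunk_boundaries(total_sec: float, window: float = 30.0,
                     overlap: float = 5.0) -> list[tuple[float, float]]:
    step = window - overlap
    chunks = []
    start = 0.0
    while start < total_sec:
        end = min(start + window, total_sec)
        chunks.append((start, end))
        if end >= total_sec:
            break
        start += step
    return chunks

# A 70-second recording yields three overlapping chunks.
print(chunk_boundaries(70.0))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each chunk would then be passed to the transcriber, and overlapping text deduplicated at the seams.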
Library Shortcut: OpenAI Whisper API in Practice

Zero-setup transcription with the OpenAI Whisper API (no GPU required):


from openai import OpenAI
client = OpenAI()

with open("meeting_recording.mp3", "rb") as audio_file:
 transcript = client.audio.transcriptions.create(
 model="whisper-1",
 file=audio_file,
 response_format="verbose_json",
 timestamp_granularities=["word"],
 )
print(f"Language: {transcript.language}")
for word in transcript.words:
 print(f"[{word.start:.1f}s] {word.word}")
Code Fragment 27.2.5: Zero-setup transcription with the OpenAI Whisper API

2. Music Generation

Music generation with AI has progressed from simple MIDI patterns to full song production with vocals, instrumentation, and complex arrangements. The core challenge is modeling long-range temporal structure: music has hierarchical patterns from individual notes (milliseconds) to phrases (seconds) to sections (minutes) that must all be coherent.

MusicLM and MusicGen

Google's MusicLM was the first model to generate high-fidelity music from text descriptions at 24kHz. It uses a hierarchical sequence-to-sequence approach: MuLan embeddings (a music-text joint embedding model) condition semantic tokens from w2v-BERT, which then condition acoustic tokens from SoundStream. Meta's MusicGen simplified this into a single autoregressive transformer that generates EnCodec audio tokens directly, conditioned on text or melody. MusicGen introduced efficient codebook interleaving patterns that allow generating multiple codec streams with a single model pass. Code Fragment 27.2.6 below puts this into practice.


# MusicGen: generate music from a text description using Meta's
# single-pass autoregressive transformer over EnCodec audio tokens.
from audiocraft.models import MusicGen
import torchaudio

# Load the medium MusicGen model (1.5B parameters)
model = MusicGen.get_pretrained("facebook/musicgen-medium")
model.set_generation_params(
 duration=10, # seconds of audio to generate
 top_k=250, # top-k sampling: restrict to the 250 most likely tokens
 temperature=1.0,
)

# Generate from a text description
descriptions = ["upbeat jazz trio with walking bass and brushed drums"]
wav = model.generate(descriptions) # returns tensor [batch, channels, samples]

# Save as WAV file at 32kHz sample rate
torchaudio.save("jazz_trio.wav", wav[0].cpu(), sample_rate=32000)
Code Fragment 27.2.6: Generating music with MusicGen by providing a text description of the desired style, tempo, and instrumentation. The model uses Meta's audiocraft library and generates EnCodec audio tokens autoregressively from the text condition. The set_generation_params method controls duration, sampling strategy, and temperature.
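The codebook interleaving MusicGen relies on can be illustrated with its "delay" pattern: codebook k is shifted by k steps, so at generation step t the model emits codebook k's token for time frame t − k, and a single autoregressive pass covers all parallel streams. A schematic sketch over symbolic (codebook, frame) pairs rather than real tokens:

```python
# MusicGen-style "delay" interleaving: codebook k is offset by k steps,
# so one autoregressive sequence covers all parallel codebooks.

def delay_pattern(num_codebooks: int, num_frames: int):
    steps = []
    for t in range(num_frames + num_codebooks - 1):
        step = []
        for k in range(num_codebooks):
            frame = t - k
            if 0 <= frame < num_frames:
                step.append((k, frame))   # (codebook, time frame)
            else:
                step.append(None)         # padding at the edges
        steps.append(step)
    return steps

for step in delay_pattern(num_codebooks=3, num_frames=4):
    print(step)
```

The price of the delay is a few padding steps at the start and end of the sequence, which is negligible next to running a separate model per codebook.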

Suno and Udio: Full Song Generation

Suno and Udio represent the current state of the art in full song generation, producing complete songs with vocals, instrumentation, and lyrics. These commercial systems accept text descriptions of genre, mood, and style, along with optional lyrics, and generate radio-quality tracks of 2 to 4 minutes. While the exact architectures are proprietary, they likely combine text-conditioned audio generation with separate vocal synthesis and mixing stages. The quality has reached a point where generated music is difficult to distinguish from human-produced tracks in blind listening tests.

Warning

Music generation raises significant legal and ethical questions. Models trained on copyrighted music may reproduce recognizable melodies, chord progressions, or production styles. Suno and Udio face ongoing lawsuits from major record labels. When deploying music generation, consider: the training data provenance, whether outputs could constitute derivative works, the legal landscape in your jurisdiction, and whether your use case requires royalty-free generation. Using models trained exclusively on licensed or public domain music reduces legal exposure.

Tip

Before building a music generation feature into your product, consult your legal team about training data provenance. Models trained on copyrighted music expose you to derivative-work liability regardless of how "different" the output sounds. If you need royalty-free generation, restrict your pipeline to models trained exclusively on licensed or public-domain audio, and keep a written record of each model's training data sources.

3. Text-to-Video Generation

Text-to-video is arguably the most challenging generative modality, requiring the model to produce temporally coherent sequences of frames that are individually high quality and collectively tell a consistent visual story. A single second of 24fps 1080p video contains roughly 50 million pixels, compared to about 1 million for a single 1024x1024 image.
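The pixel arithmetic behind that comparison:

```python
# Pixels per second of 1080p 24 fps video vs. one generated image.
video_pixels = 1920 * 1080 * 24   # one second of footage
image_pixels = 1024 * 1024        # one 1024x1024 image

print(video_pixels)                 # 49766400, roughly 50 million
print(video_pixels / image_pixels)  # ~47x a single image
```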

The core challenge of video generation is temporal coherence: each frame must look good on its own, but consecutive frames must also form a smooth, physically plausible motion sequence. An image generator can produce a beautiful still of a dog; a video generator must ensure that the dog's legs move correctly, that the background stays consistent as the camera pans, and that objects obey basic physics across frames. This is why video generation is orders of magnitude harder than image generation, and why the field lags behind by roughly 2 to 3 years in quality. The same architectural principles that power image generation (diffusion, transformers, flow matching) apply to video, but with the added dimension of time making everything more expensive and more difficult.

Architecture: Diffusion Transformers (DiT) for Video

Most modern video generation models extend the Diffusion Transformer (DiT) architecture to handle spatiotemporal data. Instead of processing a single image's latent representation, the model processes a 3D latent volume (height, width, time). Attention operates across both spatial and temporal dimensions, either through factored attention (separate spatial and temporal attention layers) or full 3D attention. The latent space comes from a video VAE that compresses frames both spatially and temporally. Figure 27.2.2 illustrates the Video DiT architecture.

[Figure: Text Prompt → T5 / CLIP; 3D Noise → Video DiT (Spatial Attention, Temporal Attention, Cross-Attention to text) → Video VAE Decoder → Video Frames]
Figure 27.2.2: Video DiT architecture. A 3D diffusion transformer with spatial, temporal, and cross-attention layers denoises video latents, which the video VAE decodes into frames.
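Factored attention can be sketched in pure Python on a tiny latent grid: spatial attention mixes patches within one frame, temporal attention mixes the same patch position across frames. A toy single-head sketch where q = k = v = x (no learned projections), a deliberate simplification to show only the factoring; the shapes and values are illustrative.

```python
import math

# Toy factored spatiotemporal self-attention over a latent of shape
# [T frames][N patches][C channels].

def attend(xs):
    """Single-head self-attention over a list of C-dim vectors."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        out.append([sum(wi * v[c] for wi, v in zip(w, xs))
                    for c in range(d)])
    return out

def spatial_attention(latent):
    # Mix patches within each frame, frames independent.
    return [attend(frame) for frame in latent]

def temporal_attention(latent):
    # Mix the same patch position across frames, patches independent.
    T, N = len(latent), len(latent[0])
    cols = [attend([latent[t][p] for t in range(T)]) for p in range(N)]
    return [[cols[p][t] for p in range(N)] for t in range(T)]

latent = [[[1.0, 0.0], [0.0, 1.0]],   # frame 0: 2 patches, 2 channels
          [[1.0, 1.0], [0.5, 0.5]]]   # frame 1
x = temporal_attention(spatial_attention(latent))
print(len(x), len(x[0]), len(x[0][0]))   # 2 2 2: shape is preserved
```

Stacking the two factored layers costs O(T·N²) + O(N·T²) attention, versus O((T·N)²) for full 3D attention over the same volume, which is why factoring dominates at high resolution.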

Leading Video Generation Models

OpenAI's Sora demonstrated that scaling DiT architectures to video produces remarkable results, generating up to 60-second clips with consistent characters, realistic physics, and cinematic camera movements. Runway's Gen-3 Alpha focuses on controllable commercial video generation with features like motion brush and camera control. Kuaishou's Kling 2 achieves strong temporal consistency using a 3D VAE that compresses video in both space and time. Google's Veo 2 generates high-definition video with excellent prompt adherence and physical realism. Code Fragment 27.2.7 below puts this into practice.

Model Comparison

Model              | Developer | Max Duration | Resolution | Key Strength
Sora               | OpenAI    | 60 sec       | 1080p      | Temporal coherence, physics
Runway Gen-3 Alpha | Runway    | 10 sec       | 1080p      | Controllability, motion brush
Kling 2            | Kuaishou  | 10 sec       | 1080p      | 3D VAE, consistency
Veo 2              | Google    | 8 sec        | 4K         | Resolution, prompt adherence
CogVideoX          | Zhipu AI  | 6 sec        | 720p       | Open-source, extensible
Wan 2.1            | Alibaba   | 5 sec        | 720p       | Open-source, image-to-video
# Using CogVideoX (open-source) via diffusers
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
 "THUDM/CogVideoX-5b",
 torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A golden retriever running through a sunlit meadow, slow motion, cinematic"

video = pipe(
 prompt=prompt,
 num_videos_per_prompt=1,
 num_inference_steps=50,
 num_frames=49,
 guidance_scale=6.0,
).frames[0]

export_to_video(video, "golden_retriever.mp4", fps=8)
Code Fragment 27.2.7: Generating video with CogVideoX through the diffusers library, producing short clips from text descriptions.
Key Insight

Video generation's hardest problem is not frame quality (which borrows from image generation) but temporal coherence. Objects must maintain consistent appearance across frames, physics must be plausible (gravity, reflections, shadows), and camera motion must feel natural. Models achieve this through temporal attention layers, 3D latent spaces that encode time alongside space, and training on large-scale video datasets. The rapid quality improvements from Sora onward suggest that scaling compute and data for video DiT architectures is a reliable path to better results.

4. 3D Generation

3D content generation from text or images is an emerging frontier. Current approaches include score distillation sampling (SDS), which uses a pre-trained 2D diffusion model to optimize a 3D representation (NeRF or Gaussian splatting) so that it looks correct from every viewing angle. DreamFusion pioneered this approach, and subsequent work like Instant3D and LRM (Large Reconstruction Model) has made 3D generation faster and more reliable.
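Score distillation can be caricatured in one dimension: the "renderer" is a parameter viewed directly, and a frozen "denoiser" stub predicts noise that points samples toward a target the 2D model considers likely. Everything here is invented for illustration: the denoiser stub, the target, the step sizes, and the weighting w(t) = 1.

```python
import random

# 1-D caricature of score distillation sampling (SDS).
# theta: the "3D" parameters; render(theta) is the identity here.
# denoiser(x_t, sigma): a frozen stub standing in for a pretrained
# 2D diffusion model, whose predicted noise points x_t toward TARGET.

TARGET = 2.0
random.seed(0)

def denoiser(x_t, sigma):
    # Stub epsilon-prediction: removing this noise moves x_t to TARGET.
    return (x_t - TARGET) / sigma

def sds_step(theta, lr=0.05):
    sigma = random.uniform(0.5, 1.5)      # random noise level
    eps = random.gauss(0.0, 1.0)
    x_t = theta + sigma * eps             # noised "rendering"
    grad = denoiser(x_t, sigma) - eps     # SDS gradient with w(t) = 1
    return theta - lr * grad

theta = -3.0
for _ in range(500):
    theta = sds_step(theta)
print(round(theta, 2))   # converges near TARGET = 2.0
```

In this stub the injected noise cancels exactly inside the gradient; with a real imperfect denoiser it does not, which is why SDS optimization is noisy and needs many rendered viewpoints to converge.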

Multimodal Composition Pipelines

Real-world applications often combine multiple generative modalities into a single pipeline. A film production tool might use an LLM to write a script, a video model to generate scenes, a TTS model for dialogue, and a music model for the soundtrack. Orchestrating these components requires careful attention to temporal synchronization, style consistency, and resource management. Frameworks like ComfyUI provide node-based interfaces for building such pipelines, while programmatic approaches use Python to chain models together. Figure 27.2.3 shows the multimodal composition pipeline.

[Figure: LLM (script / storyboard) → Video Gen (scenes), TTS (dialogue), Music Gen (soundtrack), SFX Model (sound effects) → Compositor (sync + mix) → Final Video]
Figure 27.2.3: Multimodal composition pipeline. An LLM generates the script, which drives parallel video, speech, music, and SFX generation before final composition.
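A programmatic version of such a pipeline is essentially function composition with explicit duration bookkeeping. A sketch with stub generators: the function names, the `Clip` structure, and all durations are hypothetical, and in practice each stub would call one of the models discussed in this section.

```python
from dataclasses import dataclass

# Sketch of a multimodal composition pipeline. Every generator is a
# stub; real versions would call an LLM, a video model, a TTS model,
# and a music model respectively.

@dataclass
class Clip:
    kind: str
    seconds: float
    description: str

def write_script(prompt: str) -> list[str]:
    return [f"Scene {i + 1}: {prompt}" for i in range(2)]

def gen_video(scene: str) -> Clip:
    return Clip("video", 5.0, scene)

def gen_narration(scene: str) -> Clip:
    return Clip("audio", 4.2, f"narration for {scene}")

def gen_music(total_sec: float) -> Clip:
    return Clip("music", total_sec, "background track")

def compose(prompt: str) -> list[Clip]:
    scenes = write_script(prompt)
    videos = [gen_video(s) for s in scenes]
    narrations = [gen_narration(s) for s in scenes]
    total = sum(v.seconds for v in videos)
    music = gen_music(total)   # one track spanning the whole cut
    return videos + narrations + [music]

timeline = compose("a dog explores a meadow")
print(len(timeline), timeline[-1].seconds)   # 5 clips, 10.0 s of music
```

The bookkeeping matters: the music generator needs the total video duration, and the compositor needs per-clip timings, so duration must flow through the pipeline rather than being an afterthought.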
Note

Video generation is extremely compute-intensive. Generating a single 5-second clip at 720p can take 5 to 15 minutes on a high-end GPU. Training video models requires thousands of GPUs running for weeks. This computational cost is the primary reason video generation lags behind image generation in quality and accessibility. Cloud APIs (Runway, Sora, Kling) abstract this cost into per-second pricing, while open-source models like CogVideoX and Wan let researchers experiment on smaller scales with reduced resolution and frame counts.

Self-Check
Q1: What are the two main paradigms for modern TTS, and how do they differ?
Answer:
The two paradigms are spectrogram-based (VITS, F5-TTS) and token-based (Bark, VALL-E, GPT-4o). Spectrogram-based systems generate mel-spectrograms that a vocoder converts to waveforms, offering fine-grained prosody control. Token-based systems discretize audio into tokens using neural codecs like EnCodec and model speech as sequence prediction, naturally handling non-speech sounds and enabling unified multimodal models.
Q2: How does GPT-4o's approach to audio differ from traditional speech-to-text plus text-to-speech pipelines?
Answer:
Traditional pipelines convert speech to text, process it with an LLM, then convert the response back to speech, losing paralinguistic information (tone, emotion, pacing) at each conversion. GPT-4o processes audio tokens directly within the transformer, preserving these nuances and achieving sub-200ms latency by eliminating the separate conversion stages.
Q3: Why is temporal coherence the hardest challenge in video generation?
Answer:
Temporal coherence requires maintaining consistent object appearance, plausible physics (gravity, reflections, shadows), and natural camera motion across dozens to hundreds of frames. Unlike single images, video generation must ensure that each frame is not only individually high-quality but also consistent with all neighboring frames, creating an exponentially larger constraint space.
Q4: How does MusicGen use codebook interleaving to efficiently generate audio?
Answer:
Neural audio codecs like EnCodec produce multiple parallel streams (codebooks) of tokens at different frequency resolutions. MusicGen uses codebook interleaving patterns to flatten these parallel streams into a single sequence that one autoregressive transformer can generate. This avoids the need for separate models per codebook while maintaining audio quality across frequency bands.
Q5: What is score distillation sampling (SDS) and how does it enable text-to-3D generation?
Answer:
SDS uses a pre-trained 2D diffusion model as a critic to optimize a 3D representation (NeRF or Gaussian splatting). The 3D object is rendered from random viewpoints, and the diffusion model provides gradients indicating how to make each rendering look more like the text description. By optimizing across many viewpoints, the 3D object converges to something that looks correct from every angle, effectively lifting 2D image generation knowledge into 3D.
Real-World Scenario: Multilingual Audio Localization for an E-Learning Platform

Who: Content engineering team at a global online education company

Situation: The platform had 500+ hours of English-language course videos that needed localization into 8 languages to serve international markets.

Problem: Professional voice actors and recording studios cost $200 per finished minute per language. Localizing the entire catalog would exceed $4.8 million and take over a year.

Dilemma: Synthetic voices were affordable and fast, but learners reported lower engagement and trust with obviously robotic narration. Voice cloning from the original instructors raised consent and authenticity concerns.

Decision: The team used zero-shot voice cloning (VALL-E style) with explicit instructor consent, generating speech in target languages that preserved each instructor's vocal characteristics.

How: They extracted 10-second reference clips from each instructor, used a multilingual TTS model to generate localized audio, and ran quality checks with MOS (Mean Opinion Score) evaluation on a sample of outputs. Instructors reviewed and approved their synthetic voice profiles.

Result: Localization cost dropped to $3 per finished minute (98.5% reduction). The full catalog was localized in 6 weeks. Learner engagement metrics in localized courses matched the English originals within 5%.

Lesson: Zero-shot voice cloning makes large-scale audio localization economically viable, but ethical deployment requires explicit consent from the people whose voices are being synthesized.

Tip: Combine OCR with Vision Models for Documents

For document understanding, run OCR first to extract text, then send both the text and the image to the vision model. This gives the model two complementary signals and significantly improves accuracy on tables, forms, and handwritten text.

Key Takeaways
Research Frontier

Open Questions:

Recent Developments (2024-2025):

Explore Further: Build a multimodal composition pipeline that generates a 15-second clip with synchronized narration (TTS), background music (MusicGen), and video (CogVideoX). Measure the temporal alignment quality across modalities.

Exercises

Exercise 27.2.1: TTS Architecture Overview Conceptual

Describe the key components of a modern text-to-speech system. What is the difference between a two-stage pipeline (text-to-mel then vocoder) and an end-to-end approach?

Answer Sketch

Two-stage: a model like Tacotron converts text to a mel spectrogram, then a vocoder like HiFi-GAN converts the spectrogram to audio waveform. End-to-end: models like VALL-E or Bark generate audio tokens directly from text, often producing more natural prosody. Two-stage is more interpretable; end-to-end is simpler to deploy but harder to debug.

Exercise 27.2.2: Voice Cloning Ethics Discussion

A company wants to use voice cloning to create custom voices for their customer service bot. Discuss the ethical considerations, potential harms, and safeguards they should implement.

Answer Sketch

Considerations: consent (the voice donor must explicitly consent), misuse potential (deepfake audio for fraud), identity rights (who owns a synthetic voice?). Safeguards: require explicit consent with clear usage terms, embed audio watermarks in synthetic speech, implement speaker verification to prevent unauthorized cloning, and maintain an audit trail of all voice cloning requests.

Exercise 27.2.3: Music Generation Constraints Conceptual

Compare MusicGen and Suno in terms of architecture, output quality, and controllability. What are the current limitations of AI music generation?

Answer Sketch

MusicGen (Meta): transformer-based, generates from text prompts and optional melody conditioning, good controllability. Suno: proprietary, generates vocals and instruments, higher production quality. Limitations: long-range structure (songs lack coherent song-level form), repetitive patterns, limited genre coverage, copyright concerns with training data, and difficulty with complex musical concepts like counterpoint.

Exercise 27.2.4: Video Generation Pipeline Coding

Write Python code that generates a short video clip from a text description using a text-to-video model. Include frame rate configuration and output file saving.

Answer Sketch

Use a model like ModelScope or CogVideo via the diffusers library. Load the pipeline, set the prompt, configure num_frames and fps. Generate frames, then export as an MP4 using export_to_video(frames, 'output.mp4', fps=8). Current models generate short clips (2 to 4 seconds); longer videos require frame interpolation or sequential generation.

Exercise 27.2.5: Temporal Consistency Conceptual

Why is temporal consistency a major challenge in video generation? Explain the problem and describe two techniques that modern video models use to maintain consistency across frames.

Answer Sketch

Temporal consistency means objects should maintain their appearance, position, and physics across frames. Without it, objects flicker, change shape, or teleport. Techniques: (1) Temporal attention layers that attend across frames (not just spatially within a frame), allowing the model to maintain object identity. (2) Motion conditioning (optical flow or motion vectors) that provides explicit movement guidance, reducing random frame-to-frame variations.

What Comes Next

The next section, Section 27.3: Document Understanding and OCR, covers using multimodal models to extract information from documents, forms, and images.

Bibliography

Speech Synthesis

Wang, C., Chen, S., Wu, Y., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)." arXiv:2301.02111

Introduces the neural codec language model approach to TTS, treating speech as discrete tokens and generating them autoregressively. Demonstrates that three seconds of reference audio suffices for voice cloning. Essential for understanding the paradigm shift from traditional TTS to LLM-based speech synthesis.
Speech Synthesis

Kim, J., Kong, J., & Son, J. (2021). "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS)." arXiv:2106.06103

Combines VAE with GAN training for end-to-end speech synthesis that produces natural prosody without a separate vocoder. Remains influential as the basis for many production TTS systems. Valuable for practitioners deploying real-time speech synthesis.
Music Generation

Copet, J., Kreuk, F., Gat, I., et al. (2023). "Simple and Controllable Music Generation (MusicGen)." arXiv:2306.05284

Presents Meta's single-stage transformer approach to music generation using EnCodec tokens, with conditioning on text, melody, or both. Covers the codebook interleaving strategy that avoids cascaded models. Recommended for developers exploring creative audio applications.
Video Generation

Blattmann, A., Dockhorn, T., Kulal, S., et al. (2023). "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv:2311.15127

Extends latent diffusion to video generation, addressing temporal coherence through 3D convolutions and temporal attention layers. Covers the data curation pipeline critical for video model quality. Important for researchers working on video generation and temporal consistency challenges.
Audio Foundations

Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2023). "High Fidelity Neural Audio Compression (EnCodec)." arXiv:2210.13438

Introduces the residual vector quantization codec that converts audio waveforms into discrete tokens, enabling language model approaches to audio generation. This tokenization scheme underpins VALL-E, MusicGen, and many other audio models. Required background for understanding modern audio generation architectures.