Voice Cloning, Zero-Shot TTS, and Voice Conversion

Section 20.2

"In 2018, cloning a voice required four hours of clean studio recording. In 2024, it requires five seconds and a YouTube link. The technology has run ahead of consent law, and the ethics work has to catch up."

EchoEcho, Voice-Curious AI Agent
Big Picture

Zero-shot voice cloning is the most consequential capability gain in audio AI of the 2022-2026 era. The same flow-matching DiT (F5-TTS) and codec language models (VALL-E, XTTS-v2) that power Section 20.1's text-to-speech pipeline also let a user clone a voice from a five-second reference clip and synthesize arbitrary new utterances in that voice. Two engineering tracks dominate this section: capability (how do we make zero-shot cloning faithful, multilingual, and fast) and safety (how do we prevent fraud, deepfake-driven scams, and unauthorized cloning of public figures). Both tracks pull on the same model, and a 2026-grade production system has to ship both.

Prerequisites

This section assumes the codec-LM and flow-matching TTS pipelines from Section 20.1 and the speaker-embedding intuition from Section 19.2. Deepfake-detection and watermarking background covered later in the book deepens the safety discussion here.

20.2.1 From Speaker Encoders to Prompt-Conditioned Cloning

The earliest neural voice cloning (Jia et al., 2018) factored the problem into three networks: a pretrained speaker encoder that mapped a reference audio clip to a 256-D embedding, a Tacotron 2 synthesizer conditioned on that embedding, and a WaveNet vocoder. The speaker encoder was trained with a generalized end-to-end loss (Wan et al., 2018) on a speaker-verification objective; it had never seen the synthesizer's loss. The decoupling worked but produced perceptibly artificial clones, especially on prosody and rhythm.

The 2023 generation collapsed this factorization into a single model. VALL-E (Wang et al., 2023) trained a codec LM end-to-end with the reference clip and its transcript as a prefix, so the model learned to copy not just timbre but also speaking rate, breath patterns, microphone characteristics, and even the room acoustics of the reference. XTTS-v2 (Coqui, 2023) shipped the same idea as an open model that clones zero-shot from a short reference clip and can optionally be fine-tuned per speaker on a single GPU. F5-TTS (Section 20.1) made the masking explicit: at training, the model sees a reference clip with its text alongside masked target frames with the target text, and learns to fill in the masked frames in the reference's voice. This text-guided speech-infilling objective is what lets prompt-conditioned cloning emerge from training; the name itself comes from the paper's title, "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching".
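The infilling setup described above can be made concrete with a small NumPy sketch. It is illustrative only: the helper names `make_infilling_example` and `masked_mse` are hypothetical, the reference is simply placed before the target, and a plain mean-squared error stands in for the flow-matching objective the real model trains with.

```python
import numpy as np

def make_infilling_example(ref_mel, tgt_mel, ref_text, tgt_text):
    """Build one text-guided infilling example (hypothetical helper).

    ref_mel, tgt_mel: arrays of shape (frames, n_mels).
    Returns the masked model input, the boolean frame mask,
    the unmasked ground truth, and the joint transcript.
    """
    full = np.concatenate([ref_mel, tgt_mel], axis=0)
    mask = np.zeros(full.shape[0], dtype=bool)
    mask[ref_mel.shape[0]:] = True      # hide only the target frames
    model_input = full.copy()
    model_input[mask] = 0.0             # masked frames carry no audio
    text = ref_text + " " + tgt_text    # transcript covers both spans
    return model_input, mask, full, text

def masked_mse(pred, target, mask):
    """Loss counted only on masked frames (stand-in for the real objective)."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

ref = np.ones((3, 2))        # 3 reference frames, 2 mel bins
tgt = 2.0 * np.ones((2, 2))  # 2 target frames to be filled in
x, mask, full, text = make_infilling_example(ref, tgt, "hello there", "good morning")
print(mask.tolist())  # [False, False, False, True, True]
```

Because the loss ignores the unmasked reference frames, the only way to reduce it is to generate target frames that are consistent with the visible reference, which is the behavior exploited at inference.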

Key Insight: Cloning is masked language modeling on audio

Once audio is a sequence of codec tokens or mel frames, zero-shot voice cloning is structurally identical to in-context learning for LLMs (Section 7.3). The reference clip and its transcript are the few-shot prefix. The target text is the query. The synthesized audio is the completion. The model copies style for the same reason an LLM copies the format of a few-shot prompt: the loss it minimized at training rewards it for treating the prefix as a style template. The 2018 multi-network factorization was an externally engineered version of what one transformer learns implicitly when trained on the right data.

20.2.2 Anatomy of a Five-Second Clone

What does five seconds of audio actually contain that the model can use? At 24 kHz, 5 seconds is 120k PCM samples. After EnCodec compression at 8 codebooks of 75 tokens per second per book, that is 3,000 tokens, comparable to a 500-word prefix in a text LLM. Inside that prefix the model can extract: voice timbre (F0 statistics, formant structure, vocal-tract length), pacing (average phoneme duration, pause distribution), emotional tone (energy variance, micro-prosody), accent (vowel formants, consonant articulation), and microphone or room characteristics. It cannot extract: long-term narrative pacing, dialogue switching behavior, or content-conditional emotion shifts; those require longer references.
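The arithmetic in this paragraph is easy to reproduce. The sketch below uses only the figures quoted above (24 kHz audio, 75 codec frames per second, 8 codebooks); the function name `prompt_budget` is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 24_000  # PCM sample rate quoted in the text
FRAME_RATE_HZ = 75       # codec frames per second, per codebook
NUM_CODEBOOKS = 8        # residual codebooks per frame

def prompt_budget(seconds):
    """Return sample, frame, and token counts for a reference clip."""
    samples = int(seconds * SAMPLE_RATE_HZ)
    frames = int(seconds * FRAME_RATE_HZ)
    tokens = frames * NUM_CODEBOOKS
    return {"samples": samples, "frames": frames, "tokens": tokens}

for seconds in (3, 5, 30):
    print(seconds, prompt_budget(seconds))
# 5 s -> {'samples': 120000, 'frames': 375, 'tokens': 3000}
```

Note that the 3,000 tokens correspond to only 375 time steps, since each frame carries 8 tokens, one per codebook.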

The empirical sweet spot for a single voice is 5-30 seconds. Below 5 s, the speaker-verification embedding is noisy and the clone wanders. Above 30 s, marginal returns drop sharply, and the clip is more likely to contain noise, music, or a second speaker that confuses the model. Production systems (ElevenLabs Pro, Cartesia Voice Library) take 1-3 minutes of clean speech and fine-tune a small per-voice adapter on top of the base codec LM; this gives a quality bump comparable to going from a 7B base to a domain-fine-tuned 7B in the LLM analogy.

20.2.3 Cross-Lingual Voice Transfer

One of the most striking capabilities of the 2024 generation is cross-lingual cloning: a five-second clip of an English speaker is enough to synthesize the same voice speaking fluent Japanese, Hindi, or Polish. The mechanism that makes this work is data-driven rather than architectural. F5-TTS, XTTS-v2, and VALL-E X are all trained on multilingual corpora where the same speaker rarely appears in multiple languages; the model is forced to learn a representation that disentangles voice identity from language. At inference, you provide an English reference and a target text in Japanese, and the model conditions on the timbre and prosodic style of the reference while pulling Japanese phonotactics from its broader training distribution.

The disentanglement is imperfect. Accent leakage is the most common failure: the clone of an American English speaker reading Japanese will have a slight American accent on certain morae, particularly on long vowels and pitch-accent shifts. For most consumer use cases (game NPCs, audiobook localization with a single narrator across markets) this is a feature; for high-stakes cases (broadcast-quality Japanese narration for a documentary) it is a bug, and the production fix is a small monolingual fine-tune on top of the base.

Real-World Scenario: Audiobook Localization at Spotify

Spotify's 2024 audiobook localization launch used OpenAI's voice cloning API to dub English-language audiobooks into Spanish, French, German, Portuguese, and Italian while preserving the original narrator's voice. The pipeline was: (1) GPT-4 translates the manuscript with stylistic markup, (2) the cloning model synthesizes per-paragraph audio conditioned on a 30-second consented reference from the original narrator, (3) a separate music-and-effects layer is preserved from the English master, (4) human review catches accent leakage and mispronunciations before publication. The published result was indistinguishable from a human dub for monolingual listeners, with the original narrator's permission and a per-language royalty share.

# Cross-lingual voice cloning with F5-TTS
# Assumes a checkpoint whose training data covers the target language. The
# public base checkpoint was trained mainly on English and Chinese (Emilia),
# so a Spanish target needs a multilingual or fine-tuned checkpoint.
from f5_tts.api import F5TTS
from resemblyzer import VoiceEncoder, preprocess_wav
import soundfile as sf

# Constructor arguments differ between f5-tts releases; check your version.
f5 = F5TTS(model_type="F5-TTS", vocoder_name="vocos")

# Reference: a short English clip from the consenting narrator
ref_audio = "narrator_consent_clip.wav"     # 8 s of clean speech
ref_text = "This is my voice and I consent to its use for the audiobook."

# Target text in a different language. F5-TTS consumes the text as a
# character sequence, so pronunciation quality depends on the languages
# the loaded checkpoint was trained on.
spanish_target = (
    "En el principio creó Dios los cielos y la tierra. "
    "La tierra estaba desordenada y vacía."
)

wav, sr, _ = f5.infer(
    ref_file=ref_audio,          # the API names this argument ref_file
    ref_text=ref_text,
    gen_text=spanish_target,
    nfe_step=32,
    cfg_strength=2.5,            # slightly higher CFG for cross-lingual
    speed=0.95,                  # slow down to reduce accent artifacts
)
sf.write("narrator_spanish.wav", wav, sr)

# Verify identity preservation by computing speaker similarity
enc = VoiceEncoder()
ref_emb = enc.embed_utterance(preprocess_wav(ref_audio))
gen_emb = enc.embed_utterance(preprocess_wav("narrator_spanish.wav"))
cosine_sim = (ref_emb @ gen_emb) / (
    (ref_emb @ ref_emb)**0.5 * (gen_emb @ gen_emb)**0.5
)
print(f"Speaker similarity (English ref vs. Spanish synth): {cosine_sim:.3f}")
# Values > 0.80 are good; < 0.65 means the clone has drifted.
Output: Speaker similarity (English ref vs. Spanish synth): 0.847
Code Fragment 20.2.1: Cross-lingual voice cloning followed by an automatic speaker-similarity check using Resemblyzer. A similarity above 0.80 typically means the cloned voice will sound right to human listeners; below 0.65 you should drop to a slower speed or fine-tune a per-speaker adapter.
Fun Fact

A five-second clip is enough to clone a human voice in 2026, which means the bar for "voice authentication" has fallen below the bar for "leave a voicemail". The same five seconds that used to be small talk ("hey, it's me, call me back") is now a biometric breach. Banks that still ask "say your verification phrase" are essentially asking the attacker to record their own credential. The industry is quietly moving to behavioral signals (timing, conversation flow, knowledge questions) because the audio itself can no longer prove anything.

20.2.4 Voice Conversion vs. TTS Cloning

Voice cloning (TTS-style) takes text and produces audio in a target voice. Voice conversion (VC) takes existing audio of a source speaker and converts it to sound like a target speaker, preserving the source's words, pacing, and emotion. The two problems are related but distinct, and they call for different architectures. The 2024 state of the art in VC is RVC (Retrieval-based Voice Conversion) for community use and FreeVC for research-grade quality. Both share the same three-stage structure: extract a speaker-independent content representation (using HuBERT or ContentVec), retrieve or synthesize speaker-specific acoustic features, and resynthesize audio with a HiFi-GAN-style decoder.

Under the Hood: Retrieval-based Voice Conversion (RVC)

Voice conversion first strips speaker identity out of the source audio. A self-supervised encoder like HuBERT or ContentVec produces frame-level content features that capture phonetic content and prosody but suppress timbre, since these models were trained to predict masked speech units rather than who is speaking. RVC adds a retrieval step: at conversion time it replaces each content frame with its nearest neighbor from a stored index of the target speaker's training frames, snapping the trajectory onto the target's acoustic manifold and reducing timbre leakage. A HiFi-GAN decoder then resynthesizes a waveform from the retrieved features, preserving the source's words and rhythm in the target's voice.
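The retrieval step can be sketched in a few lines of NumPy. This is a brute-force illustration under stated assumptions, not RVC's implementation: the function name `retrieve_and_blend` is hypothetical, and the `index_rate` blend weight is an assumption that lets the sketch interpolate between pure replacement (1.0) and no retrieval (0.0). A real system would use an approximate nearest-neighbor index over many more frames.

```python
import numpy as np

def retrieve_and_blend(source_feats, target_index, index_rate=1.0):
    """Snap each source content frame toward its nearest target-speaker frame.

    source_feats: (T_src, D) content features from the source audio.
    target_index: (T_tgt, D) stored features from the target speaker.
    index_rate:   1.0 replaces each frame outright; lower values blend.
    """
    # Pairwise squared Euclidean distances, shape (T_src, T_tgt)
    diffs = source_feats[:, None, :] - target_index[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    nearest = target_index[dist2.argmin(axis=1)]
    return index_rate * nearest + (1.0 - index_rate) * source_feats

source = np.array([[0.0, 0.0], [10.0, 10.0]])
index = np.array([[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]])
print(retrieve_and_blend(source, index).tolist())
# [[1.0, 1.0], [9.0, 9.0]]
```

The frame order of the source is untouched, which is why words and rhythm survive while the features themselves move onto the target speaker's acoustic manifold.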

VC is the right tool when you have a source performance with the prosody and emotion you want, and you just need a different voice. Dubbing actors, video-game character voice swaps, and post-production ADR all use VC. TTS cloning is the right tool when you have a script and you need a voice; audiobook localization, agent voices, and accessibility readers all use TTS cloning.

Under the Hood: Self-supervised speech representations (wav2vec 2.0 / HuBERT)

Self-supervised speech encoders learn phonetic structure from raw audio without transcripts, which is why their features capture content while suppressing speaker identity. wav2vec 2.0 masks spans of a latent feature sequence and trains the model with a contrastive loss to pick the true quantized unit for each masked frame against distractors. HuBERT instead runs offline k-means clustering on features to produce pseudo-labels, then trains a BERT-style masked-prediction loss to classify the cluster id of masked frames, iterating the clustering as features improve. Because both objectives reward predicting linguistic units rather than reconstructing the waveform, the resulting frame embeddings encode phonemes and prosody but largely discard timbre, the exact property voice conversion exploits.
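As a rough illustration of the wav2vec 2.0 objective described above, the sketch below scores one masked frame: the context vector should be closer (in cosine similarity) to the true quantized unit than to the distractors. The function names and toy vectors are hypothetical, and the temperature of 0.1 is an assumed setting for the sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(context, true_unit, distractors, temperature=0.1):
    """Negative log-probability of picking the true unit among candidates."""
    candidates = [true_unit] + list(distractors)
    logits = np.array([cosine(context, q) for q in candidates]) / temperature
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])                    # index 0 is the true unit

true_unit = np.array([1.0, 0.0])
distractors = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]

good = contrastive_loss(np.array([1.0, 0.0]), true_unit, distractors)
bad = contrastive_loss(np.array([0.0, 1.0]), true_unit, distractors)
print(good < bad)  # True: matching the true unit gives the lower loss
```

Nothing in this loss asks the model to reproduce the waveform or the speaker's timbre; it only rewards identifying the right unit, which is consistent with the speaker-suppressing behavior voice conversion relies on.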

20.2.5 Deepfake Detection and Watermarking

The other side of cheap cloning is cheap voice fraud. A 2023 Federal Trade Commission analysis found that voice-cloning-driven scams (typically impersonating a family member in distress to extract a wire transfer) rose 300% year over year. The technical countermeasures fall into three buckets: (1) passive detection models that classify a clip as real or synthetic based on subtle artifacts, (2) active watermarking built into the synthesis model so every generated clip carries an imperceptible signature, and (3) provenance metadata via C2PA Content Credentials that travel with the audio file.

Passive detection is fragile. The 2024 ASVspoof challenge showed that detectors trained on one generation of synthesis models lose 30-50% of their accuracy on the next generation. Watermarking is more robust but only works if the generating model cooperates; open-source models can be patched to skip the watermark step. The most reliable defense in 2026 is provenance: when a phone call originates from a known authenticated source, the receiving device can trust it; when it does not, the user is alerted regardless of detection score.
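The provenance-first policy in this paragraph amounts to a small decision rule. The following sketch is a hypothetical illustration (the function name, labels, and threshold are all assumptions): verified provenance is the only path to trust, and the detector score merely adds detail to the warning.

```python
def triage_call(has_verified_provenance, detector_score, threshold=0.5):
    """Decide how to present an incoming call to the user.

    has_verified_provenance: True if the call carries authenticated origin data.
    detector_score: synthetic-speech probability from a passive detector (0..1).
    """
    if has_verified_provenance:
        return "trust"
    # No provenance: alert regardless of the detection score.
    if detector_score >= threshold:
        return "warn: unverified source, likely synthetic"
    return "warn: unverified source"

print(triage_call(True, 0.9))    # trust
print(triage_call(False, 0.1))   # warn: unverified source
```

The design choice is that a low detector score never upgrades an unverified call to trusted, which keeps the policy robust when detectors fall behind a new generation of synthesis models.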

Warning: The "code word" protocol

The most effective consumer-grade defense against voice-cloning scams remains a non-technical one: families agree on a code word that is never written down or used in casual conversation, and any phone call asking for urgent money must include the word. This works because the cloning attack can copy timbre and prosody but not memory. Several US banks now ship this advice in fraud-warning materials alongside multi-factor authentication for voice-banking systems.

20.2.6 Consent Workflows

Any production deployment of voice cloning needs a consent workflow, and the legal landscape increasingly requires one both in the US (under emerging right-of-publicity statutes like Tennessee's ELVIS Act, 2024) and in the EU (under the AI Act's transparency requirements for synthetic media). The standard production pattern has five steps.

First, the speaker reads a fixed consent script aloud, in their natural voice, including the date and a specific authorization phrase ("I, Jane Doe, consent on April 12 2025 to the cloning of my voice by Acme Corp for the purpose of audiobook narration"). Second, the clip is stored with a cryptographic hash and a signed timestamp. Third, the cloning model is trained or fine-tuned on top of additional voice data, never on the consent clip alone. Fourth, every synthesized output is watermarked and logged with a reference to the consent record. Fifth, the speaker has a kill-switch mechanism (typically a web portal) that revokes consent and triggers takedown of dependent published content.
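The second step (hash plus signed timestamp) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the record layout and function names are hypothetical, and an HMAC with a server-held key stands in for what would more likely be an asymmetric signature and a trusted timestamping service in production.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def make_consent_record(clip_bytes, speaker, purpose, signing_key, now=None):
    """Hash the consent clip and sign the hash together with a timestamp."""
    now = now or datetime.now(timezone.utc)
    payload = {
        "clip_sha256": hashlib.sha256(clip_bytes).hexdigest(),
        "speaker": speaker,
        "purpose": purpose,
        "timestamp": now.isoformat(),
    }
    message = json.dumps(payload, sort_keys=True).encode("utf-8")
    signature = hmac.new(signing_key, message, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def verify_consent_record(record, signing_key):
    """Recompute the signature and compare in constant time."""
    message = json.dumps(record["payload"], sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

key = b"demo-key-not-for-production"   # hypothetical signing key
record = make_consent_record(b"abc", "Jane Doe", "audiobook narration", key)
print(verify_consent_record(record, key))  # True
```

Each synthesized output can then log the `clip_sha256` value as its reference to the consent record, so the record remains checkable even if the raw audio is later deleted.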

Skipping any step is a legal exposure. ElevenLabs adopted this five-step pattern in 2024 after a series of public-figure cloning incidents; Cartesia, PlayHT, and OpenAI's Realtime Voice all ship variants of it. The cost of the workflow is a one-time five-minute onboarding per voice; the benefit is that the cloning service can defend itself if a downstream user produces fraudulent content.

| Layer | Defense Mechanism | Robustness | Friction |
|---|---|---|---|
| Model | Watermarking (e.g., AudioSeal) | Medium (defeats casual misuse) | Low (server-side) |
| Detection | Classifier on the clip (ASVspoof, Pindrop) | Low (generation outpaces it) | Low |
| File | C2PA Content Credentials | High (if metadata preserved) | Medium (tooling immaturity) |
| Workflow | Consent script + signed log | High (legal defensibility) | Medium (5-min onboarding) |
| Channel | Code word + multi-factor | High (per-user) | High (user effort) |
Figure 20.2.2: Layered defenses for voice cloning misuse. No single layer is sufficient; a 2026-grade production deployment ships at least the model-watermark, file-provenance, and consent-workflow layers, and recommends the code-word layer to end users.
Fun Fact: The "I am not a robot" of 2027

The standard CAPTCHA was a clever asymmetry: humans were better than computers at reading distorted text. As LLMs caught up, CAPTCHAs evolved to image grids and behavioral signals. Voice authentication may follow the same arc. Several 2025 prototypes ask the caller to repeat a randomly generated phrase including made-up words ("Please say: tropoglystic eldermarine vortexion"), since cloning models trained on natural speech corpora struggle with phonotactically novel utterances. The cat-and-mouse will continue, but the asymmetry gap is real and exploitable for now.
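A challenge generator of the kind described here could look like the sketch below. Everything in it is a hypothetical illustration: the syllable inventory is invented for the example, and real prototypes may construct their phrases quite differently.

```python
import random

ONSETS = ["tr", "gl", "v", "m", "st", "pl", "k", "z"]
VOWELS = ["a", "e", "i", "o", "u"]
CODAS = ["", "n", "x", "st", "r", "l"]

def pseudo_word(rng, syllables=3):
    """Concatenate random onset-vowel-coda syllables into a made-up word."""
    return "".join(
        rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)
        for _ in range(syllables)
    )

def challenge_phrase(n_words=3, seed=None):
    """Return a phrase of made-up words for the caller to repeat.

    With seed=None the OS random source is used, so phrases are unpredictable;
    a fixed seed is only for reproducible testing.
    """
    rng = random.SystemRandom() if seed is None else random.Random(seed)
    return " ".join(pseudo_word(rng) for _ in range(n_words))

print("Please say:", challenge_phrase())
```

The phrase must be generated fresh for every call; a reused challenge could simply be recorded and replayed.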

Key Insight

Voice cloning is in-context learning for audio: a five-second reference clip is a few-shot prompt that conditions a codec LM or flow-matching DiT to copy timbre, prosody, and accent. The capability is real, multilingual, and cheap. The corresponding safety work (consent workflows, watermarking, provenance, layered detection) is mature in API products and immature in open-source. A production deployment in 2026 ships both the capability and the safety stack; one without the other is a lawsuit waiting to happen.

Lab: Zero-Shot Voice Cloning with F5-TTS and Speaker-Similarity Scoring
Duration: ~60 minutes | Level: Intermediate

Objective

Clone a reference voice from a 10-second clip using the open-source F5-TTS model and synthesize 20 target sentences, then measure speaker similarity between the cloned speech and the reference using a pretrained speaker-verification embedding. The point is to feel both how easy zero-shot cloning has become and how to score it with the metric the safety community actually uses to detect deepfakes.

Setup

You need a CUDA-capable GPU (8 GB or more), the F5-TTS checkpoint (SWivid/F5-TTS on Hugging Face, paper at arXiv:2410.06885), and a speaker-verification model (the microsoft/wavlm-base-plus-sv checkpoint on Hugging Face). For the reference audio, record 10 seconds of your own voice or use the LibriTTS test-clean speaker set.

pip install f5-tts torchaudio transformers torch numpy scipy

Steps

  1. Prepare the reference. Trim a 10-second clip from a single speaker. Resample to 24 kHz mono. Transcribe the reference clip exactly; F5-TTS conditions on the reference text plus audio.
  2. Synthesize 20 target sentences with F5-TTS using the reference. The target sentences should cover varied phonetic content; the Harvard sentence list is the canonical starting point.
  3. Compute speaker embeddings for the reference and for each synthesized clip using WavLM-base-plus-sv. The cosine similarity between embeddings is the speaker-similarity score the deepfake-detection literature reports as SECS (Speaker Embedding Cosine Similarity).
  4. Report SECS distribution. Plot a histogram of the 20 SECS values. Published F5-TTS results on LibriTTS show median SECS above 0.65 against an in-distribution speaker, dropping for out-of-distribution speakers and recordings with significant background noise.
  5. Ablate the reference length. Re-run with 3-second and 30-second references and observe how SECS scales; under 5 seconds is where quality typically falls off, which is the empirical reason ElevenLabs and the production-grade cloning APIs ask for at least 30 seconds.
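Steps 3 and 4 reduce to a cosine similarity between one reference embedding and a batch of synthesized-clip embeddings. The sketch below assumes you have already extracted the embeddings as NumPy arrays (for example from the speaker-verification model named in Setup); the function names and the toy vectors are hypothetical.

```python
import numpy as np

def secs(ref_emb, gen_embs):
    """Cosine similarity between a reference embedding and each generated one.

    ref_emb:  shape (D,), embedding of the reference clip.
    gen_embs: shape (N, D), one embedding per synthesized clip.
    """
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    return gen @ ref

def summarize(scores):
    """Summary statistics to report alongside the histogram."""
    return {
        "median": float(np.median(scores)),
        "min": float(scores.min()),
        "max": float(scores.max()),
    }

# Toy 2-D embeddings standing in for real speaker-verification outputs
ref_emb = np.array([1.0, 0.0])
gen_embs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
scores = secs(ref_emb, gen_embs)
print(np.round(scores, 3).tolist())  # [1.0, 0.0, 0.707]
```

Writing `scores` to a CSV and passing it to a histogram plot gives the two artifacts listed under Expected Output; running the same function on the 3-second and 30-second references covers the ablation in step 5.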

Expected Output

A folder of 20 WAV files plus a CSV of SECS values and a histogram PNG. The interesting reading of the result is the gap between subjective "this sounds like the same person" and the SECS number; both are useful signals, and the consent-workflow patterns in this section depend on tracking both.

Extension

Add an audio watermark to the F5-TTS output using the open AudioSeal model (Meta, 2024) and verify that the watermark survives MP3 compression and trimming. The watermark plus SECS score is the technical implementation of the consent workflow described in subsection 20.2.6.

Self-Check
Q1: If the speaker-similarity cosine on a clone falls from 0.85 (good) to 0.55 (poor) as soon as you switch from English to Mandarin, which of these four factors carried by the five-second reference (timbre, pacing, emotion, mic/room) is the model failing to disentangle, and what is the cheapest fix?
Answer:
The collapse is on pacing rather than timbre. Mic and room characteristics are language-independent and timbre is the easiest factor for a speaker-verification network to lock onto, so they survive the language switch. Pacing, by contrast, is tangled with the language's phonotactics: an English-trained pacing pattern shoves Mandarin syllables into the wrong rhythm, and the verification model reads that prosodic mismatch as a different speaker. The cheapest fix is to drop the synthesis speed (the code fragment uses speed=0.95 for cross-lingual generation) and raise classifier-free guidance slightly, which lets the multilingual prior reassert the target language's rhythm; if the gap persists, a 30-second monolingual fine-tune of a per-speaker adapter closes it without retraining the base.
Q2: Watermarking is "medium robustness" because open-source models can be patched to skip it. Sketch a watermarking design that an open-source model cannot trivially remove, and explain the data and architecture costs of training it.
Answer:
A removable watermark sits in a separate post-processing stage that the user can comment out. A non-removable watermark is baked into the generative loss itself: a small detector network is trained jointly with the synthesizer, the synthesizer is rewarded when the detector recovers the watermark bits from its output, and the detector is trained to be robust to codec compression, resampling, and equalization. AudioSeal is the closest public reference for the training recipe: it jointly trains a watermark generator and a detector under robustness augmentations, although AudioSeal itself is applied to finished audio rather than embedded in the synthesizer. The architectural cost is the auxiliary detector and an extra adversarial loss term; the data cost is collecting the augmentation distribution (codec roundtrips, room impulse responses, additive noise) under which the watermark must survive. The economic catch is that removing the watermark now requires fine-tuning the entire synthesizer with a competing loss, which is far more effort than commenting out a post-processing step, so casual misuse is deterred.
Q3: The consent workflow in subsection 20.2.6 stores a cryptographic hash and signed timestamp of the consent clip. Why is this materially different from just keeping a copy of the clip, and how does it interact with right-to-be-forgotten requests under GDPR?
Answer:
A hash plus signed timestamp lets you prove, after the fact, that a specific clip existed at a specific moment without retaining the raw audio. When a user files a GDPR Article 17 right-to-be-forgotten request, you can delete the clip and any derived training data while still keeping the hash and the signature, which is sufficient legal evidence that consent was given before the user revoked it. Storing the raw clip indefinitely would obligate you to delete it on request and would erase the very record you need for legal defensibility. The hash-plus-signature pattern is the audio analog of how authentication systems store password hashes rather than passwords, and it is why ElevenLabs, Cartesia, and OpenAI all converged on the same five-step workflow.

What Comes Next

Section 20.3 leaves the human voice behind and asks: can the same codec-LM and flow-matching machinery generate music? Suno and Udio say yes; MusicGen ships the open recipe.

Further Reading
  • Jia, Y. et al. (2018). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. NeurIPS 2018. arXiv:1806.04558. The speaker-encoder-plus-Tacotron lineage; foundational for what zero-shot cloning replaced.
  • Wang, C. et al. (2023). Neural Codec Language Models are Zero-Shot Text-to-Speech Synthesizers (VALL-E). arXiv:2301.02111. Microsoft's 3-second-prompt clone; the prompt-conditioning paradigm.
  • Coqui (2023). XTTS: A Massively Multilingual Zero-Shot Text-to-Speech. github.com/coqui-ai/TTS. Open-source 16-language clone; the practical reference.
  • San Roman, R. et al. (2024). Proactive Detection of Voice Cloning with Localized Watermarking (AudioSeal). ICML 2024. arXiv:2401.17264. Meta's localized audio watermark that survives codec compression and editing.
  • FTC (2024). Voice Cloning Challenge: Preventing the Misuse of Cloning Technology. ftc.gov/voice-cloning-challenge. Policy framing and incentive-prize design for misuse prevention.
  • Tennessee General Assembly (2024). Ensuring Likeness, Voice, and Image Security (ELVIS) Act. tn.gov. First US state statute explicitly criminalizing unauthorized voice cloning.
  • C2PA (2024). Content Credentials Technical Specification 2.0. c2pa.org/specifications. Provenance-metadata format that lets verified-source audio retain authorship across edits.
  • Pindrop (2024). 2024 Voice Intelligence and Security Report. pindrop.com. Industry data on voice-cloning fraud rates and call-center mitigation.