Section 20.3: Music Generation: MusicLM, MusicGen, Suno, and Udio

"In 2023 we were thrilled when a neural model could produce 30 seconds of coherent indie folk. In 2025 Suno v5 ships full three-minute pop singles with verse-chorus structure, intelligible lyrics, and a hook you cannot get out of your head. The state of the art moves in months, not years."
Echo, Melodically-Inclined AI Agent

Big Picture

Music generation followed the same trajectory as text-to-image but compressed into two years. MusicLM (Google, January 2023) was the proof of concept; MusicGen (Meta, June 2023) shipped the first open-source recipe; Suno (V1 in late 2023, V5 in October 2025) and Udio (V1 in April 2024, V3 in early 2025) turned it into a consumer category with paying users measured in the millions. The architectural core in every case is the same machinery as Chapter 20.1's TTS: a neural audio codec (EnCodec, SoundStream, DAC) compresses raw audio into a discrete token stream, and a transformer language model autoregresses over those tokens conditioned on text. What changes across systems is the codec, the conditioning, and the scale. This section unpacks all three, then walks the 2026 commercial landscape and the open-source baseline you can run on a single GPU.

Prerequisites

This section assumes the audio-codec tokenization from Section 20.1, the transformer fundamentals from Section 4.1, and an understanding of latent diffusion from Section 19.7.

20.3.1 Audio Codecs: The Tokenizer of Music

Raw 44.1 kHz stereo audio is 1.4 megabits per second; a four-minute song is 40 megabytes. No transformer is going to autoregress over 88,000 floating-point samples per second of audio. The bridge between waveforms and language models is the neural audio codec: a VQ-VAE-style encoder-decoder that compresses audio to a sequence of discrete tokens at a manageable rate (typically 50-75 tokens per second per residual codebook, with 4-32 codebooks stacked).

Think of a codec as the audio analog of subword tokenization for text. A text tokenizer converts "transformers" into something like ["transform", "ers"] at roughly 1 token per 4 characters. An audio codec converts a 16-bit waveform into a sequence of discrete codebook indices at roughly 1 token per 320 input samples. Both replace a high-bitrate raw signal (UTF-8 bytes or PCM samples) with a much shorter sequence of integers drawn from a fixed vocabulary, so an LLM can autoregress over them with manageable context length and quadratic-attention cost.

Three codecs dominate the field, each from a different lab and each with a distinct sweet spot:

EnCodec (Defossez et al., Meta, 2022) was the first widely used neural codec. It operates at 24 kHz with 8 residual codebooks at 75 Hz, giving 600 tokens per second of audio.
SoundStream (Zeghidour et al., Google, 2021) preceded EnCodec and uses a similar residual-VQ structure. MusicLM uses SoundStream tokens directly.
DAC (Descript Audio Codec, Kumar et al., 2023) supports 44.1 kHz stereo and is the codec of choice for music-quality systems. It produces about 86 tokens per second per codebook with 9 codebooks. Suno and Udio are widely believed to use proprietary codecs derived from DAC.

Key Insight

Music quality bottoms out at the codec, not the LM

The maximum audio quality a music-generation system can produce is bounded above by the reconstruction quality of its codec at the bitrate the LM autoregresses over. If your codec sounds like AM radio when you encode and decode a Beatles song without any LM in between, then any LM trained on top will produce AM-radio music no matter how big the LM is. The 2024 jump from MusicGen-quality output to Suno-quality output was driven as much by codec improvements (44.1 kHz stereo, higher bitrate, better psychoacoustic objectives) as by transformer scaling. When you are evaluating open recipes for production, listen to a pure codec round-trip first; that is the ceiling.

20.3.2 MusicGen: The Open-Source Baseline

MusicGen (Copet et al., Meta, 2023, arXiv:2306.05284) is the reference open-source music generator. It comes in 300M, 1.5B, and 3.3B parameter sizes. The architecture is a single transformer decoder that autoregresses over EnCodec tokens at 32 kHz, conditioned on text via a frozen T5 encoder (or melody via a chromagram for the melody-conditioned variant). The clever piece is the "delay pattern" interleaving: instead of generating one full codebook then the next, MusicGen interleaves codebooks with offset delays so the model never has to predict more than one codebook in a single forward pass, sharply reducing the effective vocabulary at each step and the cost of long-context modeling.

MusicGen at 3.3B can produce 30-second clips that are clearly identifiable as the prompted genre, with discernible chord progressions and rhythm. It struggles with vocals, song structure beyond a single section, and long-term coherence. As a baseline for fine-tuning or as a teaching artifact for "how does this work," it is the model to study; as a finished product, it is two years behind the commercial frontier.

Worked Example: The Delay Pattern in MusicGen

MusicGen autoregresses over EnCodec tokens with $K = 4$ codebooks at 50 Hz. Naively the model would have to predict 4 tokens per time step, which means either an enormous output head (vocabulary $1024^4 \approx 10^{12}$) or four parallel heads that ignore intra-step dependencies. The delay pattern stagger codebook $k$ by $k - 1$ time steps. For a clip of $T = 10$ frames the resulting prediction lattice looks like:

\begin{array}{c|cccccccccccc} \text{step} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ \hline \text{cb}_1 & t_1^1 & t_2^1 & t_3^1 & t_4^1 & t_5^1 & t_6^1 & t_7^1 & t_8^1 & t_9^1 & t_{10}^1 & - & - \\ \text{cb}_2 & - & t_1^2 & t_2^2 & t_3^2 & t_4^2 & t_5^2 & t_6^2 & t_7^2 & t_8^2 & t_9^2 & t_{10}^2 & - \\ \text{cb}_3 & - & - & t_1^3 & t_2^3 & t_3^3 & t_4^3 & t_5^3 & t_6^3 & t_7^3 & t_8^3 & t_9^3 & t_{10}^3 \\ \text{cb}_4 & - & - & - & t_1^4 & t_2^4 & t_3^4 & t_4^4 & t_5^4 & t_6^4 & t_7^4 & t_8^4 & t_9^4 \end{array}

where $t_i^k$ is the codebook-$k$ token for frame $i$ and the dash entries are padding. At decoding step $s$ the model predicts the next column (4 tokens) in parallel; each is conditioned on every earlier predicted token through the transformer's causal attention. The total sequence length is $T + K - 1 = 13$ steps for $T = 10$ frames of audio, and each step needs only a vocabulary of $1024$ per codebook (four parallel heads), not $1024^4$. A 30-second MusicGen clip then corresponds to $T = 1500$ frames and a sequence of $1500 + 4 - 1 = 1503$ decoding steps with $4 \times 1503 \approx 6000$ token predictions in total. Without the delay pattern the same clip would either need $6000$ sequential steps (4x more decoder calls) or sacrifice within-step dependencies entirely.

MusicGen delay-pattern interleaving: four EnCodec codebooks staggered by 0, 1, 2, 3 time steps so each decoding step predicts one fresh token per codebook in parallel — **Figure 20.3.1**: MusicGen's delay-pattern interleaving for $K = 4$ EnCodec codebooks. Codebook $k$ is shifted by $k - 1$ time steps, so codebook 1 starts at step 1, codebook 2 at step 2, and so on. Each decoding column predicts one fresh token per codebook in parallel through four output heads, each over a $1024$-entry vocabulary. The pattern preserves intra-step dependencies (codebook $k$ at step $s$ attends to all earlier predicted tokens, including codebook $k-1$ at step $s$) without exploding the per-step vocabulary to the $1024^4 \approx 10^{12}$ a joint head would require. The grey pad cells at the edges are masked out of the loss; the published MusicGen recipe uses exactly this delay-1 pattern at $T \approx 1500$ frames for a 30-second clip.

# MusicGen 30-second clip via the AudioCraft library
# pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# 3.3B parameter checkpoint; fits in 12 GB of VRAM at fp16.
model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(
    duration=30,      # 30-second clip
    top_k=250,        # MusicGen-recommended sampling
    top_p=0.0,        # top_k only
    temperature=1.0,
    cfg_coef=3.0,     # classifier-free guidance
)

prompts = [
    "an upbeat 90 BPM lo-fi hip-hop track with a warm Rhodes piano "
    "and a vinyl crackle texture, perfect for late-night studying",

    "a cinematic orchestral piece in C minor with a soaring french horn "
    "melody over pulsing strings, building to a triumphant climax",
]

wavs = model.generate(prompts)     # shape: (batch, channels, samples)
for idx, wav in enumerate(wavs):
    audio_write(f"clip_{idx}", wav.cpu(), model.sample_rate,
                strategy="loudness", loudness_compressor=True)
    print(f"Wrote clip_{idx}.wav at {model.sample_rate} Hz")

Output: Wrote clip_0.wav at 32000 Hz Wrote clip_1.wav at 32000 Hz

Code Fragment 20.3.1a: Generating two 30-second music clips with MusicGen-Large. The model autoregresses over EnCodec tokens conditioned on T5-encoded text prompts. Note strategy="loudness": MusicGen output is typically low-volume and needs LUFS normalization for production.

20.3.3 MusicLM and the Hierarchical Semantic-Acoustic Pattern

MusicLM (Agostinelli et al., Google, 2023, arXiv:2301.11325) predates MusicGen by a few months and uses a more elaborate hierarchical decomposition. It cascades three models: SoundStream produces acoustic tokens at 600 Hz, w2v-BERT produces "semantic" tokens at 25 Hz representing high-level musical structure, and MuLan (a text-music CLIP-style joint embedding model) produces conditioning tokens from the text prompt. The pipeline generates MuLan tokens (text -> joint embedding), then semantic tokens (high-level structure), then acoustic tokens (audio-quality SoundStream codes), and finally decodes to waveform. The structural decomposition is the same idea Bark uses for speech (Section 20.1), and it shows up again in Suno's reported pipeline.

Google never released MusicLM weights, but the architecture is documented and recipes for hierarchical music generation (Open MuLan, the AudioLM lineage) are public. The hierarchical pattern matters because pure end-to-end token generation over 30 seconds of 44.1 kHz audio requires very long contexts and very expensive training; factoring the problem into semantic + acoustic layers makes both parts tractable.

MusicLM hierarchical pipeline: MuLan text-music joint embedding produces conditioning tokens which condition a semantic transformer producing w2v-BERT semantic tokens which condition an acoustic transformer producing SoundStream tokens which decode to waveform — **Figure 20.3.2a**: MusicLM's hierarchical text-to-music pipeline. Three transformers cascade: MuLan converts the text prompt into a joint text-music embedding, a semantic transformer produces w2v-BERT tokens at 25 Hz that encode high-level musical structure (key, tempo, instrumentation), and an acoustic transformer produces SoundStream tokens at 600 Hz that encode the audio-quality detail. The SoundStream decoder finally reconstructs the 24 kHz waveform. Factoring through the slow semantic stream is what makes 30-second generation tractable: the semantic transformer sees only 750 tokens for the full clip, while the acoustic transformer's much longer 540 k token sequence is conditioned on a small expanded semantic context rather than autoregressed from scratch.

The hierarchical factoring is best read as a chain rule on the joint probability. Writing $m$ for the MuLan text embedding, $s_{1:T_s}$ for the semantic-token sequence (length $T_s$ at 25 Hz), and $a_{1:T_a}$ for the acoustic-token sequence (length $T_a$ at 600 Hz), MusicLM models

p(a_{1:T_a} \mid \text{text}) = \sum_{s_{1:T_s}} p_\theta(a_{1:T_a} \mid s_{1:T_s}) \, p_\phi(s_{1:T_s} \mid m), \qquad m = f_\mathrm{MuLan}(\text{text}).

At inference the sum is approximated by a single sampled draw of $s_{1:T_s}$ from $p_\phi$ followed by a sampled draw of $a_{1:T_a}$ from $p_\theta$. The two transformers train independently: $p_\phi$ on (semantic token, MuLan embedding) pairs derived from MuLan's frozen text-music encoder, and $p_\theta$ on (acoustic, semantic) pairs derived from the same training clips. Factoring through the 24x-shorter semantic stream is what makes long-range musical structure (key, tempo, instrument layout) learnable: the acoustic transformer never has to invent that structure from a text prompt directly, it just decorates the structure the semantic transformer already laid down.

Library Shortcut: Open-Source MusicLM Stand-In via MusicGen

import torch, scipy.io.wavfile
from transformers import MusicgenForConditionalGeneration, AutoProcessor

# Google never released MusicLM weights. MusicGen is the open-source replacement
# that uses the same codec-LM recipe (single transformer over EnCodec tokens).
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model = model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

prompts = [
    "calm acoustic guitar with light percussion, 90 BPM",
    "upbeat 80s synthwave with arpeggiated bass, 120 BPM",
]
inputs = processor(text=prompts, padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    audio = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                           guidance_scale=3.0)
# audio shape: (batch=2, 1, ~10.24 s at 32 kHz)

scipy.io.wavfile.write("guitar.wav",    rate=32_000, data=audio[0, 0].cpu().numpy())
scipy.io.wavfile.write("synthwave.wav", rate=32_000, data=audio[1, 0].cpu().numpy())

Code Fragment 20.3.3a: MusicGen is the open-source stand-in for MusicLM. The architectural difference: MusicGen folds MuLan, the semantic transformer, and the acoustic transformer into a single transformer over EnCodec tokens with the delay codebook pattern from Section 20.0.2.7, trading MusicLM's explicit hierarchy for a simpler training recipe. The text conditioning uses T5-Base embeddings rather than MuLan, but the downstream codec-LM mechanic is identical. max_new_tokens=512 at the 50 Hz EnCodec frame rate gives roughly 10 seconds of audio; raising it to 1500 produces the published 30-second clips.

Worked Example: MusicLM Token Budget for a 30-Second Clip

Count the tokens MusicLM allocates per second of generation, and contrast with a flat acoustic-only transformer. For a 30-second 24 kHz clip, MusicLM's three streams have lengths $T_\mathrm{MuLan} = 1$ (conditioning token), $T_s = 30 \times 25 = 750$ semantic tokens, and $T_a = 30 \times 600 = 18{,}000$ acoustic tokens (12 SoundStream codebooks across all RVQ stages, flattened to 600 tokens/sec). A flat acoustic-only transformer would need to autoregress all 18,000 acoustic tokens conditioned only on the 1 text token, which is intractable: at attention cost $O(T^2 d)$ with $d = 1024$, one forward pass would cost $\approx 18{,}000^2 \cdot 1024 \approx 3.3 \times 10^{11}$ FLOPs per layer, or about 7.9 TFLOPs across 24 layers. The semantic transformer instead pays $\approx 750^2 \cdot 1024 \cdot 24 \approx 1.4 \times 10^{10}$ FLOPs ($\approx 14$ GFLOPs), and the acoustic transformer pays its 7.9 TFLOPs only conditioned on a much sharper prior, so it converges from far less training data. The bigger payoff is musical coherence: a 30-second clip generated directly from text routinely drifts in key or instrumentation halfway through, while a 30-second clip with a pre-sampled semantic prior holds its key and lead instrument for the full duration in 9 out of 10 listener studies reported in the MusicLM paper.

20.3.4 Suno, Udio, and AIVA: The Commercial Frontier

Suno is the consumer leader. Suno V3 (March 2024) was the first system to generate full songs with vocals, lyrics, and recognizable verse-chorus structure under user control. V4 (late 2024) added longer outputs (up to 4 minutes) and better lyric intelligibility. V5 (October 2025) ships with stem outputs (separate vocal, drums, bass, and instrument stems), genre interpolation between two reference prompts, and an extend mode that adds a chorus or bridge to existing user-provided audio. Suno's reported architecture is a hierarchical codec LM (semantic + acoustic, plus a separate text-to-lyric model) trained on tens of thousands of hours of licensed music, with proprietary codec improvements.

Udio (April 2024) is the closest competitor. Udio V1 launched within weeks of Suno V3 and made comparable claims; V3 (early 2025) added precise prosody control over lyrics, instrumentals-only mode, and a "remix" feature where the user supplies a melodic motif. Udio's audio quality at the high end is often judged equal to Suno's; the systems differ more in default style biases (Udio leans cleaner and more pop-produced; Suno leans grittier and more genre-promiscuous).

AIVA (Artificial Intelligence Virtual Artist) is the longest-running commercial music AI, dating to 2016. AIVA targets composers and film scorers rather than consumers: it ships symbolic MIDI output that the user can edit in a DAW, multi-instrument scores in classical and cinematic styles, and royalty-free licensing terms designed for media production. AIVA is the right tool when you need editable score output; Suno and Udio are right when you need finished audio.

System	Year	Format	Max Length	Vocals	Stems	Notes
MusicGen 3.3B	2023	Open weights	30 s	No	No	Reference open implementation
MusicLM	2023	Closed (paper only)	5 min	Hummed only	No	Hierarchical SoundStream + w2v-BERT
Suno V5	2025	SaaS	4 min	Yes (lyric input)	Yes (V5+)	Genre interpolation, extend mode
Udio V3	2025	SaaS	4 min	Yes	Partial (vocals only)	Strong on cleaner pop production
AIVA	2016-2026	SaaS + MIDI	Unlimited	No (symbolic only)	Per-instrument MIDI	For composers, not consumers
Stable Audio 2.5	2025	Open weights + paid	3 min	No	No	Diffusion in latent audio space
Riffusion	2022-2024	Open weights	5 s loops	No	No	Spectrogram-as-image diffusion

Figure 20.3.3: 2026 music generation landscape. Suno and Udio are the consumer-grade frontier; MusicGen is the canonical open recipe; AIVA is the composer's tool; Stable Audio occupies the production-music middle.

Real-World Scenario: Production Music for a YouTube Channel

A small YouTube creator producing weekly 10-minute essays used to pay $25 per track for stock music from Epidemic Sound, totaling about $1,300 a year for 50 episodes. In 2025 the same creator subscribes to Suno Pro ($24/month) and Udio Pro ($30/month), generates 5-10 candidate tracks per episode, picks the best two, and ends up with a budget under $700 a year and signature music nobody else has. The catch is rights: Suno and Udio both grant commercial use of generated audio to paid subscribers, but the legal question of who owns AI-generated music in derivative-of-training-data terms is still unresolved in the US (per the 2024 RIAA v. Suno and RIAA v. Udio lawsuits). Most creators accept this risk for ad-supported YouTube content but avoid using AI music in brand-sponsored work.

20.3.5 Conditioning: Text, Melody, and Reference Audio

Music generation systems support four conditioning modalities, ordered from universal to controversial:

Text prompts are the universal interface. The model is trained with paired text-music data (LP captions, Spotify metadata, expert tags from MusicCaps).
Melody conditioning takes a chromagram or a humming clip and generates a full arrangement that follows the user's melodic intent. MusicGen-melody, Suno's "extend with this idea" mode, and Udio's remix feature all use this.
Reference audio conditioning passes a short clip of an existing song and asks the model to produce something stylistically similar. This is the most controversial mode because of the obvious copyright implications, and the commercial APIs add safeguards that prevent verbatim cloning of known recordings.
Lyric conditioning is unique to vocals-capable systems (Suno, Udio). The user provides text and the model synthesizes vocals that pronounce those lyrics over the generated instrumental.

Warning: The lyric-pronunciation gap

Lyric intelligibility in 2026 systems is still notably lower than spoken-word TTS intelligibility. The reason is structural: vocals are heavily processed in production (compression, reverb, doubling, harmonization), and the codec LM has to predict the produced waveform, not the dry vocal. Suno V5 and Udio V3 both ship "clean vocal" modes that produce a more intelligible but less polished result; in production you typically pick one or the other per use case rather than expecting both at the same time.

20.3.6 The Rights and Training Data Question

Every commercial music generator faces the same legal question: what music was it trained on, and what license terms apply to the outputs? The 2024 lawsuits (RIAA suing Suno and Udio, separately) argue that both systems were trained on unlicensed copies of major-label recordings, that the outputs can be made to sound like specific artists, and that this constitutes infringement at training time and at inference time. Suno and Udio's defenses center on fair use of training data and substantial transformation of outputs. The lawsuits are unresolved as of late 2025, and they will set precedent for the entire music AI category.

The technical countermeasures the platforms have adopted fall into four categories:

Training-data filtering to remove known commercial recordings (claimed by both companies, hard to verify).
Output-side filtering that blocks generations matching the audio fingerprint of any known recording (Suno added this in V4).
Opt-out registries for artists who do not want their style imitated (Suno's "Studio" program).
Revenue-sharing pilots with select labels (Udio's 2025 deal with Beatport for electronic music).

None of these resolves the underlying training-data question, but together they shape the product into something defensible in court.

Diagram showing audio codec compression, hierarchical semantic + acoustic LM stack, and the conditioning paths from text, melody, and reference audio — **Figure 20.3.4**: The canonical hierarchical music-generation pipeline: text/melody/reference conditioning feeds a semantic LM that produces high-level structure tokens, which condition an acoustic LM that produces codec tokens, which decode to waveform.

Fun Fact: The "hum-to-song" pipeline

A favorite consumer demo of 2025 is the iPhone hum-to-song workflow: open Voice Memos, hum a melody for 15 seconds, drop the clip into Udio with prompt "turn this into a stadium-rock anthem in the style of Queen with full band and harmonies." Thirty seconds later you have a four-minute single. Songwriters used to call this the "$50k demo budget" stage of pitching to a label; in 2025 it is two minutes on a phone. Whether that is empowering or terrifying depends on which side of the table you sit on.

20.3.7 Production Music and the Mid-Tier Stack

Below the consumer-finished Suno/Udio tier and above the research-baseline MusicGen tier sits the production-music middle: Stability AI's Stable Audio 2.5 (latent diffusion over audio, 44.1 kHz, up to 3 minutes), Splice's CoSo (loop-and-stem generation aimed at producers in DAWs), and Soundraw and Boomy (consumer-facing services that license out-of-the-box royalty-free tracks). Stable Audio in particular is interesting because it uses diffusion rather than codec-LM generation; the underlying autoencoder produces a latent representation at 21 Hz over which a latent DiT denoises. This is closer to image diffusion than to text-to-speech generation, and it leads to qualitatively different outputs (Stable Audio is strong at ambient, electronic, and sound-design textures; weaker at lyric-driven pop).

The mid-tier matters for production because it ships under licenses that commercial users find acceptable (full ownership of outputs, no per-use royalty), and because it integrates with DAWs (Logic Pro, Ableton Live, FL Studio) via plugins. A 2026-era audio professional typically uses several systems in a single project: Suno for first-draft songs, Stable Audio for ambient beds, AIVA for editable orchestral cues, and a human producer to glue it all together.

Key Insight

Music generation is TTS scaled to longer contexts, higher bitrates, and conditioning on melody and lyrics instead of just text. Three architectural patterns recur: codec-LM autoregression (MusicGen, Suno, Udio), hierarchical semantic-plus-acoustic generation (MusicLM), and latent audio diffusion (Stable Audio). The commercial frontier (Suno V5, Udio V3) ships consumer-finished songs with vocals and structure; open recipes (MusicGen) ship 30-second clips. Codec quality and training-data licensing are the two bottlenecks; the model architecture is mostly solved.

Self-Check

Q1: EnCodec at 24 kHz with 8 codebooks at 75 Hz produces 600 tokens per second. A four-minute Suno song is 240 seconds. Compute the total token count; compare to GPT-4's 128k-token context; and sketch why hierarchical semantic-plus-acoustic generation (Section 3) is necessary in practice for full-song output.

Show Answer

600 tokens per second times 240 seconds is 144,000 tokens, which already exceeds GPT-4's 128k context. A flat autoregression over those tokens would blow past the context window before the song finished, and the quadratic attention cost over a 144k sequence makes training and inference prohibitive. Hierarchical generation factors the problem: a semantic transformer plans the song at 25 Hz (only 6,000 tokens for the full four minutes), an acoustic transformer fills in the codec tokens at the higher rate but conditioned on the semantic plan, and a fine codebook stage cleans up the upper codebooks. Each stage operates on a sequence length that is feasible to train, while still producing four minutes of coherent audio when they are composed.

Q2: Stable Audio uses latent diffusion over a 21 Hz audio representation, while MusicGen uses codec-LM autoregression over a 50-75 Hz representation. Which architectural choice is better suited to long ambient textures, which is better suited to short lyric-driven pop, and why?

Show Answer

Latent diffusion at 21 Hz is well suited to long ambient and sound-design textures, because the slow latent rate compresses 3 minutes of audio into a tractable sequence and diffusion denoises the whole window jointly, which produces consistent timbres and slow textural evolution. Codec-LM autoregression at 50-75 Hz is better for short lyric-driven pop, because higher token rates retain enough resolution to render consonants and percussive transients, and the autoregressive structure naturally handles strict left-to-right dependencies like lyric pronunciation and bar-aligned drum hits. The trade-off is the canonical diffusion-versus-autoregression split applied to audio: diffusion wins on long-range coherence in smooth signals, autoregression wins on local precision under tight temporal constraints.

Q3: The RIAA v. Suno lawsuit alleges that the model was trained on unlicensed commercial recordings. List the technical countermeasures that would let Suno credibly defend the training-data side of the case while still producing competitive output.

Show Answer

Section 6 lists four. First, training-data filtering with audio fingerprinting (a Shazam-style hash) against major-label catalogs, so any recording matching a known commercial track is excluded before pretraining. Second, output-side filtering that blocks any generated clip whose fingerprint matches a known recording, which Suno added in V4 to deter verbatim memorization. Third, an opt-out registry that lets named artists list their style as off-limits, with style-classifier guardrails on inference. Fourth, revenue-sharing pilots with labels (Udio's Beatport deal is the public reference) that convert unlicensed training into licensed training going forward. The first three reduce technical infringement risk while keeping the long tail of public-domain, CC, indie, and licensed catalog material intact; the fourth is the only one that resolves the underlying training-data question rather than just shaping the product around it.

In the next section, Section 20.4: Audio Editing: Stems, Style Transfer, and Remixing, we continue.

What Comes Next

Section 20.4 leaves pure generation behind and asks: once we have a music or speech clip, how do we edit it? Stem separation, audio inpainting, and style transfer are the post-production sister disciplines of generation.

Further Reading

Copet, J. et al. (2023). Simple and Controllable Music Generation (MusicGen). NeurIPS 2023. arXiv:2306.05284. The MusicGen architecture and the delay-pattern interleaving trick.
Agostinelli, A. et al. (2023). MusicLM: Generating Music From Text. arXiv:2301.11325. The hierarchical semantic-plus-acoustic pattern; foundational for Suno's reported architecture.
Defossez, A. et al. (2022). High Fidelity Neural Audio Compression (EnCodec). arXiv:2210.13438. The 24-kHz residual-VQ neural codec underlying MusicGen and Bark.
Kumar, R. et al. (2023). High-Fidelity Audio Compression with Improved RVQGAN (DAC). arXiv:2306.06546. The 44.1-kHz stereo codec used by Stable Audio and (in derived form) by the music-quality commercial systems.
Suno (2025). Suno V5: Stems, Genre Interpolation, and Extend Mode. suno.com/blog/v5. The October 2025 product announcement; documents the V5 feature set used in production.
Udio (2025). Udio V3 Release Notes. udio.com/blog/v3. The V3 feature documentation including remix and instrumentals-only modes.
Stability AI (2024). Stable Audio 2: Diffusion for Music Generation. stability.ai/news/stable-audio-2-0. The latent-diffusion alternative to codec-LM music generation.
RIAA (2024). RIAA v. Suno; RIAA v. Udio. riaa.com. The 2024 lawsuits that will set legal precedent for music-AI training data.