Section 20.4: Audio Editing: Stems, Style Transfer, and Remixing

"Generation gets the headlines; editing pays the rent. The unsexy work of cleaning, separating, repairing, and remixing audio is where most production AI dollars actually go."
Echo, Audio-Editing AI Agent

Big Picture

Two acronyms recur throughout this section: DSP is Digital Signal Processing, the classical (pre-deep-learning) toolbox of filters, transforms, and resamplers; DAW is Digital Audio Workstation, the host application (Pro Tools, Logic, Ableton, etc.) into which these models ship as plugins. With those in hand: editing is the part of the audio AI stack that touches real customer workflows. Stem separation (Demucs, MDX-Net) splits a mixed song into vocals, drums, bass, and other; audio inpainting fills in clicks, dropouts, and unwanted sounds; style transfer reshapes a recording's timbre and ambience; and time-frequency operations (pitch shift, time stretch, denoising) get a quality bump from neural replacements for classical DSP. This section walks the four families, shows where each fits in a 2026 production pipeline, and connects the techniques to the diffusion and codec-LM machinery from Sections 32.1-32.3.

Prerequisites

This section assumes the music-generation foundations from Section 20.3, the source-separation and audio-codec topics in Section 20.1, and the diffusion-inpainting intuition from Section 19.7.

20.4.1 Stem Separation: Demucs, MDX-Net, and the Music Source Separation Frontier

Stem separation (or music source separation, MSS) takes a stereo mixed track and outputs separate channels for the constituent sources: typically vocals, drums, bass, and "other" (everything else). Before 2020 this was a heuristic-driven mess (Wiener filtering, REPET, spectral masking); after 2020 it became a neural problem with reliably high quality, especially for the vocal track. The 2024 state of the art is a small set of related architectures: Demucs v4 (Rouard, Massa, and Defossez, Meta, 2022 with successive refinements), MDX-Net (Kim et al., KUIELab, 2021-2024, an evolution of a top-ranked entry in the Sony-organized Music Demixing Challenge 2021), and the BS-RoFormer family (Lu et al., 2024) that recently set new benchmarks on the MUSDB18 test set.

Demucs v4 (the "Hybrid Transformer Demucs" or "htdemucs" variant) is the open-source default. It is a U-Net with a hybrid time-domain/spectrogram representation: one branch operates on the waveform with 1D convolutions, a parallel branch processes the spectrogram with convolutions along the frequency axis, and a cross-domain transformer at the bottleneck lets the two branches exchange information through self-attention within each domain and cross-attention between them. Training is supervised on MUSDB18 plus additional internal data; the model emits four 44.1 kHz stereo stems given a mixed input. On modern hardware, htdemucs separates a 4-minute song in under 30 seconds on a single GPU.

Key Insight: Why separation is a multimodal masking problem

Source separation looks like a regression problem (predict four output waveforms from one input waveform), but it is fundamentally a masking problem over a learned representation. The U-Net or transformer extracts a representation of the mix, predicts a per-source soft mask, applies that mask to the representation, and inverts the representation back to time domain. The mask is the unknown the model is trained to learn. This is structurally identical to image-segmentation masks (Section 33.4 covers the video version) and to attention masks in transformers (Section 3.5): the model is deciding what to keep and what to suppress. Once you see the masking thesis, the connection between separation and inpainting in the next section becomes obvious.

The Separation Objective: SI-SDR

Naming Demucs and BS-RoFormer tells you which models to download but not what they are trained to maximize. A naive choice is mean-squared error between the estimated stem $\hat{s}$ and the reference $s$. That objective has a fatal flaw for audio: it is sensitive to overall gain. An estimate that recovers the vocal perfectly but at half the loudness is musically correct yet incurs a large MSE, so the loss would push the model to match absolute amplitude rather than waveform shape. The field's answer is the scale-invariant signal-to-distortion ratio (SI-SDR), which first projects the estimate onto the reference to absorb any gain mismatch, then measures the residual. Define the projection of the estimate onto the reference and the residual error as

$$ s_{target} = \frac{\langle \hat{s}, s\rangle}{\lVert s\rVert^2}\, s, \qquad e_{noise} = \hat{s} - s_{target}, $$

where $\langle \hat{s}, s\rangle$ is the inner product of the two waveforms. The term $s_{target}$ is the component of the estimate that lies along the reference (the part we want), and $e_{noise}$ is everything orthogonal to it (distortion, interference from other stems, and artifacts). The metric is the log-ratio of their energies:

$$ \text{SI-SDR} = 10\log_{10}\frac{\lVert s_{target}\rVert^2}{\lVert e_{noise}\rVert^2}. $$

Because $s_{target}$ rescales the reference by exactly the factor $\alpha = \langle \hat{s}, s\rangle / \lVert s\rVert^2$ that best matches $\hat{s}$, multiplying $\hat{s}$ by any nonzero constant leaves the ratio unchanged: that is the scale invariance, and it is why SI-SDR is preferred over MSE whenever gain should not matter, both as a reported metric and (negated, since training minimizes) as a training loss. Music benchmarks on MUSDB18 are more often reported in plain SDR, so check which variant a paper uses before comparing numbers. Higher is better; a vocal stem in the high teens of decibels is excellent, while single digits are audibly degraded. As a micro-check, if $\hat{s} = 2s$ exactly, then $\alpha = 2$, $s_{target} = 2s = \hat{s}$, $e_{noise} = 0$, and SI-SDR diverges to $+\infty$: a doubled-but-otherwise-perfect estimate scores perfectly, exactly as scale invariance demands.
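The definition translates almost line for line into NumPy. The sketch below is a minimal illustration, not a reference implementation: the function name `si_sdr`, the small `eps` stabilizer, and the synthetic sine-plus-noise test signal are all assumptions introduced here, and the mean-removal step that some toolkits apply first is omitted because the formulas above do not include it.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB for two equal-length mono waveforms."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # alpha = <s_hat, s> / ||s||^2 : the gain that best aligns the reference to the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = alpha * reference      # component of the estimate along the reference
    e_noise = estimate - s_target     # everything orthogonal to the reference
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return float(10.0 * np.log10(ratio))

# Synthetic check: a 220 Hz tone as the "reference stem", plus noise as the estimate.
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100.0
reference = np.sin(2 * np.pi * 220.0 * t)
estimate = reference + 0.1 * rng.standard_normal(reference.shape)

score = si_sdr(estimate, reference)
score_half_gain = si_sdr(0.5 * estimate, reference)       # same shape, half the loudness
mse = float(np.mean((estimate - reference) ** 2))
mse_half_gain = float(np.mean((0.5 * estimate - reference) ** 2))

print(f"SI-SDR: {score:.2f} dB, at half gain: {score_half_gain:.2f} dB")
print(f"MSE:    {mse:.4f},    at half gain: {mse_half_gain:.4f}")
```

Halving the gain leaves the SI-SDR essentially unchanged while the MSE grows by roughly an order of magnitude, which is the failure mode of MSE described above.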

Mask-Based Separation and Band-Splitting

The Key Insight above framed separation as predicting a mask; here is the operation in symbols. Let $X \in \mathbb{C}^{F \times T}$ be the complex (or magnitude) short-time Fourier transform of the mixture, with $F$ frequency bins and $T$ time frames. For each source the model predicts a time-frequency mask $M$ of the same shape, and the estimated source spectrogram is the elementwise (Hadamard) product

$$ \hat{S} = M \odot X. $$

A binary mask sets $M_{f,t} \in \{0, 1\}$, assigning each cell entirely to one source; this is cheap but introduces musical-noise artifacts at cell boundaries where energy is shared. A soft mask lets $M_{f,t} \in [0, 1]$ (or unbounded complex values for phase-aware masks), so overlapping sources split the energy of a shared cell proportionally, which is what modern separators predict. Inverting $\hat{S}$ back to the time domain through the inverse STFT yields the stem that SI-SDR scores. The band-split idea behind BS-RoFormer refines this: rather than treating the full spectrogram uniformly, split the frequency axis into contiguous sub-bands, give each band its own projection into a shared feature space, and let transformer layers model the resulting band features along both the time and band axes before estimating the mask for each band. This matches the non-uniform resolution music demands, since a bass line lives in a few low bins while cymbals spread across the entire high end, so narrow low-frequency bands and wide high-frequency bands each get capacity proportioned to their information content.
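To make the mask operation and the band split concrete, here is a toy NumPy sketch. Everything in it is an illustrative assumption: the spectrograms are random complex arrays rather than real STFTs, the masks are "oracle" masks computed from the known sources (a trained separator would predict them from the mixture alone), and the band edges are hypothetical values chosen only to show narrow low bands and wide high bands.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 8, 5  # toy sizes: frequency bins x time frames

# Two toy complex source spectrograms and their mixture X = S1 + S2.
S1 = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
S2 = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
X = S1 + S2
mag1, mag2 = np.abs(S1), np.abs(S2)

# Soft (ratio) masks: each cell's energy is shared in proportion to source magnitude.
soft1 = mag1 / (mag1 + mag2 + 1e-12)
soft2 = 1.0 - soft1

# Binary masks: each cell goes entirely to whichever source is louder there.
bin1 = (mag1 >= mag2).astype(np.float64)
bin2 = 1.0 - bin1

# Estimated source spectrograms via the Hadamard product S_hat = M * X.
S1_soft, S2_soft = soft1 * X, soft2 * X
S1_bin, S2_bin = bin1 * X, bin2 * X

def split_bands(spec: np.ndarray, edges: list) -> list:
    """Split the frequency axis (axis 0) into contiguous sub-bands."""
    return [spec[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

# Hypothetical band edges: 1-bin bands at the bottom, a 4-bin band at the top.
edges = [0, 1, 2, 4, 8]
bands = split_bands(X, edges)
band_widths = [b.shape[0] for b in bands]
print("band widths (bins):", band_widths)
```

Because the two masks for each cell sum to one, the masked estimates add back up to the mixture in both the soft and the binary case; the difference is only in how shared cells are divided.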

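# Stem separation with Demucs v4 (Hybrid Transformer Demucs)
# pip install -U demucs
# Note: the demucs.api module may only be present when Demucs is installed from
# the GitHub repository rather than the PyPI release; check your installed version.
from pathlib import Path

import soundfile as sf
import torch
from demucs.api import Separator

# htdemucs_ft is the four-stem fine-tuned model: vocals, drums, bass, other.
separator = Separator(
    model="htdemucs_ft",
    device="cuda" if torch.cuda.is_available() else "cpu",
    shifts=2,          # test-time augmentation; trades latency for quality
    split=True,        # chunk long inputs to control memory
)

# Returns (origin, separated): the loaded mix tensor and a dict
# {"vocals": tensor, "drums": tensor, "bass": tensor, "other": tensor}
origin, separated = separator.separate_audio_file("song.wav")

# soundfile does not create missing directories, so make the output folder first.
out_dir = Path("stems")
out_dir.mkdir(parents=True, exist_ok=True)

for stem_name, tensor in separated.items():
    # tensor shape: (channels=2, samples); soundfile expects (samples, channels)
    sf.write(
        str(out_dir / f"{stem_name}.wav"),
        tensor.cpu().numpy().T,
        separator.samplerate,
    )
    print(f"  {stem_name:<8} -> stems/{stem_name}.wav  {tensor.shape[1] / separator.samplerate:.1f}s")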

Output:
  vocals   -> stems/vocals.wav  234.2s
  drums    -> stems/drums.wav  234.2s
  bass     -> stems/bass.wav  234.2s
  other    -> stems/other.wav  234.2s

Code Fragment 20.4.1: Four-stem separation with Demucs v4. The shifts=2 flag runs the model twice on randomly time-shifted copies of the input and averages the realigned outputs, which typically buys a small SDR (Signal-to-Distortion Ratio, the standard metric for source-separation quality; higher is better) improvement, on the order of a few tenths of a dB, at the cost of 2x latency. For most offline production workflows this is worth it.
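The shift trick in the caption is independent of any particular separator, so it can be sketched with a stand-in model. The code below is an illustration of the general idea under stated assumptions, not Demucs's own implementation: `apply_with_shifts`, `toy_model`, and the `max_shift` default are hypothetical names and values, and the "model" is a trivial function that returns two scaled copies of its input as fake stems.

```python
import numpy as np

def apply_with_shifts(model, mix, shifts=2, max_shift=2205, seed=0):
    """Average model outputs over randomly time-shifted copies of `mix`.

    `model` maps an array of shape (channels, n) to (sources, channels, n).
    """
    rng = np.random.default_rng(seed)
    length = mix.shape[-1]
    # Zero-pad both ends so every shifted window stays inside the array.
    padded = np.pad(mix, ((0, 0), (max_shift, max_shift)))
    total = 0.0
    for _ in range(shifts):
        offset = int(rng.integers(0, max_shift + 1))
        # Window that starts `max_shift - offset` samples before the real audio.
        shifted = padded[:, offset:max_shift + length]
        out = model(shifted)
        # Undo the shift so all predictions line up with the original mix.
        start = max_shift - offset
        total = total + out[..., start:start + length]
    return total / shifts

def toy_model(x):
    """Stand-in separator: two fake 'stems' that are scaled copies of the input."""
    return np.stack([0.7 * x, 0.3 * x])

mix = np.random.default_rng(1).standard_normal((2, 44100))  # 1 s of stereo noise
stems = apply_with_shifts(toy_model, mix, shifts=2)
print(stems.shape)
```

With a perfectly shift-invariant stand-in like this one, averaging changes nothing. The benefit with a real network comes from the fact that its output varies slightly with input alignment, and averaging smooths out that variation.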

20.4.2 Audio Inpainting and Repair

Audio inpainting fills in missing or corrupted segments of a recording. Use cases include click and pop removal (vinyl transfers, old archive recordings), dropout repair (cellular call recordings), removing unwanted spoken words from podcast tracks ("um", profanity, name mentions), and reconstructing damaged historical recordings. Classical DSP handles short gaps (sub-50 ms) acceptably; longer gaps require generative models that fill in plausible content given the surrounding context.

The 2024 state of the art uses one of two recipes. The first is masked codec-LM infilling: encode the audio with EnCodec or DAC, mask the codec tokens covering the corrupted region, and use a bidirectional (BERT-style) or autoregressive transformer to fill them in. AudioLM-derived models and recent VALL-E variants use this pattern. The second is diffusion-based inpainting: the audio is processed through a diffusion model whose conditioning includes the surrounding clean context, and the model denoises the corrupted region back to a plausible reconstruction. Stable Audio Open supports this, as does Adobe's "Generative Audio Fill" feature in Premiere Pro (introduced 2024).
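The first recipe starts by deciding which codec frames to hide. The sketch below shows only that bookkeeping step on synthetic token arrays; it does not call EnCodec or any real model. The names `mask_gap`, `infill`, and `MASK_ID`, the 75 frames-per-second rate, the four codebooks, and the zero-filling `predict_fn` are all assumptions for illustration, with `predict_fn` standing in for whatever transformer would actually predict the missing tokens.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel marking "token to be predicted"

def mask_gap(codes, gap_start_s, gap_end_s, frame_rate_hz):
    """Mask every codec frame that overlaps the corrupted time span.

    `codes` has shape (n_codebooks, n_frames). Returns (masked_codes, frame_mask).
    """
    n_frames = codes.shape[1]
    first = max(int(np.floor(gap_start_s * frame_rate_hz)), 0)
    last = min(int(np.ceil(gap_end_s * frame_rate_hz)), n_frames)
    frame_mask = np.zeros(n_frames, dtype=bool)
    frame_mask[first:last] = True
    masked = codes.copy()
    masked[:, frame_mask] = MASK_ID
    return masked, frame_mask

def infill(masked_codes, frame_mask, predict_fn):
    """Write the predictor's tokens into the masked frames only."""
    filled = masked_codes.copy()
    filled[:, frame_mask] = predict_fn(masked_codes, frame_mask)
    return filled

# Synthetic tokens: 4 codebooks, 2 s of audio at an assumed 75 frames per second.
rng = np.random.default_rng(0)
codes = rng.integers(0, 1024, size=(4, 150))
masked, frame_mask = mask_gap(codes, gap_start_s=0.5, gap_end_s=1.0, frame_rate_hz=75.0)

# Placeholder predictor: a real system would run a transformer over `masked` here.
predict_fn = lambda c, m: np.zeros((c.shape[0], int(m.sum())), dtype=c.dtype)
filled = infill(masked, frame_mask, predict_fn)
print(int(frame_mask.sum()), "frames masked")
```

The design point worth noticing is that the clean context is never touched: only frames under the mask are replaced, so decoding the filled tokens reproduces the original audio outside the gap up to codec quality.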

The qualitative line between repair and creative generation is fuzzy. Filling in a 100 ms click is "repair"; filling in a 5-second gap with newly generated music in the same style as the surrounding mix is "extension"; the underlying model is the same. The product framing depends on the gap length and the user's intent.

20.4.3 Style Transfer and Timbre Conversion

Audio style transfer takes a source audio (the "content") and reshapes it to match a target reference (the "style") while preserving the source's musical content. The taxonomy splits into instrument timbre transfer (drums in the style of these other drums; piano sounding like organ), voice timbre transfer (the source's words and prosody, with a different vocal character; Section 20.2 covered the cloning side), and ambient style transfer (a dry recording reshaped to sound like it was recorded in a cathedral, or through a guitar amp).

The 2024 approach that has displaced earlier WaveNet-and-CycleGAN methods is the same one driving image style transfer: a diffusion model conditioned on a style reference embedding. The model encodes the style reference with a pretrained audio embedding model (CLAP, AudioCLIP, or a music-specific equivalent), denoises the source audio guided by both the content reference and the style embedding, and outputs a hybrid that combines both. Stable Audio 2.5's "style transfer" mode and Suno's V5 "remix in genre" feature are productionized versions of this.

Real-World Scenario: Podcast Vocal Repair Pipeline

A podcast production company processing 30 hours of weekly content runs every track through a four-stage neural pipeline before editing: (1) Demucs separates background music from speech if the host accidentally recorded over playing music, (2) iZotope RX 11 (with neural Voice De-noise, Mouth De-click, and De-reverb modules) cleans the speech stem, (3) an internal inpainting model fills in dropouts longer than 50 ms with generated audio matching the surrounding voice, (4) a Cartesia-based style transfer module matches all speakers to a consistent loudness and tonal balance. The end-to-end pipeline costs about $0.10 per hour of audio in compute and replaces what used to be two full-time human audio engineers. The human team now focuses on creative editing rather than cleanup.

20.4.4 Pitch, Time, and Classical DSP with Neural Replacements

The classical audio DSP toolbox (pitch shift, time stretch, equalization, compression, reverb, denoising) is being progressively replaced by neural equivalents that produce subjectively better results, especially at extreme settings. Neural pitch shift (e.g., Audio Toolkit's NeuralPitch, Soundtoys Little AlterBoy) preserves formants and timbre across multi-octave shifts in a way that classical phase vocoders cannot. Neural time stretch (Antares ATR, iZotope Radius) avoids the metallic artifacts of phase-vocoder stretches at factors of 4x or more. Neural denoising (NVIDIA RTX Voice, Krisp, iZotope RX 11 Voice De-noise) removes background hum, hiss, traffic, and room reverb with quality and CPU efficiency that classical Wiener filtering cannot approach.

The training pattern is the same in all cases: collect paired data (clean source, processed target), train a U-Net or small transformer in the time-frequency domain, and ship it as a real-time-capable processor. The interesting twist for 2025-2026 is that these models are getting small enough to ship as DAW plugins that run locally on a CPU; iZotope RX 11 ships with several models in the million-parameter range that process in real time on an M2 MacBook.

20.4.5 The Remix Stack: From Stems to Published Track

A common 2026 production workflow takes an existing song, separates it into stems, applies neural processing to one or more stems, and remixes the result. The use cases include: a DJ extracting an acapella for a custom backing track, a producer replacing the drums in a sample with newly programmed drums, a film editor isolating dialogue from a TV broadcast for use in a documentary, or a karaoke service generating instrumentals from songs that never released official karaoke versions.

The legal status of these workflows varies in its details by jurisdiction, but the broad outline is shared. The US considers extracting stems from a copyrighted recording and republishing them a derivative work that requires a master-recording license; the same is true in the UK and EU. Pure personal use is generally tolerated; commercial republication is not. The 2024 rise of consumer-grade stem separation (Moises, LALAL.AI, Spotify's release of stems from a small catalog of licensed tracks) is forcing the industry to develop new licensing models for stem-level distribution, of which Suno V5 and Spotify Stem Library (announced 2025) are early examples.

Task	Open Model	Commercial Plugin	Real-Time?
4-stem separation	Demucs v4 (htdemucs_ft)	iZotope RX 11 Music Rebalance	Batch (~30 s for 4 min song)
Vocal separation only	BS-RoFormer (UVR)	LALAL.AI, Moises	Batch
Voice de-noising	RNNoise (legacy), DeepFilterNet 3	NVIDIA RTX Voice, Krisp	Yes (sub-10 ms)
De-reverb	DTLN-aec variants	Adobe Enhance Speech, iZotope De-reverb	Batch or near-real-time
Pitch / time stretch	NN-pitch, RealTimeSTT	Antares ATR, Melodyne	Yes (offline higher quality)
Click / pop removal	RX Spectral Repair (heuristic)	iZotope RX 11 Mouth De-click	Real-time
Audio inpainting (long)	Stable Audio Open Inpaint	Adobe Generative Audio Fill	Batch
Style transfer	Stable Audio Style	Suno V5 Remix mode	Batch

Figure 20.4.1a: 2026 audio editing matrix. Real-time-capable models (denoising, pitch, click removal) ship as DAW plugins; batch models (separation, inpainting, style transfer) typically run server-side or as offline DAW commands.

Warning: Separation residuals do not vanish

Even the best 2025 stem separation leaves audible artifacts in the residual stems. When you isolate vocals and listen to the "other" stem, you will hear faint vocal bleed; when you isolate drums, you will hear pitched-content artifacts in the cymbals. For consumer-facing remix products this is usually fine. For sample-licensing workflows (selling extracted stems as standalone material), the residual artifacts are a brand-quality problem and typically require additional human polishing. Always preview the residuals, not just the foreground stems, before shipping anything separation-derived to production.

Research Frontier: The Emerging LLM-Driven Editing Interface

The 2025 trend in audio editing is conversational interfaces on top of the underlying models. Instead of dragging sliders in a DAW, the user types "remove the second 'um' at 1:23 and clean up the breath noise before the chorus" and a multimodal agent (an LLM with audio tool access) finds the events, applies the appropriate processors, and renders. Adobe's Project Music GenAI Control (announced 2024) and Descript's Underlord (2024-2025) are the consumer-facing productionizations of this pattern. The agent is doing standard tool-calling (Section 26): it has access to a stem-separation tool, a click-removal tool, a noise-reduction tool, and a re-mix tool, and the LLM decides which to invoke in what order given the natural-language request.
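The tool-calling loop itself is simple once the plan exists. The sketch below is a hypothetical illustration of the dispatch pattern only: the four tool functions, the `TOOLS` registry, the plan format, and `run_plan` are invented names, the tools merely log what they would do, and the plan is hard-coded where a real agent would have the LLM emit it from the user's request.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Hypothetical tools: each takes the session state plus keyword arguments,
# records what it would do, and returns the updated state.
def separate_stems(state: State) -> State:
    state["log"].append("separate_stems")
    return state

def remove_clicks(state: State, stem: str) -> State:
    state["log"].append(f"remove_clicks({stem})")
    return state

def reduce_noise(state: State, stem: str, strength: float) -> State:
    state["log"].append(f"reduce_noise({stem}, {strength})")
    return state

def remix(state: State) -> State:
    state["log"].append("remix")
    return state

TOOLS: Dict[str, Callable[..., State]] = {
    "separate_stems": separate_stems,
    "remove_clicks": remove_clicks,
    "reduce_noise": reduce_noise,
    "remix": remix,
}

def run_plan(plan: List[Dict[str, Any]]) -> State:
    """Execute tool calls in order, rejecting any tool the agent was not given."""
    state: State = {"log": []}
    for step in plan:
        name = step["tool"]
        if name not in TOOLS:
            raise ValueError(f"unknown tool: {name}")
        state = TOOLS[name](state, **step.get("args", {}))
    return state

# A plan of the kind an LLM might emit for "clean up the vocal and re-render".
plan = [
    {"tool": "separate_stems"},
    {"tool": "remove_clicks", "args": {"stem": "vocals"}},
    {"tool": "reduce_noise", "args": {"stem": "vocals", "strength": 0.5}},
    {"tool": "remix"},
]
result = run_plan(plan)
print(result["log"])
```

Validating tool names before dispatch is the one design choice worth keeping in any real version, since the plan comes from a model and may name tools that do not exist.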

The interface shift matters because audio editing has historically had a steep learning curve. A typical DAW exposes 50+ different processors with hundreds of parameters; learning when to use which is a multi-year apprenticeship. Conversational interfaces collapse the parameter surface to a natural language layer, letting non-engineers achieve results that previously required studio expertise. This is the audio equivalent of the no-code revolution and will reshape the audio professional's job description over the next three to five years.

Fun Fact: The "uh" problem

Removing speech disfluencies ("um", "uh", "you know") from podcast audio used to be a manual chore: scrub the timeline, locate each one by ear, select it, and delete. A 2025 model from Descript automates this end-to-end: it transcribes the audio with Whisper, classifies each word as content or disfluency using a small lexical model, deletes the disfluency segments, and uses an inpainting model to smooth the resulting cuts so listeners cannot hear the edit. The default setting removes about 70% of disfluencies; the aggressive setting can make a rambling 90-minute recording sound like a polished TED Talk. Whether that is good for the medium is debatable, but the technology works.
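The cut-and-smooth stage of such a pipeline can be sketched without any model at all. The code below is an assumption-laden illustration, not Descript's method: the word list with timestamps is hand-written where a transcriber would supply it, the filler lexicon is a two-word set, and a short linear crossfade is swapped in for the inpainting model described above. `remove_fillers` and `FILLERS` are hypothetical names.

```python
import numpy as np

FILLERS = {"um", "uh"}  # hypothetical minimal disfluency lexicon

def remove_fillers(audio, sr, words, fade_ms=10.0):
    """Delete filler words and join the remaining audio with short crossfades.

    `words` is a list of (text, start_s, end_s) tuples in time order.
    """
    # 1. Collect the sample ranges to keep (everything that is not a filler).
    keep, cursor = [], 0
    for text, start_s, end_s in words:
        if text.lower().strip(".,!?") in FILLERS:
            start, end = int(round(start_s * sr)), int(round(end_s * sr))
            if start > cursor:
                keep.append((cursor, start))
            cursor = max(cursor, end)
    if cursor < len(audio):
        keep.append((cursor, len(audio)))
    if not keep:
        return np.zeros(0, dtype=np.float64)

    # 2. Concatenate the kept segments, overlapping each join by `fade_ms`.
    fade = int(sr * fade_ms / 1000.0)
    out = audio[keep[0][0]:keep[0][1]].astype(np.float64)
    for start, end in keep[1:]:
        seg = audio[start:end].astype(np.float64)
        n = min(fade, len(out), len(seg))
        if n > 0:
            ramp = np.linspace(0.0, 1.0, n)
            out[-n:] = out[-n:] * (1.0 - ramp) + seg[:n] * ramp  # linear crossfade
            seg = seg[n:]
        out = np.concatenate([out, seg])
    return out

sr = 1000                       # toy sample rate to keep the numbers readable
audio = np.ones(3 * sr)         # 3 s of constant signal
words = [("so", 0.0, 0.5), ("um", 0.5, 1.0), ("hello", 1.0, 2.0),
         ("uh", 2.0, 2.5), ("world", 2.5, 3.0)]
cleaned = remove_fillers(audio, sr, words)
print(len(cleaned) / sr, "seconds after cleanup")
```

A crossfade hides the discontinuity at each cut but cannot regenerate room tone or breath that was deleted along with the filler, which is the gap a generative inpainting model is meant to close.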

Key Insight

Audio editing in 2026 is a small set of neural models (Demucs and BS-RoFormer for separation, DeepFilterNet and Voice De-noise for cleanup, Stable Audio Inpaint and Generative Audio Fill for repair, Stable Audio Style and Suno Remix for transfer) wrapped in conversational interfaces. Most production workflows chain three to five of these models before any human editing. The underlying machinery is the same masking, codec-LM, and diffusion technology that drives generation; the difference is the loss function (reconstruction or transfer rather than next-token) and the I/O contract (audio in plus audio out, not text in plus audio out).

Self-Check

Q1: Demucs v4 uses a hybrid time-domain plus spectrogram U-Net. Sketch why this is more robust than a pure spectrogram model (such as Open-Unmix or Spleeter; the original Demucs was itself a pure waveform model) for separating drums, and explain what time-domain information is encoded that the spectrogram loses.

Show Answer

Drums are dominated by sharp transients (kick strike, snare crack, hi-hat tick) whose energy is concentrated in just a few milliseconds. A spectrogram with the usual 2048-sample FFT smears each transient across several frequency bins and a 40-50 ms window, so the model has to reconstruct the original click shape from a blurred image, and phase information is discarded by the magnitude STFT. The time-domain branch operates on raw samples, retains the exact onset shape and the phase relationships across stereo channels, and is fused with the spectrogram branch inside the same hybrid network. The fusion lets the model use spectrogram cues for tonal sources (vocals, sustained instruments) and time-domain cues for percussive transients, which is why waveform-aware models hold up best on drums and bass.
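The smearing figure in this answer is plain arithmetic, worked through below. The 44.1 kHz sample rate and 2048-sample FFT come from the text; the hop size of 512 samples and the impulse position are assumptions chosen for illustration.

```python
sr = 44100          # sample rate in Hz
n_fft = 2048        # FFT / analysis window length in samples
hop = 512           # assumed hop size (n_fft / 4)
impulse_at = 10000  # assumed sample index of a single-sample click

window_ms = 1000.0 * n_fft / sr   # duration of one analysis window
bin_width_hz = sr / n_fft         # frequency resolution of one bin

# Frames in one second of audio (no padding), and those whose window contains the click.
n_frames = 1 + (sr - n_fft) // hop
frames_hit = [m for m in range(n_frames) if m * hop <= impulse_at < m * hop + n_fft]

# Time span covered by all windows that contain the click.
smear_samples = (frames_hit[-1] * hop + n_fft) - frames_hit[0] * hop
smear_ms = 1000.0 * smear_samples / sr

print(f"window: {window_ms:.1f} ms, bin width: {bin_width_hz:.1f} Hz")
print(f"click appears in frames {frames_hit}, spanning {smear_ms:.1f} ms")
```

A click that lasts one sample therefore shows up in four consecutive frames under these assumptions, each of which also contains tens of milliseconds of unrelated audio.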

Q2: The masked codec-LM inpainting recipe (Section 20.4.2) is structurally identical to BERT's masked language modeling. Explain how you would adapt a pretrained MusicGen checkpoint into an inpainting model with minimal retraining.

Show Answer

MusicGen is already a codec-LM over EnCodec tokens conditioned on text, but its training objective is left-to-right next-token prediction, not masked infilling. To adapt it with minimal retraining, switch the attention mask from causal to bidirectional, sample masked spans of contiguous codec tokens during fine-tuning (BART- or T5-style span masking is the natural fit), and add a sentinel token so the model knows where to fill. Keep the text-conditioning path intact so the user can describe the desired fill in natural language. Roughly 5-10% of the original pretraining compute on a music dataset is usually enough to convert a generation checkpoint into a competent inpainter, because the codec representation, the text encoder, and most of the transformer weights transfer; only the attention pattern and the loss change materially.
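Why the attention mask has to change is easy to see on a toy sequence. The sketch below is illustrative only: the sequence length of 8 frames and the gap at positions 3-4 are arbitrary assumptions, and the boolean matrices stand in for the attention masks a real transformer would use.

```python
import numpy as np

T = 8                                  # toy sequence length (codec frames)
gap = slice(3, 5)                      # frames 3 and 4 are the region to fill

clean = np.ones(T, dtype=bool)
clean[gap] = False                     # True where real (unmasked) context exists

# Row i of each matrix marks which positions query i is allowed to attend to.
causal = np.tril(np.ones((T, T), dtype=bool))     # positions <= i only
bidirectional = np.ones((T, T), dtype=bool)       # every position

query = 3                              # first frame inside the gap
visible_causal = np.flatnonzero(causal[query] & clean).tolist()
visible_bidirectional = np.flatnonzero(bidirectional[query] & clean).tolist()

print("causal context:       ", visible_causal)
print("bidirectional context:", visible_bidirectional)
```

Under the causal mask the first gap frame sees only the audio before the gap, so the fill cannot be steered toward what comes after it; the bidirectional mask exposes both sides, which is what lets an infilling model land cleanly on the right-hand context.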

Q3: The legal status of stem separation differs from the legal status of source-to-source style transfer. Identify the underlying copyright doctrine that produces this difference, and predict how the 2025 Spotify Stem Library is likely to evolve under it.

Show Answer

The doctrine is the distinction between a derivative work and a transformative work. Stem separation produces a derivative work: each stem is a recognizable, isolated portion of the original master recording, and US, UK, and EU law require a master-use license to republish such portions commercially. Style transfer produces output that is sufficiently transformed (different content, different waveform, different copyrightable elements) that it usually does not infringe the source master, provided no audible verbatim copy survives. The Spotify Stem Library is therefore likely to evolve as a licensed-by-default channel: labels opt their masters in, Spotify negotiates blanket stem-license terms with the relevant rightsholders, and consumer remixers pay a share that flows back to the original master and composition owners. Style-transfer-driven products will continue to ship without per-track licensing because they fall on the transformative side of the line, though the training-data question (Section 20.3) remains open.

What Comes Next

Section 20.5: Speech Recognition for the Multimodal Stack closes the audio chapter with the modality that everything else depends on: speech recognition. Whisper, faster-whisper, and the commercial ASR APIs (AssemblyAI, Deepgram, Speechmatics) are the eyes (or rather, ears) of every multimodal agent.

Further Reading

Rouard, S., Massa, F., and Defossez, A. (2022). Hybrid Transformers for Music Source Separation (Demucs v4). arXiv:2211.08553. The Demucs v4 / htdemucs paper; current open-source default for music source separation.
Lu, W. et al. (2024). Music Source Separation with Band-Split RoPE Transformer (BS-RoFormer). ICASSP 2024. The model behind the leading scores on MUSDB18 in 2024.
Schröter, H. et al. (2023). DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement. Interspeech 2023. The DeepFilterNet 3 paper; the current open-source state of the art in real-time speech denoising.
Adobe (2024). Project Music GenAI Control: Conversational Music Generation and Editing. research.adobe.com. The Adobe Research preview that productionized the conversational-editing interface pattern.
iZotope (2024). RX 11 Audio Repair Suite Technical Overview. izotope.com/rx. The reference commercial audio-repair plugin; documents which neural models ship inside.
Descript (2025). Underlord: AI Audio Editing Agent. descript.com/underlord. The LLM-driven editing agent reference; shows how tool-calling drives a DAW-replacement product.
Mitsufuji, Y. et al. (2024). Sound Source Separation Survey: From Independent Component Analysis to Diffusion Models. APSIPA Transactions on Signal and Information Processing. A 2024 survey covering the field's progression from classical DSP to diffusion.
Stability AI (2024). Stable Audio Open: Audio Inpainting and Style Transfer Documentation. stability.ai. Open-weights reference for diffusion-based audio inpainting and style transfer.