The DFT and FFT

Section G.2

Once a signal has been sampled and split into frames (Section G.1), each frame is converted to a frequency-domain representation by the discrete Fourier transform (DFT). The DFT is, in a precise sense, a change of basis: it expresses the $N$-sample frame as a sum of $N$ complex exponentials, each oscillating at a fixed rate. Because complex exponentials are the eigenfunctions of every linear time-invariant operator, the DFT is the canonical representation in which filtering, convolution, and resonance all become pointwise multiplication.

Definition

For a length-$N$ sequence $x[0], x[1], \ldots, x[N-1]$, the DFT produces $N$ complex coefficients

$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2 \pi k n / N}, \qquad k = 0, 1, \ldots, N-1 .$$

The inverse DFT recovers the time samples by reversing the sign in the exponent and dividing by $N$:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, e^{+j 2 \pi k n / N} .$$

(Different conventions split the $1/N$ between forward and inverse transforms; only the product matters, and many DSP libraries use the convention shown here, while NumPy's np.fft puts the entire factor on the inverse.) The coefficient $X[0]$ is the DC component, equal to the sum of the samples; for a zero-mean audio frame it is close to zero. For a real-valued signal, $X[N - k] = \overline{X[k]}$, so the upper half of the spectrum is redundant and only the first $N/2 + 1$ bins carry independent information.

What Each Bin Means

Bin index $k$ corresponds to the physical frequency

$$f_k = \frac{k \, f_s}{N} \quad \text{hertz}, \qquad k = 0, 1, \ldots, N/2 .$$

The frequency resolution is therefore $\Delta f = f_s / N$ hertz per bin, and the highest representable frequency is the Nyquist limit $f_{N/2} = f_s / 2$ (Section G.1). At $f_s = 16{,}000$ Hz and $N = 400$, the bin spacing is 40 Hz; padding to $N_{\text{FFT}} = 512$ before transforming gives a finer 31.25 Hz grid, but the underlying resolvable spacing is still set by the unpadded frame length. Figure G.2.1 lays the padded grid out explicitly: 257 unique bins for a 512-point FFT, spaced 31.25 Hz apart, stopping at the Nyquist limit of 8 kHz.

Chart showing the FFT bin grid for N equals 512 at 16 kHz: a row of vertical lines marking 257 unique frequency bins spaced 31.25 Hz apart, with selected bins (0, 1, 2, 8, 32, 64, 128, 192, 256) highlighted and annotated, and a dashed line marking the Nyquist frequency at 8000 Hz.
Figure G.2.1: FFT bin grid for $N = 512$ at $f_s = 16$ kHz. Each bin sits at $f_k = k \cdot f_s / N$, giving a uniform spacing of $\Delta f = 31.25$ Hz from DC through the Nyquist limit $f_s / 2 = 8000$ Hz.

Frequency resolution and time resolution trade off directly. Longer frames yield finer frequency bins but blur temporal events; shorter frames preserve onsets and transients but smear spectral peaks. This is a hard limit, sometimes called the time-frequency uncertainty principle, and it is the reason any one spectrogram is a compromise.

Practical Example: one DFT bin, by hand

Take the tiny length-4 signal $x = [1, 0, -1, 0]$, sampled at some rate $f_s$. The bin-$k$ DFT coefficient is $X[k] = \sum_{n=0}^{3} x[n] \, e^{-j 2 \pi k n / 4}$. Bin 1 picks out the component oscillating once per frame, that is, at $f_1 = f_s / 4$:

$$X[1] = 1 \cdot e^{0} + 0 \cdot e^{-j \pi / 2} + (-1) \cdot e^{-j \pi} + 0 \cdot e^{-j 3 \pi / 2} = 1 - (-1) = 2 .$$

So $|X[1]| = 2$, the largest of any bin, which correctly reports that the signal is a pure cosine at frequency $f_s / 4$ (one full period in four samples). The other bins evaluate to $X[0] = 0$ (no DC), $X[2] = 0$ (no Nyquist tone), and $X[3] = \overline{X[1]} = 2$ (the mirror of bin 1, redundant for a real signal). Every bin in a 512-point FFT is computed by the same formula; only the bookkeeping is heavier.

Why the FFT Is $O(N \log N)$, Not $O(N^2)$

Computing the DFT directly from its definition costs $N$ multiplications per bin times $N$ bins, that is, $O(N^2)$. The fast Fourier transform (FFT), in its Cooley-Tukey form, exploits the fact that an $N$-point DFT can be expressed as two interleaved $N/2$-point DFTs (one over the even-indexed samples and one over the odd-indexed samples) plus $N$ "twiddle" multiplications and additions. Recursing on this decomposition gives the recurrence $T(N) = 2 T(N/2) + O(N)$, which solves to $T(N) = O(N \log N)$. For a 512-point frame this is about a 50x speedup over the direct sum; for the kind of $N = 2^{14}$ frames used in music analysis it is more than a thousand-fold. The FFT requires $N$ to be highly composite (a power of two is the easiest case), which is why audio pipelines zero-pad to the next power of two before transforming.

The Short-Time Fourier Transform

Applying a DFT independently to each windowed, hopped frame of a long signal gives the short-time Fourier transform (STFT):

$$X[m, k] = \sum_{n=0}^{N-1} w[n] \, x[m H + n] \, e^{-j 2 \pi k n / N} ,$$

where $m$ indexes the frame, $H$ is the hop length, and $w[n]$ is the analysis window from Section G.1. The output $X[m, k]$ is a 2D complex array of shape $(\text{num\_frames}, N/2 + 1)$ called a spectrogram when its magnitude is plotted on a time-frequency grid. The STFT is the workhorse representation behind nearly every audio model in the book.

Magnitude vs. Phase: Why Audio ML Drops Phase

Each STFT coefficient $X[m, k] = |X[m, k]| \, e^{j \angle X[m, k]}$ decomposes into a non-negative magnitude and a phase angle. For perception and for most discriminative audio tasks (speech recognition, speaker identification, music tagging), magnitude carries almost all the relevant information; phase information is heavily redundant with magnitude in natural signals and is dominated by uninformative shifts. The standard input to discriminative audio models is therefore $|X[m, k]|^2$ (the power spectrogram) or its mel-warped, log-compressed cousin (Section G.3). Phase becomes essential again only when the model must synthesize audio, where dedicated vocoders (Griffin-Lim, HiFi-GAN, EnCodec) are responsible for reconstructing it from a magnitude representation.

See Also

The STFT defined here is the immediate input to the mel filter bank covered in Section G.3, and the resulting mel spectrogram is what Whisper, AudioLM, and the other models in Section 20.5 actually consume. Phase reconstruction comes back in the neural-vocoder discussion of Section 20.0.1 (Audio Data).

Key Insight

The DFT is a change of basis from $N$ time samples to $N$ complex sinusoid amplitudes at frequencies $f_k = k f_s / N$, the FFT computes it in $O(N \log N)$ when $N$ is a power of two, and the STFT applies one DFT per windowed frame to produce the time-frequency tensor every audio model in this book ingests. Audio ML keeps only the magnitude and lets a vocoder reconstruct phase when synthesis is needed.

Exercise G.2.1: Two-Tone STFT and the Frequency-Resolution Trade-Off

Objective. Build intuition for the time-frequency uncertainty principle by computing the STFT of a known signal.

Task. Synthesize a 2-second signal at $f_s = 16{,}000$ that contains two summed sinusoids at 440 Hz and 460 Hz. Compute librosa.stft(y, n_fft=N, hop_length=N//4) for $N \in \{256, 1024, 4096\}$. For each, plot the log-magnitude spectrogram and report whether the two tones are resolvable in the frequency axis. Compute $\Delta f = f_s / N$ for each and tie the visual result back to the formula.

Stretch. Add a single 50 ms click at $t = 1.0\ \text{s}$. Plot all three spectrograms again and note how time localization of the click degrades as $N$ grows. This is the time-frequency uncertainty principle in action.

Further Reading

DFT and FFT

Cooley, J. W., & Tukey, J. W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation 19(90), 297-301. doi:10.1090/S0025-5718-1965-0178586-1. The original FFT paper; the divide-and-conquer recursion behind every torch.fft.rfft call.
Frigo, M., & Johnson, S. G. (2005). "The Design and Implementation of FFTW3." Proceedings of the IEEE 93(2), 216-231. fftw.org. The reference C library that underlies most production FFT pipelines, including NumPy and SciPy.
Allen, J. B. (1977). "Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 25(3), 235-238. doi:10.1109/TASSP.1977.1162950. The classic paper that formalizes the STFT with overlap-add as a time-frequency representation.

Modern Implementations

PyTorch contributors. (2024). "torch.fft and torch.stft documentation." pytorch.org/docs/stable/fft.html. Official documentation for the FFT and STFT primitives used in the running PyTorch examples.
Griffin, D., & Lim, J. (1984). "Signal Estimation from Modified Short-Time Fourier Transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), 236-243. doi:10.1109/TASSP.1984.1164317. The standard phase-reconstruction algorithm referenced in the discussion of magnitude-only spectrograms and neural vocoders.