Section G.2: The DFT and FFT | Building Language AI

Once a signal has been sampled and split into frames (Section G.1), each frame is converted to a frequency-domain representation by the discrete Fourier transform (DFT). The DFT is, in a precise sense, a change of basis: it expresses the $N$-sample frame as a sum of $N$ complex exponentials, each oscillating at a fixed rate. Because complex exponentials are the eigenfunctions of every linear time-invariant operator, the DFT is the canonical representation in which filtering, convolution, and resonance all become pointwise multiplication.

Definition

For a length-$N$ sequence $x[0], x[1], \ldots, x[N-1]$, the DFT produces $N$ complex coefficients

X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2 \pi k n / N}, \qquad k = 0, 1, \ldots, N-1 .

The inverse DFT recovers the time samples by reversing the sign in the exponent and dividing by $N$:

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, e^{+j 2 \pi k n / N} .

(Different conventions split the $1/N$ between forward and inverse transforms; only the product matters, and many DSP libraries use the convention shown here, while NumPy's np.fft puts the entire factor on the inverse.) The coefficient $X[0]$ is the DC component, equal to the sum of the samples; for a zero-mean audio frame it is close to zero. For a real-valued signal, $X[N - k] = \overline{X[k]}$, so the upper half of the spectrum is redundant and only the first $N/2 + 1$ bins carry independent information.

What Each Bin Means

Bin index $k$ corresponds to the physical frequency

f_k = \frac{k \, f_s}{N} \quad \text{hertz}, \qquad k = 0, 1, \ldots, N/2 .

The frequency resolution is therefore $\Delta f = f_s / N$ hertz per bin, and the highest representable frequency is the Nyquist limit $f_{N/2} = f_s / 2$ (Section G.1). At $f_s = 16{,}000$ Hz and $N = 400$, the bin spacing is 40 Hz; padding to $N_{\text{FFT}} = 512$ before transforming gives a finer 31.25 Hz grid, but the underlying resolvable spacing is still set by the unpadded frame length. Figure G.2.1 lays the padded grid out explicitly: 257 unique bins for a 512-point FFT, spaced 31.25 Hz apart, stopping at the Nyquist limit of 8 kHz.

Chart showing the FFT bin grid for N equals 512 at 16 kHz: a row of vertical lines marking 257 unique frequency bins spaced 31.25 Hz apart, with selected bins (0, 1, 2, 8, 32, 64, 128, 192, 256) highlighted and annotated, and a dashed line marking the Nyquist frequency at 8000 Hz.

Figure G.2.1: FFT bin grid for $N = 512$ at $f_s = 16$ kHz. Each bin sits at $f_k = k \cdot f_s / N$, giving a uniform spacing of $\Delta f = 31.25$ Hz from DC through the Nyquist limit $f_s / 2 = 8000$ Hz.

Frequency resolution and time resolution trade off directly. Longer frames yield finer frequency bins but blur temporal events; shorter frames preserve onsets and transients but smear spectral peaks. This is a hard limit, sometimes called the time-frequency uncertainty principle, and it is the reason any one spectrogram is a compromise.

Practical Example: one DFT bin, by hand

Take the tiny length-4 signal $x = [1, 0, -1, 0]$, sampled at some rate $f_s$. The bin-$k$ DFT coefficient is $X[k] = \sum_{n=0}^{3} x[n] \, e^{-j 2 \pi k n / 4}$. Bin 1 picks out the component oscillating once per frame, that is, at $f_1 = f_s / 4$:

X[1] = 1 \cdot e^{0} + 0 \cdot e^{-j \pi / 2} + (-1) \cdot e^{-j \pi} + 0 \cdot e^{-j 3 \pi / 2} = 1 - (-1) = 2 .

So $|X[1]| = 2$, the largest of any bin, which correctly reports that the signal is a pure cosine at frequency $f_s / 4$ (one full period in four samples). The other bins evaluate to $X[0] = 0$ (no DC), $X[2] = 0$ (no Nyquist tone), and $X[3] = \overline{X[1]} = 2$ (the mirror of bin 1, redundant for a real signal). Every bin in a 512-point FFT is computed by the same formula; only the bookkeeping is heavier.

Why the FFT Is $O(N \log N)$, Not $O(N^2)$

Computing the DFT directly from its definition costs $N$ multiplications per bin times $N$ bins, that is, $O(N^2)$. The fast Fourier transform (FFT), in its Cooley-Tukey form, exploits the fact that an $N$-point DFT can be expressed as two interleaved $N/2$-point DFTs (one over the even-indexed samples and one over the odd-indexed samples) plus $N$ "twiddle" multiplications and additions. Recursing on this decomposition gives the recurrence $T(N) = 2 T(N/2) + O(N)$, which solves to $T(N) = O(N \log N)$. For a 512-point frame this is about a 50x speedup over the direct sum; for the kind of $N = 2^{14}$ frames used in music analysis it is more than a thousand-fold. The FFT requires $N$ to be highly composite (a power of two is the easiest case), which is why audio pipelines zero-pad to the next power of two before transforming.

The Short-Time Fourier Transform

Applying a DFT independently to each windowed, hopped frame of a long signal gives the short-time Fourier transform (STFT):

X[m, k] = \sum_{n=0}^{N-1} w[n] \, x[m H + n] \, e^{-j 2 \pi k n / N} ,

where $m$ indexes the frame, $H$ is the hop length, and $w[n]$ is the analysis window from Section G.1. The output $X[m, k]$ is a 2D complex array of shape $(\text{num\_frames}, N/2 + 1)$ called a spectrogram when its magnitude is plotted on a time-frequency grid. The STFT is the workhorse representation behind nearly every audio model in the book.

Magnitude vs. Phase: Why Audio ML Drops Phase

Each STFT coefficient $X[m, k] = |X[m, k]| \, e^{j \angle X[m, k]}$ decomposes into a non-negative magnitude and a phase angle. For perception and for most discriminative audio tasks (speech recognition, speaker identification, music tagging), magnitude carries almost all the relevant information; phase information is heavily redundant with magnitude in natural signals and is dominated by uninformative shifts. The standard input to discriminative audio models is therefore $|X[m, k]|^2$ (the power spectrogram) or its mel-warped, log-compressed cousin (Section G.3). Phase becomes essential again only when the model must synthesize audio, where dedicated vocoders (Griffin-Lim, HiFi-GAN, EnCodec) are responsible for reconstructing it from a magnitude representation.

DFT and FFT

Cooley, J. W., & Tukey, J. W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation 19(90), 297-301. doi:10.1090/S0025-5718-1965-0178586-1. The original FFT paper; the divide-and-conquer recursion behind every torch.fft.rfft call.

Frigo, M., & Johnson, S. G. (2005). "The Design and Implementation of FFTW3." Proceedings of the IEEE 93(2), 216-231. fftw.org. The reference C library that underlies most production FFT pipelines, including NumPy and SciPy.

Allen, J. B. (1977). "Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 25(3), 235-238. doi:10.1109/TASSP.1977.1162950. The classic paper that formalizes the STFT with overlap-add as a time-frequency representation.

Modern Implementations

PyTorch contributors. (2024). "torch.fft and torch.stft documentation." pytorch.org/docs/stable/fft.html. Official documentation for the FFT and STFT primitives used in the running PyTorch examples.

Griffin, D., & Lim, J. (1984). "Signal Estimation from Modified Short-Time Fourier Transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), 236-243. doi:10.1109/TASSP.1984.1164317. The standard phase-reconstruction algorithm referenced in the discussion of magnitude-only spectrograms and neural vocoders.