Section G.4: Z-Transform Primer | Building Language AI

The DFT and FFT of Section G.2 work on finite, bounded sequences, which covers nearly every signal an audio model will ever see. But the underlying theory of why these transforms diagonalize convolution, why some filters are stable and others diverge, and why resampling produces the spectral artifacts it does, is most cleanly expressed in the Z-transform. This section gives the minimum vocabulary needed to read the digital-filter literature and to understand the design choices behind audio resampling, denoising, and effects.

Definition

For a discrete-time sequence $x[n]$ defined on the integers, the two-sided Z-transform is

X(z) = \sum_{n = -\infty}^{+\infty} x[n] \, z^{-n} ,

where $z$ is a complex variable. The Z-transform embeds a discrete sequence into a continuous complex-valued function of $z$, which is exactly what lets continuous-function machinery (polynomial roots, contour integrals, factorization) be brought to bear on inherently discrete operations. It is the discrete-time analogue of the Laplace transform: the substitution $z = e^{s T_s}$, where $T_s$ is the sample period, converts the Laplace transform of a sampled signal directly into the Z-transform.
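As a quick numerical sanity check on the definition, the sketch below evaluates a truncated version of the sum for the geometric sequence $x[n] = a^n$ ($n \ge 0$) and compares it with the closed form $1 / (1 - a z^{-1})$. The helper name `z_transform_partial`, the value $a = 0.8$, and the evaluation point are illustrative assumptions, not part of any library.

```python
import numpy as np

def z_transform_partial(x, z, n0=0):
    """Partial sum of X(z) = sum_n x[n] z^{-n} over a finite slice.

    x  : samples x[n0], x[n0+1], ...
    z  : complex point at which to evaluate (must be nonzero)
    n0 : index of the first sample (hypothetical helper parameter)
    """
    x = np.asarray(x, dtype=complex)
    n = np.arange(n0, n0 + len(x), dtype=float)
    return np.sum(x * complex(z) ** (-n))

# Illustrative values: a decaying geometric sequence and a point with |z| > |a|
a = 0.8
n = np.arange(200)
x = a ** n
z = 1.2 * np.exp(0.3j)

approx = z_transform_partial(x, z)      # truncated series
closed_form = 1.0 / (1.0 - a / z)       # geometric-series limit, valid for |z| > |a|
print(abs(approx - closed_form))        # truncation error, negligible here
```

The agreement holds only because $|z| > |a|$; choosing a point with $|z| < |a|$ makes the partial sums grow without bound as more terms are added.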

Region of Convergence

The series above does not converge for every $z$. The set of complex numbers for which it does is the region of convergence (ROC), and it is always an annulus $r_1 < |z| < r_2$ centered on the origin, where $r_1$ may be $0$ and $r_2$ may be infinite. The ROC is part of the data: the same algebraic expression $X(z)$ can correspond to two different sequences with different ROCs (one right-sided, one left-sided). For causal sequences (those with $x[n] = 0$ for $n < 0$), the ROC is the exterior of a disk, $|z| > r_1$; for finite-length sequences, the ROC is the entire $z$-plane except possibly $z = 0$ or $z = \infty$.
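A standard worked example makes the "same expression, different ROC" point concrete. Here $u[n]$ denotes the unit step ($u[n] = 1$ for $n \ge 0$, else $0$) and $a$ is any nonzero complex constant; both symbols are introduced for this illustration only.

```latex
% Right-sided sequence: x_1[n] = a^n u[n]
X_1(z) = \sum_{n=0}^{\infty} a^n z^{-n}
       = \sum_{n=0}^{\infty} \left( a z^{-1} \right)^n
       = \frac{1}{1 - a z^{-1}} ,
\qquad |z| > |a| .

% Left-sided sequence: x_2[n] = -a^n u[-n-1]
X_2(z) = -\sum_{n=-\infty}^{-1} a^n z^{-n}
       = -\sum_{m=1}^{\infty} \left( a^{-1} z \right)^m
       = -\frac{a^{-1} z}{1 - a^{-1} z}
       = \frac{1}{1 - a z^{-1}} ,
\qquad |z| < |a| .
```

The two algebraic expressions are identical; only the ROC distinguishes the decaying (for $|a| < 1$) causal sequence from its anti-causal counterpart.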

Pole-Zero Plots

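The transfer function $H(z)$ of an LTI system described by a linear constant-coefficient difference equation is the Z-transform of its impulse response, and it is a ratio of polynomials in $z^{-1}$: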

H(z) = \frac{B(z)}{A(z)} = \frac{b_0 + b_1 z^{-1} + \cdots + b_M z^{-M}}{1 + a_1 z^{-1} + \cdots + a_N z^{-N}} .

The roots of $B(z)$ are zeros (values of $z$ for which $H(z) = 0$) and the roots of $A(z)$ are poles (values where $H(z) \to \infty$). Plotting these on the complex plane gives the pole-zero diagram, the single most useful visualization in digital-filter design. Zeros carve notches into the frequency response; poles create resonant peaks. Resonant audio effects (vowel-formant synthesis, EQ peaks, wah-wah pedals) are literally pole placements near the unit circle.
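The following sketch shows how poles and zeros can be read off a coefficient pair with `scipy.signal.tf2zpk`, and how a pole pair near the unit circle produces a peak in the frequency response. The filter itself (a two-pole resonator with radius 0.95 and angle $\pi/4$, plus zeros at $z = \pm 1$) is a made-up example chosen for illustration.

```python
import numpy as np
from scipy import signal

# Hypothetical resonator: poles at r * exp(+/- j*theta), zeros at z = +1 and z = -1
r, theta = 0.95, np.pi / 4
b = [1.0, 0.0, -1.0]                        # B(z) = 1 - z^{-2}
a = [1.0, -2 * r * np.cos(theta), r ** 2]   # A(z) = 1 - 2 r cos(theta) z^{-1} + r^2 z^{-2}

zeros, poles, gain = signal.tf2zpk(b, a)
print("zeros:", zeros)
print("poles:", poles, "with magnitudes", np.abs(poles))

# Frequency response on the upper half of the unit circle
w, h = signal.freqz(b, a, worN=2048)
peak_freq = w[np.argmax(np.abs(h))]         # radians per sample
print("response peaks near", peak_freq, "rad/sample; pole angle is", theta)
```

Moving `r` closer to 1 narrows and sharpens the peak, while the zeros at $z = \pm 1$ force the response to vanish at DC and at Nyquist.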

The Unit Circle Is the DTFT

Substitute $z = e^{j \omega}$ in the Z-transform definition. The result is

X(e^{j \omega}) = \sum_{n} x[n] \, e^{-j \omega n} ,

which is exactly the discrete-time Fourier transform (DTFT) of $x[n]$ at angular frequency $\omega$, in radians per sample. The substitution is valid whenever the ROC contains the unit circle. The unit circle $|z| = 1$ on the $z$-plane is therefore the frequency axis: traversing it from $\omega = 0$ (the point $z = 1$) to $\omega = \pi$ (the point $z = -1$) sweeps the DTFT from DC to the Nyquist frequency. For a length-$N$ sequence, the DFT is just the DTFT sampled at $N$ equally spaced points on this circle, $\omega_k = 2 \pi k / N$ for $k = 0, \ldots, N - 1$. This is the precise sense in which "the FFT lives on the unit circle": the DFT bins are uniformly spaced samples of $X(z)$ along $|z| = 1$.
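The claim that DFT bins are samples of the Z-transform on the unit circle can be checked directly. The sketch below evaluates $X(z)$ for a short random sequence at $z_k = e^{j 2 \pi k / N}$ and compares the result with `np.fft.fft`; the helper `X_of_z` and the test sequence are assumptions made for this illustration.

```python
import numpy as np

def X_of_z(x, z):
    """Evaluate X(z) = sum_{n=0}^{N-1} x[n] z^{-n} for a finite causal sequence (z != 0)."""
    # np.polyval expects the highest power first, so reversing x gives a
    # polynomial in the variable z^{-1} with x[0] as the constant term.
    return np.polyval(np.asarray(x)[::-1], 1.0 / np.asarray(z))

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)

k = np.arange(N)
z_k = np.exp(2j * np.pi * k / N)     # N equally spaced points on |z| = 1
X_on_circle = X_of_z(x, z_k)

print(np.allclose(X_on_circle, np.fft.fft(x)))
```

Evaluating `X_of_z` at $z = 1$ returns the sum of the samples (the DC bin), and at $z = -1$ it returns the alternating sum (the Nyquist bin when $N$ is even).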

The unit circle also separates stable from unstable systems. A causal LTI filter is stable (bounded input gives bounded output, BIBO) if and only if all of its poles lie strictly inside the unit circle, $|z_{\text{pole}}| < 1$. A pole on the circle gives a marginally stable resonator (an undamped oscillator); a pole outside gives an exploding response. Filter designers therefore think of stability as a purely geometric condition on the pole positions, which is enormously easier to reason about than a time-domain stability proof.
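The geometric stability test is short enough to write down. The sketch below checks pole magnitudes for a causal filter and then compares impulse responses of a one-pole filter $y[n] = p \, y[n-1] + x[n]$ with the pole inside, on, and outside the unit circle. The helper name `is_stable` and the three pole values are illustrative choices.

```python
import numpy as np
from scipy import signal

def is_stable(a):
    """True if every root of A(z) lies strictly inside the unit circle.

    `a` holds the denominator coefficients [1, a_1, ..., a_N] of a causal filter.
    """
    poles = np.roots(a)
    return bool(np.all(np.abs(poles) < 1.0))

impulse = np.zeros(200)
impulse[0] = 1.0

responses = {}
for p in (0.9, 1.0, 1.1):                 # pole inside, on, and outside |z| = 1
    a = [1.0, -p]                         # H(z) = 1 / (1 - p z^{-1})
    responses[p] = signal.lfilter([1.0], a, impulse)
    print(p, is_stable(a), abs(responses[p][-1]))
```

The impulse response is $p^n$ in each case: it decays for $p = 0.9$, stays constant for $p = 1.0$, and grows geometrically for $p = 1.1$.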

Connection to Digital Filter Design and Audio Effects

Every digital audio effect in widespread use (low-pass and high-pass filters for resampling, parametric EQ for mixing, reverb and chorus, comb filters for pitch detection, the digital anti-alias filters that precede decimation) can be specified by where its designer places poles and zeros on the $z$-plane. The two big families are finite impulse response (FIR) and infinite impulse response (IIR) filters. FIR filters have only zeros (their denominator is $A(z) = 1$, so any poles sit at the origin) and are unconditionally stable, but they require many taps for sharp cutoffs. IIR filters have poles and zeros, achieve sharp cutoffs with few coefficients, and require pole-placement care to remain stable. The resampling stages that take 48 kHz audio down to 16 kHz for Whisper, the band-limiting filters that prevent aliasing in mel filter banks, and the low-pass reconstruction filters in neural vocoders are all FIR or IIR designs justified by exactly this $z$-plane geometry.
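To make the FIR/IIR contrast tangible, the sketch below designs one low-pass filter of each kind with SciPy and inspects their denominators and pole positions. The specific choices (101 taps, a 6th-order Butterworth, a 1 kHz cutoff at 16 kHz) are assumptions for illustration, not a recommendation or a matched-performance comparison.

```python
import numpy as np
from scipy import signal

fs = 16000       # sample rate in Hz (illustrative)
cutoff = 1000    # cutoff frequency in Hz (illustrative)

# FIR: windowed-sinc low-pass, numerator only, so A(z) = 1
fir_taps = signal.firwin(101, cutoff, fs=fs)
fir_a = [1.0]

# IIR: 6th-order Butterworth low-pass, numerator and denominator
b_iir, a_iir = signal.butter(6, cutoff, btype="low", fs=fs)
iir_poles = np.roots(a_iir)

print("FIR coefficients:", len(fir_taps))
print("IIR coefficients:", len(b_iir) + len(a_iir) - 1)   # a[0] = 1 carries no information
print("largest IIR pole magnitude:", np.max(np.abs(iir_poles)))

# Both designs are normalized to unit gain at DC (z = 1)
fir_dc_gain = np.sum(fir_taps)
iir_dc_gain = np.sum(b_iir) / np.sum(a_iir)
print("DC gains:", fir_dc_gain, iir_dc_gain)
```

The FIR design cannot become unstable because its denominator has no roots away from the origin, whereas the Butterworth poles must be verified to lie inside the unit circle, which they do here.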

See Also

The Z-transform pole-zero geometry developed here is also the design language of neural audio codecs covered in Section 20.0.2. The analysis and synthesis filter banks inside EnCodec and SoundStream are stacks of FIR and IIR designs whose pole placements determine their reconstruction quality, latency, and stability; the same unit-circle picture that explains a wah-wah pedal's resonance also explains why a codec's analysis filter is causal and minimum-phase. Readers studying codec models in Chapter 20 will recognise every term used here.

Key Insight

The Z-transform embeds any discrete-time sequence into a complex-valued function whose pole-zero geometry on the $z$-plane describes any rational LTI filter up to a constant gain: for a causal filter, poles strictly inside the unit circle give stability, poles close to the circle give resonance, the unit circle itself is the DTFT, and the DFT bins are uniformly spaced samples of it. Every digital audio effect, resampler, and anti-alias filter in the audio pipelines of Chapter 20 is specified by exactly this geometry.

Exercise G.4.1: Hear the Pole-Zero Map

Objective. Make the pole-zero abstraction audible by designing two filters and listening to the difference.

Task. Generate 3 seconds of white noise at 16 kHz. Then: (a) design a 6th-order Butterworth low-pass filter with scipy.signal.butter(6, 1000, btype="low", fs=16000); (b) design a narrow band-pass with center 1000 Hz and Q = 30 via scipy.signal.iirpeak(1000, 30, fs=16000). Plot the pole-zero map of each. SciPy has no built-in equivalent of MATLAB's zplane, so convert the returned (b, a) coefficients with scipy.signal.tf2zpk and scatter-plot the poles and zeros with matplotlib, drawing the unit circle for reference. Apply each filter to the noise via scipy.signal.lfilter, save as WAV, and listen. Describe in one sentence per filter how the pole geometry matches the sound.

Stretch. Move one of the band-pass poles slightly outside the unit circle and confirm that the filter output diverges over time, validating the stability rule.

Further Reading

Foundational Texts

Oppenheim, A. V., & Schafer, R. W. (2010). Discrete-Time Signal Processing (3rd ed.). Pearson. The canonical graduate textbook for the DFT, FFT, STFT, Z-transform, and digital filter design; every claim in this appendix can be traced to a chapter here.

Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A Scale for the Measurement of the Psychological Magnitude Pitch." The Journal of the Acoustical Society of America 8(3), 185-190. doi:10.1121/1.1915893. The original psychophysical experiments that defined the mel scale; the constants in the modern closed-form mel formula trace back to this paper.

Audio Machine Learning

McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015). "librosa: Audio and Music Signal Analysis in Python." Proceedings of the 14th Python in Science Conference (SciPy 2015). The reference implementation for all of the operations in this appendix (STFT, mel filter bank, MFCC) and the standard audio-feature toolkit in Python.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." OpenAI Technical Report. arXiv:2212.04356. The Whisper paper; documents the exact 25 ms / 10 ms / 80-mel feature pipeline used as the running example throughout this appendix.

Datasets and Sample-Rate Conventions

Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). "LibriSpeech: An ASR Corpus Based on Public Domain Audio Books." ICASSP 2015, 5206-5210. openslr.org/12. The reference English ASR benchmark, distributed at 16 kHz, which fixed the modern speech-recognition sample-rate convention used by nearly every model in Chapter 20.

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2020). "Common Voice: A Massively-Multilingual Speech Corpus." LREC 2020. commonvoice.mozilla.org. Mozilla's open multilingual speech dataset, distributed at 48 kHz and routinely resampled to 16 kHz for Whisper-class models; a working example of the anti-alias filtering discussed in this appendix.