Appendix G: Signal Processing for Audio

"Give me a waveform and I will give you a spectrogram. Give me a spectrogram and I will give you eighty mel bins, a logarithm, and a confidently wrong transcription. The signal was always there; we just had to warp it until a transformer could love it."
Spectra, Frequency-Obsessed Audio Agent

Big Picture

Why a signal-processing appendix in an LLM book? Because none of the audio models in Chapter 20 (Whisper, AudioLM, MusicGen, Bark) can be understood without it. Whisper is the clearest case: it ingests log-mel spectrograms, not raw waveforms. Its transformer encoder never sees pressure samples; it sees a stack of perceptually warped frequency bins computed from short overlapping windows of the signal. The codec-based models (AudioLM, MusicGen, Bark) consume learned audio tokens instead, but those tokens are still derived from sampled waveforms and rest on the same sampling and frequency-domain concepts. This appendix collects the four ideas that explain that pipeline: how a continuous waveform becomes a discrete sample sequence (G.1), how the DFT and FFT turn each short frame into a frequency spectrum (G.2), how mel filtering and a logarithm compress that spectrum into something an ear (and a transformer) can use (G.3), and how the Z-transform unifies all of the above and underlies the digital filters used for resampling, denoising, and audio effects (G.4).
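As a rough preview of where G.1 through G.3 are headed, the following NumPy-only sketch turns a waveform into a log-mel spectrogram. It is an illustration under stated assumptions, not the front end of any particular model: the helper names (hz_to_mel, mel_to_hz, mel_filterbank, log_mel_spectrogram) are hypothetical, the 16 kHz rate, 400-sample window, 160-sample hop, and 80 mel bins are assumed defaults, and the HTK-style mel formula is only one of several conventions in use.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumed convention; others exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)      # (n_fft // 2 + 1,)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=80):
    # 1. Slice the sampled waveform into short overlapping frames.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    # 2. Window each frame and take the FFT to get a power spectrum.
    window = np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    # 3. Pool FFT bins into mel bands, then compress with a logarithm.
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log10(np.maximum(mel, 1e-10))             # floor avoids log(0)

# One second of a 440 Hz sine sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
S = log_mel_spectrogram(x, sr=sr)
print(S.shape)  # (98, 80): 98 frames, 80 mel bins
```

Each row of S is one frame and each column one mel band, so a pure tone shows up as a bright stripe in the band that covers its frequency. In practice a library such as librosa would replace all of this; the hand-written version is only meant to make the three steps visible.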

This appendix is deliberately compact: roughly two pages of dense reference material, not a textbook. Readers who want more depth should consult Oppenheim and Schafer's Discrete-Time Signal Processing or work through the librosa tutorial. Readers in a hurry should read G.1 and G.3 in full and skim G.2 and G.4.

Note: Prerequisites

This appendix assumes the linear-algebra and calculus reminders in Appendix A, particularly inner products as projections and the basics of complex exponentials. No prior signal-processing course is required; everything is built from the view that complex exponentials are eigenfunctions of linear time-invariant (LTI) operators, the same conceptual lens used in the source lectures.

Sections