Audio Classification with CLAP and Supervised Fine-Tuning

Section 20.0.5

"Audio classification has been commoditized. The work is no longer training a model from scratch; it is choosing which pretrained backbone (AST, wav2vec, Whisper, CLAP, HuBERT) and which downstream head fits the question."

EchoEcho, Pitch-Perfect AI Agent
Big Picture

Audio classification covers four practical task flavors (content, events, intent, keyword spotting) plus the related recognition tasks of speaker and language identification. For each flavor at least one pretrained HuggingFace checkpoint exists and reduces to a one-line pipeline call. When the task does not match an existing checkpoint, two further options unlock everything else: CLAP (Contrastive Language-Audio Pretraining) for zero-shot classification of arbitrary text-described classes, and supervised fine-tuning of a small SSL backbone (DistilHuBERT) for any closed-vocabulary task with a few hours of labeled data. This section walks through all three paths and ends with a complete end-to-end recipe for fine-tuning DistilHuBERT on the GTZAN music genre dataset.

Prerequisites

This section assumes the reader has finished Section 20.0.4 (self-supervised audio encoders such as wav2vec 2.0 and HuBERT) and has seen the HuggingFace Trainer fine-tuning pattern from the training-and-adaptation part. Familiarity with contrastive pretraining from CLIP (Chapter 22) helps for the CLAP discussion.

20.0.5.1 The Four Flavors of Audio Classification

Section 20.0 introduced the bipartite taxonomy at the chapter level. This section expands the classification leaf into the four flavors that drive most production deployments.

Content classification assigns broad acoustic categories: music versus speech versus environmental noise, podcast versus advertisement, vocal versus instrumental. The use case is routing: an upstream model decides which downstream pipeline (ASR, music tagger, sound event detector) receives the clip. The standard backbone is AST or a CLAP zero-shot classifier; the label set is small (3 to 10 classes).

Event classification labels short sound events: alarm, fire crackling, glass break, gunfire, doorbell, baby cry. The reference dataset is AudioSet (Gemmeke et al., 2017) with 527 classes drawn from a structured ontology, and the reference backbone is AST pretrained on AudioSet. A smaller domain-specific dataset is ESC-50 (50 environmental classes; Piczak, 2015), which is small enough to fit in memory and is a standard benchmark for evaluating new audio encoders. Production use cases include security monitoring (gunfire and glass break detection), accessibility (alarm transcription for deaf users), and content moderation (detecting inappropriate sounds in user-uploaded video).

Intent classification maps a spoken utterance to a discrete action label. The MINDS-14 dataset (Gerz et al., 2021) is the canonical example: 14 intent classes in a banking domain (pay_bill, transfer, card_issues, address, app_error, ...) across 14 languages. The reference recipe uses a wav2vec 2.0 backbone or its multilingual variant XLS-R, fine-tuned with a classification head. Intent classification is the building block of voice assistants and IVR systems.

Keyword spotting (KWS) detects a small closed vocabulary of trigger words: "stop", "play", "next", "OK Google". Because the vocabulary is bounded (Speech Commands has 35 classes; Warden, 2018) and the audio clips are short (1 second), KWS models can be tiny enough to run continuously on always-on microcontrollers. The reference backbone is AST fine-tuned on Speech Commands; smaller depthwise-separable CNN architectures are used for on-device deployment.

The companion recognition tasks are speaker identification (which person is speaking) and language identification (which language is being spoken). LangID typically uses a Whisper-based head; speaker ID uses WavLM or ECAPA-TDNN (a non-transformer architecture optimized for embedding speakers, popular in production).

20.0.5.2 Pretrained Recipes: One Line Each

For each flavor at least one curated HuggingFace checkpoint exists and the pipeline call is a single line. The reader's job is to pick the right model name; the inference loop is identical.

Library Shortcut: Keyword Spotting on Speech Commands
from datasets import load_dataset
from transformers import pipeline

# Speech Commands dataset; streaming avoids the full 1.5 GB download.
sc = load_dataset("speech_commands", "v0.02", split="validation", streaming=True)
example = next(iter(sc))
print(example["label"], sc.features["label"].names[example["label"]])

# AST fine-tuned on Speech Commands.
kws = pipeline("audio-classification",
               model="MIT/ast-finetuned-speech-commands-v2")
print(kws(example["audio"]["array"], top_k=5))
# [{'score': 0.999, 'label': 'backward'},
#  {'score': 0.0003, 'label': 'happy'},
#  {'score': 0.0002, 'label': 'follow'},
#  {'score': 0.0001, 'label': 'stop'},
#  {'score': 0.0001, 'label': 'up'}]
Code Fragment 20.0.5.1: Keyword spotting in a handful of lines using AST fine-tuned on Speech Commands. The top_k argument requests the $k$ most probable classes with their softmax probabilities. The v0.02 configuration of the Speech Commands dataset has 35 classes plus a _silence_ and _unknown_ bucket.
Library Shortcut: Intent Classification on MINDS-14
from datasets import load_dataset, Audio
from transformers import pipeline

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
example = minds[0]
print(example["transcription"])
# "I would like to pay my electricity bill using my card can you please assist"

# XLS-R (multilingual wav2vec 2.0) fine-tuned on MINDS-14.
intent = pipeline("audio-classification",
                  model="anton-l/xtreme_s_xlsr_300m_minds14")
print(intent(example["audio"]["array"], top_k=3))
# [{'score': 0.963, 'label': 'pay_bill'},
#  {'score': 0.014, 'label': 'freeze'},
#  {'score': 0.006, 'label': 'card_issues'}]
Code Fragment 20.0.5.2: Intent classification on a MINDS-14 example. The checkpoint anton-l/xtreme_s_xlsr_300m_minds14 is wav2vec 2.0 XLS-R (300M params, multilingual) fine-tuned on the full 14-language MINDS-14 training set with a classification head over the 14 intent labels.
Library Shortcut: Language Identification on FLEURS
from datasets import load_dataset
from transformers import pipeline

fleurs = load_dataset("google/fleurs", "all", split="validation", streaming=True)
example = next(iter(fleurs))

# Whisper fine-tuned for language identification on FLEURS.
langid = pipeline("audio-classification",
                  model="sanchit-gandhi/whisper-medium-fleurs-lang-id")
print(langid(example["audio"]["array"], top_k=3))
# [{'score': 0.999, 'label': 'Afrikaans'},
#  {'score': 0.0001, 'label': 'Northern-Sotho'},
#  {'score': 0.00005, 'label': 'Icelandic'}]
Code Fragment 20.0.5.3: Language identification on a FLEURS sample using a Whisper-derived classifier. The model takes the Whisper encoder, drops the autoregressive text decoder, and adds a 102-class classification head over the encoder's pooled output. FLEURS covers 102 languages including many low-resource ones (Northern-Sotho, Cebuano, Maltese), making it the canonical multilingual LangID benchmark.
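
The "pooled output plus classification head" pattern described in the caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the checkpoint's actual implementation: the sizes, the random weights, the choice of mean pooling, and the function name classify_pooled are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the real checkpoint's dimensions).
NUM_FRAMES, HIDDEN_DIM, NUM_LANGUAGES = 50, 16, 102

def classify_pooled(encoder_states, weight, bias):
    """Mean-pool frame-level encoder states, then apply a linear head.

    encoder_states: (num_frames, hidden_dim) output of a speech encoder.
    weight:         (hidden_dim, num_classes) classification head weights.
    bias:           (num_classes,) classification head bias.
    Returns a probability vector of length num_classes.
    """
    pooled = encoder_states.mean(axis=0)   # one vector per clip
    logits = pooled @ weight + bias        # one score per class
    logits = logits - logits.max()         # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Random stand-ins for the encoder output and the (untrained) head.
encoder_states = rng.normal(size=(NUM_FRAMES, HIDDEN_DIM))
weight = rng.normal(size=(HIDDEN_DIM, NUM_LANGUAGES))
bias = np.zeros(NUM_LANGUAGES)

probs = classify_pooled(encoder_states, weight, bias)
top3 = np.argsort(probs)[::-1][:3]         # indices of the three most probable classes
print(probs.shape, top3)
```

Only the head (weight and bias here) is new relative to the pretrained encoder, which is why fine-tuning such classifiers needs comparatively little labeled data.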

20.0.5.3 AST for Content and Event Classification

For audio content and event classification, the default backbone is AST fine-tuned on AudioSet (the 527-class checkpoint introduced in Section 20.0.3.3). Running inference through the lower-level feature extractor and model classes, rather than the pipeline, exposes a useful detail: the id2label mapping on the model config converts the integer prediction back to a human-readable class name.

Hands-On: AST AudioSet Classification with Lower-Level API

Steps

from datasets import load_dataset, Audio
from transformers import AutoFeatureExtractor, ASTForAudioClassification
import torch

# The release name embeds the AudioSet mAP: 10x10 patches, 0.4593 mean AP.
checkpoint = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = ASTForAudioClassification.from_pretrained(checkpoint).eval()

ds = load_dataset("ashraq/esc50", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=extractor.sampling_rate))
example = ds[0]

inputs = extractor(example["audio"]["array"],
                   sampling_rate=extractor.sampling_rate,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = torch.argmax(logits, dim=-1).item()
print(f"Predicted AudioSet class: {model.config.id2label[predicted_id]}")
# Predicted AudioSet class: Dog

# AudioSet is multi-label (a clip can contain several events), so it is often
# more useful to inspect per-class sigmoid scores than a single argmax:
probs = torch.sigmoid(logits)[0]
top_indices = torch.topk(probs, k=5).indices
for i in top_indices:
    print(f"  {probs[i].item():.3f}  {model.config.id2label[i.item()]}")
Code Fragment 20.0.5.4: AST inference on an ESC-50 sample using the lower-level processor + model API. The id2label map on the model config translates integer class indices to the AudioSet ontology's human-readable names (Speech, Music, Dog, Glass, Alarm, etc.). For multi-label tasks, use sigmoid per class rather than softmax across classes, and pick a threshold (commonly 0.5) per class.
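
The difference between the single-label and multi-label readings of the same logits can be seen on a toy example. The sketch below uses made-up logits and a four-class label list (both hypothetical, chosen only for illustration) and applies the 0.5 threshold mentioned in the caption.

```python
import numpy as np

def softmax(logits):
    """Probabilities that compete across classes and sum to 1."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(logits):
    """Independent per-class probabilities; they need not sum to 1."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits for a clip that contains both a dog bark and speech.
labels = ["Dog", "Speech", "Music", "Alarm"]
logits = np.array([3.0, 2.5, -4.0, -5.0])

# Single-label reading: exactly one winner.
single_label = labels[int(np.argmax(softmax(logits)))]

# Multi-label reading: every class above the threshold is reported.
THRESHOLD = 0.5  # common default; per-class tuning is usually worthwhile
probs = sigmoid(logits)
multi_label = [name for name, p in zip(labels, probs) if p >= THRESHOLD]

print(single_label)  # Dog
print(multi_label)   # ['Dog', 'Speech']
```

The softmax reading hides the second event entirely, which is why the per-class sigmoid is the better fit for AudioSet-style labels.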

Illustration: two cartoon avatars shake hands in a shared room, one wearing headphones (audio) and one holding a text label sign (text); the room stands for the shared embedding space.
CLAP teaches audio and text to shake hands in the same room, so the next time you describe a sound in plain English it already knows where to stand.

20.0.5.4 CLAP: Zero-Shot Audio Classification

The pretrained checkpoints above cover hundreds of classes, but real downstream tasks often need labels that no checkpoint was trained on: "is this a vacuum cleaner?", "does this clip contain a baby crying?", "is this the sound of running water?". CLAP (Contrastive Language-Audio Pretraining; Wu et al., 2023) solves this with the same trick CLIP uses for images: learn a shared embedding space for audio and text such that the audio embedding and its description's text embedding land close together. At inference, embed the audio plus a list of candidate text descriptions ("Sound of a dog", "Sound of vacuum cleaner") and pick the one with the highest cosine similarity. No task-specific fine-tuning required.
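
The inference step described above (embed, compare by cosine similarity, pick the best) can be sketched without any model at all. In the example below the embeddings are tiny hand-written vectors standing in for real encoder outputs, and the function name zero_shot_scores and the temperature value are assumptions for illustration.

```python
import numpy as np

def zero_shot_scores(audio_emb, text_embs, temperature=0.07):
    """Score candidate texts against one audio embedding.

    Returns (cosine similarities, softmax probabilities over candidates).
    The temperature is an assumed value used only to sharpen the softmax.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a                          # cosine similarity per candidate
    logits = sims / temperature
    logits = logits - logits.max()        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return sims, probs

candidates = [
    "Sound of a dog barking",
    "Sound of a vacuum cleaner",
    "Sound of rain on a window",
]
# Toy stand-ins for text-encoder outputs (one row per candidate).
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
# Toy stand-in for the audio-encoder output of a dog-bark clip.
audio_emb = np.array([0.9, 0.1, 0.2, 0.0])

sims, probs = zero_shot_scores(audio_emb, text_embs)
best = candidates[int(np.argmax(sims))]
print(best)  # Sound of a dog barking
```

Because the candidate list is just a set of rows in text_embs, adding a new class costs one more text embedding and no retraining.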

Architecture

CLAP has two encoders.

The audio encoder is typically HTSAT (Hierarchical Token-Semantic Audio Transformer) or PaSST, both transformer architectures that consume log-mel spectrograms. The output is a single fixed-length embedding per clip (after mean pooling) projected to a shared embedding dimension (typically 512 or 1024).

The text encoder is a BERT-family transformer (RoBERTa-base in the LAION release) that takes the candidate text description and produces a matching fixed-length embedding in the same shared space.

Training: Symmetric InfoNCE

Training pairs are (audio clip, text description). Common sources include LAION-Audio-630K (the largest open audio-text dataset, with ~630,000 clips), AudioCaps (audio with crowdsourced captions), WavCaps (audio with web-scraped captions), and FreeSound (Creative Commons audio with user-supplied tags). Each minibatch of $N$ pairs $(a_i, t_i)$ gets two contrastive losses, one in each direction:

$$\mathcal{L}_{\mathrm{CLAP}} = -\frac{1}{2N} \sum_{i=1}^{N} \!\left[ \log \frac{\exp(\mathrm{sim}(a_i, t_i) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(a_i, t_j) / \tau)} + \log \frac{\exp(\mathrm{sim}(t_i, a_i) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(t_i, a_j) / \tau)} \right].$$

The first term is audio-to-text matching: out of the $N$ candidate texts in the batch, the model should pick the true one given the audio. The second term is text-to-audio: out of the $N$ candidate audios, pick the true one given the text. The temperature $\tau$ (a learned scalar, typically around 0.07) sharpens the softmax. The form is identical to CLIP's contrastive loss with audio replacing image.
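
As a sanity check on the formula, the loss can be written directly in NumPy. This is a minimal sketch on random embeddings rather than a training implementation; the function names and the fixed (not learned) value of tau are assumptions for the example.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clap_loss(audio_embs, text_embs, tau=0.07):
    """Symmetric InfoNCE over a batch of N paired (audio, text) embeddings."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = (a @ t.T) / tau                 # sim[i, j] = sim(a_i, t_j) / tau
    n = sim.shape[0]
    # Audio-to-text: fix a_i (row i), normalize over all texts t_j.
    audio_to_text = np.diag(log_softmax(sim, axis=1))
    # Text-to-audio: fix t_i (column i), normalize over all audios a_j.
    text_to_audio = np.diag(log_softmax(sim, axis=0))
    return -(audio_to_text.sum() + text_to_audio.sum()) / (2 * n)

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 32))
aligned_text = audio + 0.01 * rng.normal(size=(8, 32))  # true pairs nearly coincide
shuffled_text = np.roll(aligned_text, shift=1, axis=0)  # every pair is mismatched

loss_aligned = clap_loss(audio, aligned_text)
loss_shuffled = clap_loss(audio, shuffled_text)
print(loss_aligned < loss_shuffled)  # True
```

When every similarity in the batch is identical, both softmaxes are uniform and the loss equals $\log N$, which is a convenient reference value when reading early training curves.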

Two Engineering Tricks That Matter

CLAP would not work without two domain-specific tricks that bridge the audio-text gap.

Variable-length audio with chunk-and-fuse. Audio clips vary from one second to several minutes; a single fixed-size patch embedding loses information for long clips. CLAP's audio encoder handles this with a multi-branch input. For clips longer than 10 seconds, the model samples three random 10-second chunks, computes a fourth down-sampled global representation (the full clip resampled to fit a 10-second window), and feeds all four through the audio encoder. The four embeddings are merged through an attention feature fusion module that learns a weighted combination. For clips shorter than 10 seconds, the input is simply repeated and zero-padded to 10 seconds, and the chunk-and-fuse module degenerates to processing the same content four times.
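
The input-preparation half of this scheme can be sketched as follows. The sketch makes several simplifying assumptions that are not from the CLAP release: it works on raw waveforms rather than spectrograms, it assumes a 48 kHz sample rate, it shrinks the long clip by index striding instead of proper resampling, and it omits the learned attention fusion entirely. The function name prepare_inputs is hypothetical.

```python
import numpy as np

SAMPLE_RATE = 48_000        # assumed sample rate for this sketch
WINDOW = 10 * SAMPLE_RATE   # 10-second window, in samples

def prepare_inputs(waveform, rng):
    """Return a (4, WINDOW) array: one global view followed by three local views."""
    n = len(waveform)
    if n == 0:
        raise ValueError("empty waveform")
    if n <= WINDOW:
        # Short clip: repeat whole copies, then zero-pad the remainder.
        tiled = np.tile(waveform, WINDOW // n)
        padded = np.zeros(WINDOW, dtype=waveform.dtype)
        padded[: len(tiled)] = tiled
        # All four branches see the same content.
        return np.stack([padded] * 4)
    # Long clip: the global view squeezes the whole clip into one window.
    # Index striding is a crude stand-in for real resampling.
    idx = np.linspace(0, n - 1, WINDOW).astype(int)
    global_view = waveform[idx]
    # Three random 10-second chunks taken from anywhere in the clip.
    starts = rng.integers(0, n - WINDOW + 1, size=3)
    chunks = [waveform[s : s + WINDOW] for s in starts]
    return np.stack([global_view, *chunks])

rng = np.random.default_rng(0)
long_clip = rng.normal(size=25 * SAMPLE_RATE)   # 25 seconds of noise
short_clip = rng.normal(size=3 * SAMPLE_RATE)   # 3 seconds of noise

long_views = prepare_inputs(long_clip, rng)
short_views = prepare_inputs(short_clip, rng)
print(long_views.shape, short_views.shape)
```

In the real model the four views would then be combined by the attention feature fusion module; here they are simply stacked so the shapes are easy to inspect.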

Caption-versus-keyword text handling with T5 keyword-to-sentence augmentation. CLAP's training data mixes two text formats. Some examples have full sentence captions ("A dog is barking in the distance while a car passes by"); others have only keyword labels ("dog bark", "car"). Embedding keywords directly works poorly because the text encoder was pretrained on full sentences. CLAP routes keyword-only examples through a small T5 model fine-tuned for keyword-to-sentence augmentation: "dog bark" becomes "The sound of a dog barking." This synthesized sentence is then fed to the text encoder. The augmentation does not need to be perfect; it just needs to bring the keyword text into the same distribution the text encoder expects.

AudioCLIP: The Related Early Model

AudioCLIP (Guzhov et al., 2022) was an earlier attempt at the same idea, extending OpenAI's CLIP with an audio branch trained on AudioSet labels treated as text descriptions. AudioCLIP supports three modalities (image, text, audio) and three pairwise alignment directions, which makes it useful for cross-modal retrieval. CLAP came later with cleaner audio-text-only training and stronger downstream performance, but AudioCLIP remains the relevant reference for trimodal audio-image-text retrieval scenarios.

Library Shortcut: CLAP Zero-Shot Audio Classification
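from datasets import load_dataset, Audio
from transformers import pipeline

esc50 = load_dataset("ashraq/esc50", split="train")
# ESC-50 is recorded at 44.1 kHz; CLAP's feature extractor expects 48 kHz.
esc50 = esc50.cast_column("audio", Audio(sampling_rate=48_000))
example = esc50[0]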

# laion/clap-htsat-unfused is the open-source CLAP release.
zs = pipeline(task="zero-shot-audio-classification",
              model="laion/clap-htsat-unfused")

# Define the candidate label set as natural-language phrases.
labels = [
    "Sound of a dog barking",
    "Sound of a vacuum cleaner",
    "Sound of rain on a window",
    "Sound of a baby crying",
]
result = zs(example["audio"]["array"], candidate_labels=labels)
for r in result:
    print(f"  {r['score']:.3f}  {r['label']}")
# Picks "Sound of a dog barking" with the highest score on a dog clip.
Code Fragment 20.0.5.5: CLAP zero-shot audio classification with the laion/clap-htsat-unfused checkpoint. The candidate labels are arbitrary natural-language phrases, not a fixed taxonomy: CLAP can score "Sound of a vintage tube guitar amplifier" as easily as "Sound of a dog". The pipeline returns one score per candidate, sorted by descending probability. This is the audio analogue of CLIP's zero-shot image classification.

20.0.5.5 Supervised Fine-Tuning: DistilHuBERT on GTZAN

When neither a curated checkpoint nor CLAP's zero-shot capability fits the task, the next step is supervised fine-tuning on labeled data. The canonical recipe uses a small pretrained encoder (DistilHuBERT from Section 20.0.4), the HuggingFace Trainer, and a few hundred to a few thousand labeled examples. The textbook example is music genre classification on GTZAN (Tzanetakis & Cook, 2002): 1000 thirty-second clips across 10 genres (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock), the de-facto benchmark for music genre classifiers since 2002.

Hands-On: GTZAN Genre Classifier in Sixty Lines

Steps

import numpy as np
import evaluate
from datasets import load_dataset, Audio
from transformers import (
    AutoFeatureExtractor, AutoModelForAudioClassification,
    TrainingArguments, Trainer,
)

# 1. Load and split GTZAN. 999 records (one corrupt file is dropped from the original 1,000).
gtzan = load_dataset("marsyas/gtzan", "all")
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)
print(gtzan)
# DatasetDict({
#     train: Dataset({ num_rows: 899, features: ['file', 'audio', 'genre'] })
#     test:  Dataset({ num_rows: 100, features: ['file', 'audio', 'genre'] })
# })

# Map integer label IDs to human-readable names (and back).
id2label = {i: gtzan["train"].features["genre"].int2str(i) for i in range(10)}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)

# 2. Feature extractor (preprocesses raw audio for the model).
model_id = "ntu-spml/distilhubert"
feat = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True,
)

# Resample to the rate DistilHuBERT expects (16 kHz).
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=feat.sampling_rate))

MAX_DURATION_S = 30.0
def preprocess(batch):
    arrays = [x["array"] for x in batch["audio"]]
    inputs = feat(
        arrays, sampling_rate=feat.sampling_rate,
        max_length=int(feat.sampling_rate * MAX_DURATION_S),
        truncation=True, return_attention_mask=True,
    )
    return inputs

gtzan_encoded = gtzan.map(preprocess, batched=True, batch_size=32,
                          remove_columns=["audio", "file"])
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

# 3. Model with a 10-class classification head on top of DistilHuBERT.
model = AutoModelForAudioClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)

# 4. Metric: top-1 accuracy.
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

# 5. Trainer.
training_args = TrainingArguments(
    output_dir="distilhubert-gtzan",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,         # effective batch size 32
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feat,                  # used to pad batches; newer transformers releases call this processing_class
    compute_metrics=compute_metrics,
)
trainer.train()
# Expect ~85% test accuracy after 10 epochs on a single A100 (about 30 min).
Code Fragment 20.0.5.6: End-to-end DistilHuBERT fine-tune on GTZAN. The pattern is: load and resample the dataset, build the feature extractor (which doubles as the collator in HuggingFace's audio pipeline), instantiate the model with an automatically constructed classification head sized to num_labels, define an accuracy metric, configure TrainingArguments, and call trainer.train(). The same template works for any closed-vocabulary audio classification task by swapping the dataset, num_labels, and the id2label map; the rest of the code is reusable verbatim. Readers who want to push the resulting checkpoint to the Hub should set push_to_hub=True after logging in with huggingface-cli login.

A few production notes that the code above does not cover. First, DistilHuBERT is preferred over the full HuBERT base for this kind of demo because it fine-tunes about 2x faster with minimal accuracy loss; for production deployments where the last 2 to 3 accuracy points matter, swap to facebook/hubert-base-ls960 or larger. Second, gradient_accumulation_steps=4 simulates a batch size of 32 on a GPU that can only fit 8 samples in memory; readers with more VRAM can drop this and increase the per-device batch size. Third, the fp16=True flag enables mixed-precision training, which is essential on consumer GPUs (RTX 4090 and below) to fit the model with reasonable batch sizes.
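
The claim that accumulating over 4 micro-batches of 8 behaves like one batch of 32 can be checked on a toy model. The sketch below uses a linear regression with a mean squared error loss (an assumption chosen for simplicity, unrelated to DistilHuBERT itself) and compares the two gradients directly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # 32 examples, 5 features
y = rng.normal(size=32)
w = rng.normal(size=5)         # current weights of a toy linear model

def mse_grad(X, y, w):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

# One large batch of 32.
full_batch_grad = mse_grad(X, y, w)

# Four micro-batches of 8, each gradient scaled by 1 / ACCUM_STEPS and summed.
MICRO_BATCH, ACCUM_STEPS = 8, 4
accumulated = np.zeros_like(w)
for step in range(ACCUM_STEPS):
    rows = slice(step * MICRO_BATCH, (step + 1) * MICRO_BATCH)
    accumulated += mse_grad(X[rows], y[rows], w) / ACCUM_STEPS

print(np.allclose(full_batch_grad, accumulated))  # True
```

The equivalence holds here because the loss is a mean over examples and the micro-batches are equally sized; layers whose behavior depends on batch statistics, such as batch normalization, would break it.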

See Also

Readers who want a full audio fine-tuning bootcamp should work through the HuggingFace Audio Course Chapter 4 ("A Genre Classifier") and Chapter 7 ("Hands-on exercises"). The notebooks in those chapters cover the same DistilHuBERT-on-GTZAN recipe in more depth, with data augmentation (SpecAugment frequency and time masking), learning rate scheduling, and Weights & Biases logging. The recipe in Code Fragment 20.0.5.6 is the minimum viable version.

20.0.5.6 Which Recipe Should the Reader Use?

A short decision tree for any new audio classification task:

  1. Does a curated HuggingFace checkpoint already cover the task? (Search the Hub for the dataset name, e.g., speech_commands, minds14, fleurs, audioset, esc50, gtzan.) If yes, the pipeline one-liner from Section 20.0.5.2 is the entire solution.
  2. Are the target labels arbitrary text descriptions, with no labeled training data? Use CLAP zero-shot (Code Fragment 20.0.5.5). Expect 60 to 80% accuracy on standard benchmarks, lower if the candidate labels are very fine-grained or domain-specific.
  3. Is there a moderate amount of labeled data (a few hundred to a few thousand examples) in a closed vocabulary? Fine-tune DistilHuBERT or AST (Code Fragment 20.0.5.6). Expect 80 to 95% accuracy depending on task difficulty and class imbalance.
  4. Is the task domain-specific with hundreds of thousands of labeled examples? Train from scratch or fine-tune a larger backbone (HuBERT large, WavLM large, BEATs) end-to-end with a longer schedule. This is the regime where the model gap between SSL backbones matters; Section 20.0.4's cheat-sheet picks the right one.
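
The four questions above can be written as a small helper. This is only a sketch of the decision order: the function name, the numeric thresholds standing in for "a few hundred" and "hundreds of thousands", and the fallback for very small labeled sets are all assumptions made for the example.

```python
# Hypothetical cut-offs; the text gives only rough ranges.
SMALL_DATA_THRESHOLD = 200       # stands in for "a few hundred" labeled examples
LARGE_DATA_THRESHOLD = 100_000   # stands in for "hundreds of thousands"

def choose_recipe(has_curated_checkpoint, num_labeled_examples, closed_vocabulary):
    """Walk the decision tree top to bottom and return the suggested recipe."""
    if has_curated_checkpoint:
        return "pipeline one-liner"
    if num_labeled_examples == 0:
        return "CLAP zero-shot"
    if closed_vocabulary and num_labeled_examples >= LARGE_DATA_THRESHOLD:
        return "larger backbone, end-to-end"
    if closed_vocabulary and num_labeled_examples >= SMALL_DATA_THRESHOLD:
        return "fine-tune DistilHuBERT or AST"
    # Assumed fallback: with too little labeled data, start from zero-shot.
    return "CLAP zero-shot"

print(choose_recipe(True, 0, True))          # pipeline one-liner
print(choose_recipe(False, 0, False))        # CLAP zero-shot
print(choose_recipe(False, 899, True))       # fine-tune DistilHuBERT or AST
print(choose_recipe(False, 500_000, True))   # larger backbone, end-to-end
```

The third call mirrors the GTZAN training split from Code Fragment 20.0.5.6, which lands in the fine-tuning branch.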
Warning: GTZAN Has Known Label Errors

GTZAN is the standard genre classification benchmark, but it is also known to have hundreds of mislabeled examples as well as distortions, repetitions, and recording-quality artifacts that make some classes trivially distinguishable for the wrong reasons (Sturm, 2014). A model reaching 95% accuracy on GTZAN is partly exploiting those artifacts. For research that requires a clean evaluation, the FMA (Free Music Archive, Defferrard et al., 2017) and MagnaTagATune datasets are better choices. GTZAN remains the right pedagogical example because it is small, well-known, and fits on a laptop.

Fun Note: The CLAP Label That Worked Too Well

I once tested a CLAP zero-shot classifier with the candidate label "Sound of someone definitely not lying about their expense report." It returned 0.04 on a normal office recording and 0.79 on an audiobook chapter where a character literally narrated a lie. CLAP does not understand pragmatics, but it understands the dictionary, and somewhere in the LAION-Audio-630K training set there were enough storytelling clips to make "lying" a real text concept. Now I treat CLAP's confidence on weird labels as a Rorschach test for the training distribution.

An AI Model Who Reads Too Much Into Candidate Labels

Key Insight

Audio classification has four practical flavors: content, events, intent, keyword spotting, plus the related recognition tasks of speaker and language ID. Each flavor has a curated HuggingFace checkpoint that reduces inference to a one-line pipeline call (AST for events and KWS, XLS-R for intent, Whisper-derived heads for LangID). When no curated checkpoint fits, CLAP enables zero-shot classification of arbitrary text-described classes via contrastive language-audio pretraining with symmetric InfoNCE loss, chunk-and-fuse audio encoding, and T5 keyword-to-sentence text augmentation. When a closed vocabulary needs custom labels, fine-tune DistilHuBERT (or a larger backbone) with HuggingFace Trainer in roughly 60 lines: load the dataset, resample to 16 kHz, build the feature extractor, instantiate AutoModelForAudioClassification, define an accuracy metric, and train.

Exercise 20.0.5.1: Zero-Shot Audio Classification with CLAP

Objective. Classify environmental sounds without any task-specific training, using CLAP's text-audio joint embedding.

Task. Load laion/larger_clap_general with ClapModel.from_pretrained and ClapProcessor.from_pretrained. Pick five ESC-50 clips from five different classes (one each). Define a candidate label list of 10 text descriptions (5 correct + 5 plausible distractors phrased as "the sound of X"). For each clip, encode the audio and text candidates, compute cosine similarity, and report the top-1 predicted label. Measure accuracy across the five clips.

Stretch. Repeat with two phrasings of the same concept, e.g. "a dog barking" vs "the sound of a dog". Quantify how much the phrasing affects the top-1 score. This is the practical face of prompt engineering for CLAP.

What Comes Next

This concludes the foundational sub-section block (20.0 through 20.0.5). The reader now has the data representations (20.0.1), codec tokens (20.0.2), transformer architectures (20.0.3), self-supervised encoders (20.0.4), and classification recipes (20.0.5) that the rest of Chapter 20 takes for granted. The chapter now returns to its original generation-focused arc: Section 20.1 on text-to-speech (VITS, Bark, F5-TTS), Section 20.2 on voice cloning, Section 20.3 on music generation, Section 20.4 on audio editing, and Section 20.5 on the production Whisper deep dive. After that, Sections 20.6 through 20.10 pivot to video.

Further Reading
Wu, Y. et al. (2023). "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP 2023. The CLAP paper used in Code Fragment 20.0.5.5; the chunk-and-fuse and T5 augmentation tricks come from this work.
Guzhov, A. et al. (2022). "AudioCLIP: Extending CLIP to Image, Text and Audio." ICASSP 2022. The earlier trimodal extension of CLIP referenced in Section 20.0.5.4.
Piczak, K. J. (2015). "ESC: Dataset for Environmental Sound Classification." ACM Multimedia 2015. The 50-class environmental sound benchmark used in Code Fragment 20.0.5.4.
Tzanetakis, G. and Cook, P. (2002). "Musical Genre Classification of Audio Signals." IEEE Transactions on Speech and Audio Processing, 10(5), 293-302. The original GTZAN paper that established the 10-genre benchmark used in the Section 20.0.5.5 fine-tuning lab.
HuggingFace Audio Course, Chapter 4 (2024). "Pre-trained models for audio classification." The companion course that extends Code Fragment 20.0.5.6 with data augmentation, learning rate scheduling, and Weights & Biases logging.
Sturm, B. L. (2014). "The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval." Journal of New Music Research, 43(2), 147-172. The reference for GTZAN's known label errors and artifacts; cited in the Section 20.0.5.6 warning.