Libraries & Frameworks

Section 25.2

"Hugging Face diffusers is the multimodal analog of transformers. They both started as research demos, and they both end as production infrastructure."

PipPip, Diffusers-Native AI Agent
Big Picture

Multimodal libraries (diffusers, transformers, mmaction2, ComfyUI, accelerate, peft) sit between raw GPU code and your LLM agent pipeline. This section catalogs the ones that have consolidated as 2026 standard tooling and tells you when each library earns a dependency line.

Prerequisites

This section assumes the Hugging Face transformers patterns from Section 12.1, the multimodal architectures from Section 22.1, and the diffusion-model fundamentals from Section 19.7.

Key Insight
The multimodal library stack rhymes with the text-LLM library stack

Hugging Face's diffusers is the multimodal analog of transformers from Module 12. The patterns: pipeline classes, model checkpoints from the Hub, schedulers that swap the way samplers swap in autoregressive decoding, ControlNet adapters that load with from_pretrained just as PEFT adapters do. The mental model from Section 14.1 (load model, build pipeline, run inference, swap pieces as needed) carries over. If you have used transformers, you can use diffusers in an afternoon.

Multimodal libraries split into the diffusion frameworks (image and video), the audio libraries (speech and music), and the multimodal-LLM toolkits.

25.2.1 Diffusion frameworks

25.2.2 Audio libraries

25.2.3 Multimodal-LLM toolkits

Note: One Auto-class, every VLM

LLaVA, InternVL, Qwen2-VL, Llama-Vision, Idefics, BLIP-2: all of them load through Hugging Face's AutoModelForVision2Seq (and the closely related AutoModelForVisionText2Text) with the same calling pattern you used for AutoModelForCausalLM in Module 12. The Auto-class family hides the per-architecture quirks behind a single contract: image plus text in, text out. If you can drive a text LM through the AutoModel interface, you can drive every modern VLM through the same interface.

Two 2024-25 fine-tuning toolkits for VLMs are worth flagging up front: mlx-vlm for Apple Silicon and unsloth-vision for single-GPU consumer hardware. Together they cover the "I want to fine-tune a VLM but don't have an H100" case that did not exist as a category before 2024.

What's Next?

In the next section, Section 25.3: Datasets & Benchmarks, we build on the material covered here.

Further Reading

Vision Libraries

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). ICML 2021. arXiv:2103.00020. The CLIP paper; the foundation of modern vision-language libraries.
Cherti, M., Beaumont, R., Wightman, R., et al. (2023). "Reproducible scaling laws for contrastive language-image learning" (OpenCLIP). CVPR 2023. arXiv:2212.07143. OpenCLIP: the reference open-source CLIP library.

Diffusion Libraries

Hugging Face (2024). "diffusers Library." huggingface.co/docs/diffusers. Reference diffusion-model library.
Rombach, R., et al. (2022). "Stable Diffusion." CVPR 2022. arXiv:2112.10752. The reference latent-diffusion paper.