The transformers library is the central hub of the HuggingFace ecosystem. It provides a unified API for loading, configuring, and running inference with thousands of pretrained models spanning text, vision, audio, and multimodal tasks. This section walks through the three layers of abstraction the library offers: high-level pipelines for quick prototyping, AutoClasses for flexible model loading, and direct model/tokenizer access for full control.
1. The Pipeline API: Inference in One Line
The fastest way to use a pretrained model is through the pipeline() function. Pipelines bundle a model, a tokenizer, and task-specific pre/post-processing into a single callable object. You specify a task name, and the library selects a suitable default model from the Hub.
The following example creates pipelines for three common NLP tasks: sentiment analysis, named entity recognition, and text generation.
```python
from transformers import pipeline

# Sentiment analysis (default: distilbert-base-uncased-finetuned-sst-2-english)
classifier = pipeline("sentiment-analysis")
result = classifier("HuggingFace makes NLP accessible to everyone.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Yann LeCun works at Meta in New York.")
for ent in entities:
    print(f"{ent['word']:<15} {ent['entity_group']:<10} {ent['score']:.3f}")

# Text generation with a specific model
generator = pipeline("text-generation", model="gpt2", max_new_tokens=40)
output = generator("The future of AI is", do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```
These pipeline() calls cover sentiment analysis, NER with entity aggregation, and text generation with a specified model. Each pipeline handles tokenization, inference, and output formatting internally, making single-line inference possible. Pipelines support over 30 task types, including question-answering, summarization, translation, zero-shot-classification, image-classification, and automatic-speech-recognition; each task maps to a specific pipeline class that handles the input/output formatting appropriate for that task.
By default, pipelines run on CPU. Pass device=0 to place the model on the first GPU, or device="cuda" to use the default CUDA device. For Apple Silicon, use device="mps". With the accelerate library installed, you can instead pass device_map="auto" to let the library distribute a large model across the available devices automatically.
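The device choice above can be wrapped in a small helper so the same script runs on CUDA, Apple Silicon, or CPU. This is a convenience sketch; pick_device is not part of the transformers API:

```python
import torch

def pick_device() -> str:
    """Return a device string for pipeline(device=...).
    Preference order: CUDA GPU, Apple Silicon (MPS), CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

A pipeline can then be created with `pipeline("sentiment-analysis", device=pick_device())`.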
2. AutoClasses: Flexible Model and Tokenizer Loading
When you need more control than pipelines offer, AutoClasses provide the next level of abstraction. The two most important are AutoTokenizer and AutoModel (plus its task-specific variants). These classes inspect a model's configuration on the Hub and instantiate the correct architecture automatically.
The code below loads a tokenizer and a sequence classification model, then runs a forward pass to obtain logits.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "This library is incredibly well designed."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities
probs = torch.softmax(outputs.logits, dim=-1)
labels = model.config.id2label
for idx, prob in enumerate(probs[0]):
    print(f"{labels[idx]}: {prob:.4f}")
```
This example uses AutoTokenizer and AutoModelForSequenceClassification directly. Unlike pipelines, this approach gives direct access to the raw logits, enabling custom post-processing such as calibration or thresholding. The id2label mapping from the config converts class indices to human-readable names. The AutoModelFor* family includes task-specific heads; the most commonly used variants are listed in Figure K.1.1.
| AutoClass | Task | Output |
|---|---|---|
| AutoModelForCausalLM | Text generation (decoder-only) | Next-token logits |
| AutoModelForSeq2SeqLM | Translation, summarization (encoder-decoder) | Sequence logits |
| AutoModelForSequenceClassification | Sentiment, NLI, topic classification | Class logits |
| AutoModelForTokenClassification | NER, POS tagging | Per-token logits |
| AutoModelForQuestionAnswering | Extractive QA | Start/end logits |
| AutoModelForMaskedLM | Fill-mask (encoder-only) | Vocabulary logits |
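The task-to-class mapping in the table can be captured in a small lookup for scripts that choose an AutoClass by task name. This is a hypothetical convenience, not part of the transformers API:

```python
# Hypothetical lookup mirroring the table above; not part of transformers itself.
TASK_TO_AUTOCLASS = {
    "text-generation": "AutoModelForCausalLM",
    "translation": "AutoModelForSeq2SeqLM",
    "summarization": "AutoModelForSeq2SeqLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "fill-mask": "AutoModelForMaskedLM",
}

def autoclass_for(task: str) -> str:
    """Return the AutoClass name for a task, or raise for unknown tasks."""
    if task not in TASK_TO_AUTOCLASS:
        raise ValueError(f"No AutoClass mapping for task {task!r}")
    return TASK_TO_AUTOCLASS[task]

print(autoclass_for("question-answering"))  # AutoModelForQuestionAnswering
```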
3. Model Architectures: Encoder, Decoder, and Encoder-Decoder
Transformer models fall into three architectural families, each suited to different tasks. Understanding which architecture a model uses is essential for selecting the correct AutoClass and configuring inputs properly. For a deeper treatment of these architectures, see Chapter 3: Transformer Architecture.
Encoder-only models (BERT, RoBERTa, DeBERTa) process the full input bidirectionally and produce contextualized representations. They excel at classification, NER, and extractive QA. Use AutoModel or task-specific heads like AutoModelForSequenceClassification.
Decoder-only models (GPT-2, LLaMA, Mistral, Falcon) generate text autoregressively, attending only to preceding tokens. They are the foundation of modern conversational AI. Use AutoModelForCausalLM.
Encoder-decoder models (T5, BART, mBART) encode an input sequence and then decode an output sequence. They are well suited for translation, summarization, and any task with a clear input-to-output mapping. Use AutoModelForSeq2SeqLM.
The following example demonstrates loading one model from each family.
```python
from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)

# Encoder-only: BERT
enc_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_model = AutoModel.from_pretrained("bert-base-uncased")
enc_out = enc_model(**enc_tokenizer("Hello world", return_tensors="pt"))
print(f"Encoder hidden states shape: {enc_out.last_hidden_state.shape}")

# Decoder-only: GPT-2
dec_tokenizer = AutoTokenizer.from_pretrained("gpt2")
dec_model = AutoModelForCausalLM.from_pretrained("gpt2")
dec_out = dec_model.generate(
    **dec_tokenizer("Once upon a time", return_tensors="pt"),
    max_new_tokens=20,
    do_sample=True,
)
print(f"Generated: {dec_tokenizer.decode(dec_out[0], skip_special_tokens=True)}")

# Encoder-decoder: T5
s2s_tokenizer = AutoTokenizer.from_pretrained("t5-small")
s2s_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
s2s_out = s2s_model.generate(
    **s2s_tokenizer("translate English to French: Hello, how are you?", return_tensors="pt"),
    max_new_tokens=30,
)
print(f"Translation: {s2s_tokenizer.decode(s2s_out[0], skip_special_tokens=True)}")
```
This loads one model from each family; the AutoModel variant you choose must match the model's architecture type.

4. Model Configuration and Customization
Every model has an associated configuration object (AutoConfig) that stores architectural hyperparameters such as the number of layers, hidden size, number of attention heads, and vocabulary size. You can inspect or modify configuration before instantiating a model.
This example loads a configuration, modifies it, and creates a randomly initialized model with the new settings.
```python
from transformers import AutoConfig, AutoModelForCausalLM

# Load existing config
config = AutoConfig.from_pretrained("gpt2")
print(f"Original: {config.n_layer} layers, {config.n_head} heads, "
      f"hidden size {config.n_embd}")

# Create a smaller variant for experimentation
config.n_layer = 4
config.n_head = 4
config.n_embd = 256

# Instantiate a randomly initialized model with the modified config
small_model = AutoModelForCausalLM.from_config(config)
num_params = sum(p.numel() for p in small_model.parameters())
print(f"Custom model: {num_params / 1e6:.1f}M parameters")
```
This workflow uses AutoConfig. Reducing GPT-2 from 12 layers to 4 and the hidden size from 768 to 256 shrinks the model from GPT-2's 124M parameters to a small fraction of that size (the script prints the exact count), making it suitable for rapid experimentation. Note that from_config() creates a model with randomly initialized weights. This is useful for architecture experiments or training from scratch, but not for inference: always use from_pretrained() when you need a model with learned weights.
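The random-initialization behavior is easy to verify: two models built from the same configuration start with different weights. The sketch below uses a tiny, locally constructed GPT2Config so that no Hub download is needed:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# A tiny config constructed locally: no Hub access required.
cfg = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=1000, n_positions=128)
m1 = GPT2LMHeadModel(cfg)
m2 = GPT2LMHeadModel(cfg)

# The token embeddings of the two models differ, confirming that
# construction from a config initializes weights randomly, not from a checkpoint.
same = torch.equal(m1.transformer.wte.weight, m2.transformer.wte.weight)
print(same)
```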
5. Efficient Loading and Precision Control
Modern LLMs can be extremely large. The Transformers library provides several mechanisms for loading models efficiently, including reduced-precision formats, quantization, and memory-mapped loading. These techniques are essential for working with billion-parameter models on consumer hardware.
The following example shows how to load a large model with reduced precision and automatic device mapping.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

# Load in float16 with automatic device mapping across GPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # Half precision (saves ~50% memory vs float32)
    device_map="auto",           # Distribute across available devices
    low_cpu_mem_usage=True,      # Avoid peak memory during loading
)

# For an even smaller footprint, use 4-bit quantization (requires bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

print("FP16 memory: ~14 GB")
print("4-bit memory: ~4 GB")
```
The first load uses float16 with device_map="auto" for multi-GPU distribution; the second uses 4-bit quantization for consumer GPUs. The low_cpu_mem_usage=True flag avoids a temporary peak where both the full model and a loaded shard coexist in CPU memory.

| Precision | Bits per Parameter | 7B Model Size | Use Case |
|---|---|---|---|
| float32 | 32 | ~28 GB | Training (full precision) |
| float16 / bfloat16 | 16 | ~14 GB | Inference, mixed-precision training |
| int8 | 8 | ~7 GB | Inference with minimal quality loss |
| int4 (NF4) | 4 | ~4 GB | Inference on consumer GPUs; QLoRA fine-tuning |
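The sizes in the table follow from simple arithmetic: weight memory in bytes is parameters times bits per parameter divided by 8. A small helper makes this explicit (illustrative only; note it covers weights alone, not activations or the KV cache):

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB: params * bits / 8 bytes, divided by 1e9."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

For a 7B model this reproduces the table: 28 GB at float32, 14 GB at float16, 7 GB at int8, and 3.5 GB at int4. Actual runtime usage is higher because of activations and the KV cache.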
6. Inference Patterns and Generation Strategies
For causal language models, the generate() method provides a rich set of decoding strategies. Understanding these strategies is critical for controlling the quality, diversity, and determinism of generated text. For a thorough discussion of sampling methods, see Chapter 7: Text Generation and Decoding.
The example below demonstrates several generation strategies on the same prompt.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("The key to good AI is", return_tensors="pt").input_ids

# Greedy decoding: deterministic, often repetitive
greedy = model.generate(input_ids, max_new_tokens=30, do_sample=False)
print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))

# Top-k sampling: sample from the top 50 tokens at each step
topk = model.generate(input_ids, max_new_tokens=30, do_sample=True, top_k=50)
print("Top-k:", tokenizer.decode(topk[0], skip_special_tokens=True))

# Nucleus (top-p) sampling: sample from the smallest set whose cumulative
# probability exceeds p
topp = model.generate(input_ids, max_new_tokens=30, do_sample=True, top_p=0.92)
print("Top-p:", tokenizer.decode(topp[0], skip_special_tokens=True))

# Beam search: explore multiple hypotheses in parallel
beam = model.generate(
    input_ids,
    max_new_tokens=30,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print("Beam:", tokenizer.decode(beam[0], skip_special_tokens=True))
```
The beam-search call uses no_repeat_ngram_size=2 to prevent verbatim repetition. For factual and deterministic outputs (code generation, structured extraction), use greedy decoding or beam search. For creative and diverse outputs (story writing, brainstorming), use nucleus sampling with top_p between 0.9 and 0.95 and temperature between 0.7 and 1.0. For conversational agents, nucleus sampling with moderate temperature (0.6 to 0.8) typically gives the best balance of coherence and variety.
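These recommendations can be kept as reusable keyword sets and splatted into generate(). The preset names and exact values below are illustrative defaults chosen to match the guidance above, not library constants:

```python
# Illustrative decoding presets; use as model.generate(input_ids, **GENERATION_PRESETS[name]).
GENERATION_PRESETS = {
    "factual": {"do_sample": False, "num_beams": 5,
                "no_repeat_ngram_size": 2, "early_stopping": True},
    "creative": {"do_sample": True, "top_p": 0.92, "temperature": 0.9},
    "chat": {"do_sample": True, "top_p": 0.9, "temperature": 0.7},
}

print(sorted(GENERATION_PRESETS))
```

Keeping decoding parameters in one place makes it easy to audit and tune them per use case instead of scattering magic numbers through application code.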