Building Conversational AI with LLMs and Agents
Appendix K: HuggingFace: Transformers, Datasets, and Hub

Transformers Library: Models, Pipelines, and AutoClasses

Big Picture

The transformers library is the cornerstone of the HuggingFace ecosystem. It provides a unified API for loading, configuring, and running inference with thousands of pretrained models spanning text, vision, audio, and multimodal tasks. This section walks through the three layers of abstraction the library offers: high-level pipelines for quick prototyping, AutoClasses for flexible model loading, and direct model/tokenizer access for full control.

1. The Pipeline API: Inference in One Line

The fastest way to use a pretrained model is through the pipeline() function. Pipelines bundle a model, a tokenizer, and task-specific pre/post-processing into a single callable object. You specify a task name, and the library selects a suitable default model from the Hub.

The following example creates pipelines for three common NLP tasks: sentiment analysis, named entity recognition, and text generation.

from transformers import pipeline

# Sentiment analysis (default: distilbert-base-uncased-finetuned-sst-2-english)
classifier = pipeline("sentiment-analysis")
result = classifier("HuggingFace makes NLP accessible to everyone.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Yann LeCun works at Meta in New York.")
for ent in entities:
    print(f"  {ent['word']:<15} {ent['entity_group']:<10} {ent['score']:.3f}")

# Text generation with a specific model
generator = pipeline("text-generation", model="gpt2", max_new_tokens=40)
output = generator("The future of AI is", do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
[{'label': 'POSITIVE', 'score': 0.9998}]
  Yann LeCun      PER        0.998
  Meta            ORG        0.994
  New York        LOC        0.997
The future of AI is likely to be shaped by a combination of advances in hardware, algorithms, and data availability that together...
Code Fragment 1: Three pipeline() calls covering sentiment analysis, NER with entity aggregation, and text generation with a specified model. Each pipeline handles tokenization, inference, and output formatting internally, making single-line inference possible for 30+ task types.

Pipelines support over 30 task types including question-answering, summarization, translation, zero-shot-classification, image-classification, and automatic-speech-recognition. Each task maps to a specific pipeline class that handles the input/output formatting appropriate for that task.

Pipeline Device Placement

By default, pipelines run on the CPU. Pass device=0 to place the model on the first GPU (any integer selects that GPU index), or device="cuda" to use the default CUDA device. For Apple Silicon, use device="mps". For large models, you can instead pass device_map="auto" (which requires the accelerate package) to let the library distribute the model across all available devices automatically.
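The selection rules above can be sketched as a small helper. This is illustrative code, not part of transformers: pick_device is a hypothetical function, and in practice its inputs would come from torch's availability checks, as shown in the comments.

```python
# Illustrative helper (not a transformers API): choose a pipeline device
# string from availability flags, following the rules described above.
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device argument suitable for pipeline()."""
    if cuda_available:
        return "cuda"  # default CUDA device; pass an integer for a specific GPU
    if mps_available:
        return "mps"   # Apple Silicon
    return "cpu"       # fallback

# In practice the flags would come from torch, e.g.:
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
#   classifier = pipeline("sentiment-analysis", device=device)
print(pick_device(False, True))  # mps
```

Keeping the decision in one place makes it easy to extend (for example, to prefer a specific GPU index on multi-GPU machines).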

2. AutoClasses: Flexible Model and Tokenizer Loading

When you need more control than pipelines offer, AutoClasses provide the next level of abstraction. The two most important are AutoTokenizer and AutoModel (plus its task-specific variants). These classes inspect a model's configuration on the Hub and instantiate the correct architecture automatically.

The code below loads a tokenizer and a sequence classification model, then runs a forward pass to obtain logits.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "This library is incredibly well designed."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities
probs = torch.softmax(outputs.logits, dim=-1)
labels = model.config.id2label
for idx, prob in enumerate(probs[0]):
    print(f"  {labels[idx]}: {prob:.4f}")
  NEGATIVE: 0.0003
  POSITIVE: 0.9997
Code Fragment 2: Manual inference with AutoTokenizer and AutoModelForSequenceClassification. Unlike pipelines, this approach gives direct access to raw logits, enabling custom post-processing such as calibration or thresholding. The id2label mapping from the config converts indices to human-readable class names.

The AutoModelFor* family includes task-specific heads. The most commonly used variants are listed in Figure K.1.1.

AutoClass                            Task                                           Output
AutoModelForCausalLM                 Text generation (decoder-only)                 Next-token logits
AutoModelForSeq2SeqLM                Translation, summarization (encoder-decoder)   Sequence logits
AutoModelForSequenceClassification   Sentiment, NLI, topic classification           Class logits
AutoModelForTokenClassification      NER, POS tagging                               Per-token logits
AutoModelForQuestionAnswering        Extractive QA                                  Start/end logits
AutoModelForMaskedLM                 Fill-mask (encoder-only)                       Vocabulary logits
Figure K.1.1: Common AutoModel task-specific classes and their outputs.
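The task-to-class mapping in Figure K.1.1 can be expressed as a small lookup. The dictionary and helper below are illustrative, not a transformers API; the class names follow the figure, and the task keys are informal labels chosen here.

```python
# Illustrative lookup (not a transformers API): map a task label to the
# AutoModel class name you would import, following Figure K.1.1.
TASK_TO_AUTOCLASS = {
    "text-generation": "AutoModelForCausalLM",
    "summarization": "AutoModelForSeq2SeqLM",
    "translation": "AutoModelForSeq2SeqLM",
    "text-classification": "AutoModelForSequenceClassification",
    "ner": "AutoModelForTokenClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "fill-mask": "AutoModelForMaskedLM",
}

def auto_class_for(task: str) -> str:
    """Return the AutoModel class name for a task, or raise on unknown tasks."""
    try:
        return TASK_TO_AUTOCLASS[task]
    except KeyError:
        raise ValueError(f"No AutoClass mapping for task: {task}")

print(auto_class_for("ner"))  # AutoModelForTokenClassification
```

A lookup like this is handy in scripts that load different checkpoints per task, since picking the wrong AutoClass fails only at load time.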

3. Model Architectures: Encoder, Decoder, and Encoder-Decoder

Transformer models fall into three architectural families, each suited to different tasks. Understanding which architecture a model uses is essential for selecting the correct AutoClass and configuring inputs properly. For a deeper treatment of these architectures, see Chapter 3: Transformer Architecture.

Encoder-only models (BERT, RoBERTa, DeBERTa) process the full input bidirectionally and produce contextualized representations. They excel at classification, NER, and extractive QA. Use AutoModel or task-specific heads like AutoModelForSequenceClassification.

Decoder-only models (GPT-2, LLaMA, Mistral, Falcon) generate text autoregressively, attending only to preceding tokens. They are the foundation of modern conversational AI. Use AutoModelForCausalLM.

Encoder-decoder models (T5, BART, mBART) encode an input sequence and then decode an output sequence. They are well suited for translation, summarization, and any task with a clear input-to-output mapping. Use AutoModelForSeq2SeqLM.

The following example demonstrates loading one model from each family.

from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)

# Encoder-only: BERT
enc_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_model = AutoModel.from_pretrained("bert-base-uncased")
enc_out = enc_model(**enc_tokenizer("Hello world", return_tensors="pt"))
print(f"Encoder hidden states shape: {enc_out.last_hidden_state.shape}")

# Decoder-only: GPT-2
dec_tokenizer = AutoTokenizer.from_pretrained("gpt2")
dec_model = AutoModelForCausalLM.from_pretrained("gpt2")
dec_out = dec_model.generate(
    **dec_tokenizer("Once upon a time", return_tensors="pt"),
    max_new_tokens=20,
    do_sample=True,
)
print(f"Generated: {dec_tokenizer.decode(dec_out[0], skip_special_tokens=True)}")

# Encoder-decoder: T5
s2s_tokenizer = AutoTokenizer.from_pretrained("t5-small")
s2s_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
s2s_out = s2s_model.generate(
    **s2s_tokenizer("translate English to French: Hello, how are you?", return_tensors="pt"),
    max_new_tokens=30,
)
print(f"Translation: {s2s_tokenizer.decode(s2s_out[0], skip_special_tokens=True)}")
Encoder hidden states shape: torch.Size([1, 4, 768])
Generated: Once upon a time, there was a young girl who lived in a small village...
Translation: Bonjour, comment allez-vous?
Code Fragment 3: Loading one model from each transformer family. BERT (encoder-only) produces hidden states for each token, GPT-2 (decoder-only) generates text autoregressively, and T5 (encoder-decoder) translates between languages. The AutoModel variant must match the architecture type.

4. Model Configuration and Customization

Every model has an associated configuration object (AutoConfig) that stores architectural hyperparameters such as the number of layers, hidden size, number of attention heads, and vocabulary size. You can inspect or modify configuration before instantiating a model.

This example loads a configuration, modifies it, and creates a randomly initialized model with the new settings.

from transformers import AutoConfig, AutoModelForCausalLM

# Load existing config
config = AutoConfig.from_pretrained("gpt2")
print(f"Original: {config.n_layer} layers, {config.n_head} heads, "
      f"hidden size {config.n_embd}")

# Create a smaller variant for experimentation
config.n_layer = 4
config.n_head = 4
config.n_embd = 256

# Instantiate a randomly initialized model with modified config
small_model = AutoModelForCausalLM.from_config(config)
num_params = sum(p.numel() for p in small_model.parameters())
print(f"Custom model: {num_params / 1e6:.1f}M parameters")
Original: 12 layers, 12 heads, hidden size 768
Custom model: 11.2M parameters
Code Fragment 4: Creating a custom model variant by modifying AutoConfig. Reducing GPT-2 from 12 layers to 4 and the hidden size from 768 to 256 produces an 11.2M parameter model suitable for rapid experimentation. Note that from_config() creates randomly initialized weights.
Random Initialization vs. Pretrained Weights

from_config() creates a model with random weights. This is useful for architecture experiments or training from scratch, but not for inference. Always use from_pretrained() when you need a model with learned weights.

5. Efficient Loading and Precision Control

Modern LLMs can be extremely large. The Transformers library provides several mechanisms for loading models efficiently, including reduced-precision formats, quantization, and memory-mapped loading. These techniques are essential for working with billion-parameter models on consumer hardware.

The following example shows how to load a large model with reduced precision and automatic device mapping.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

# Load in float16 with automatic device mapping across GPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # Half precision (saves ~50% memory vs. float32)
    device_map="auto",           # Distribute across available devices
    low_cpu_mem_usage=True,      # Avoid peak memory during loading
)

# For an even smaller footprint, use 4-bit NF4 quantization
# (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

print(f"FP16 memory: ~14 GB")
print(f"4-bit memory: ~4 GB")
FP16 memory: ~14 GB
4-bit memory: ~4 GB
Code Fragment 5: Loading Mistral-7B in float16 with device_map="auto" for multi-GPU distribution, then in 4-bit NF4 quantization for consumer GPUs. The low_cpu_mem_usage=True flag avoids a loading peak in which a randomly initialized copy of the model and the incoming pretrained weights would coexist in CPU memory.
Precision            Bits per Parameter   7B Model Size   Use Case
float32              32                   ~28 GB          Training (full precision)
float16 / bfloat16   16                   ~14 GB          Inference, mixed-precision training
int8                 8                    ~7 GB           Inference with minimal quality loss
int4 (NF4)           4                    ~4 GB           Inference on consumer GPUs; QLoRA fine-tuning
Figure K.1.2: Precision formats and approximate memory requirements for a 7-billion-parameter model.
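The sizes in Figure K.1.2 follow from simple arithmetic: parameters times bits per parameter, divided by 8 for bytes. The sketch below makes that explicit; estimate_gb is an illustrative helper (not a transformers API), and it counts weights only, ignoring activations, the KV cache, and optimizer state.

```python
# Back-of-envelope weight-memory estimate behind Figure K.1.2.
# Counts weights only; activations, KV cache, and optimizer state are extra.
def estimate_gb(num_params: float, bits_per_param: int) -> float:
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB, as in the table

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:<8} -> ~{estimate_gb(7e9, bits):.0f} GB")
```

The same formula explains why halving precision halves the model's resident size: the parameter count is fixed, so memory scales linearly with bits per parameter.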

6. Inference Patterns and Generation Strategies

For causal language models, the generate() method provides a rich set of decoding strategies. Understanding these strategies is critical for controlling the quality, diversity, and determinism of generated text. For a thorough discussion of sampling methods, see Chapter 7: Text Generation and Decoding.

The example below demonstrates several generation strategies on the same prompt.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("The key to good AI is", return_tensors="pt").input_ids

# Greedy decoding: deterministic, often repetitive
greedy = model.generate(input_ids, max_new_tokens=30, do_sample=False)
print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))

# Top-k sampling: sample from the top 50 tokens at each step
topk = model.generate(input_ids, max_new_tokens=30, do_sample=True, top_k=50)
print("Top-k:", tokenizer.decode(topk[0], skip_special_tokens=True))

# Nucleus (top-p) sampling: sample from smallest set whose cumulative
# probability exceeds p
topp = model.generate(input_ids, max_new_tokens=30, do_sample=True, top_p=0.92)
print("Top-p:", tokenizer.decode(topp[0], skip_special_tokens=True))

# Beam search: explore multiple hypotheses in parallel
beam = model.generate(
    input_ids,
    max_new_tokens=30,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print("Beam:", tokenizer.decode(beam[0], skip_special_tokens=True))
Greedy: The key to good AI is to make sure that the AI is able to learn from its mistakes. The AI is able to...
Top-k: The key to good AI is building systems that understand the nuances of human intent and adapt to new situations...
Top-p: The key to good AI is not just raw intelligence, but the ability to collaborate with humans in meaningful ways...
Beam: The key to good AI is the ability to learn from experience and adapt to new situations in real time.
Code Fragment 6: Four generation strategies on the same prompt. Greedy decoding is deterministic but repetitive. Top-k and top-p (nucleus) sampling introduce diversity by restricting the candidate set at each step. Beam search explores multiple hypotheses in parallel, with no_repeat_ngram_size=2 preventing verbatim repetition.
Choosing a Generation Strategy

For factual and deterministic outputs (code generation, structured extraction), use greedy decoding or beam search. For creative and diverse outputs (story writing, brainstorming), use nucleus sampling with top_p between 0.9 and 0.95 and temperature between 0.7 and 1.0. For conversational agents, nucleus sampling with moderate temperature (0.6 to 0.8) typically gives the best balance of coherence and variety.
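To make the candidate-set restriction behind top_p concrete, here is a toy sketch of nucleus filtering on a hand-written next-token distribution. The nucleus function is illustrative only (it is not the transformers implementation, and the probabilities are chosen as exact binary fractions to keep the example deterministic): it keeps the smallest set of highest-probability tokens whose cumulative mass reaches top_p, then renormalizes before sampling would occur.

```python
# Toy nucleus (top-p) filtering: keep the smallest set of top tokens whose
# cumulative probability reaches top_p, then renormalize. Illustrative only.
def nucleus(probs: dict, top_p: float) -> dict:
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in items:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

dist = {"the": 0.5, "a": 0.25, "robots": 0.125,
        "cheese": 0.0625, "xylophone": 0.0625}
print(sorted(nucleus(dist, 0.875)))  # ['a', 'robots', 'the']
```

With top_p=0.875, the low-probability tail ("cheese", "xylophone") is cut before sampling, which is exactly how nucleus sampling keeps diversity while suppressing unlikely continuations.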