Building Conversational AI with LLMs and Agents
Appendix K: HuggingFace: Transformers, Datasets, and Hub

Datasets and Tokenizers: Loading, Preprocessing, and Streaming

Big Picture

Before any model can learn, raw data must be loaded, cleaned, tokenized, and batched. HuggingFace provides two complementary libraries for this: datasets for efficient data loading and manipulation, and tokenizers for fast, flexible text tokenization. Together they form the data backbone of the HuggingFace ecosystem, handling everything from small CSV files to trillion-token web corpora. This section covers both libraries in depth, from basic usage to advanced streaming and preprocessing pipelines.

1. Loading Datasets from the Hub

The datasets library provides access to over 100,000 datasets hosted on the HuggingFace Hub through a single function: load_dataset(). Datasets are downloaded, cached locally, and stored in the efficient Apache Arrow format, which enables memory-mapped access and zero-copy reads. For a catalog of important pretraining and evaluation datasets, see Appendix J: Datasets, Benchmarks, and Leaderboards.

The following example loads a popular sentiment classification dataset and inspects its structure.

from datasets import load_dataset

# Load the IMDB movie review dataset
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 25000 })
#     test:  Dataset({ features: ['text', 'label'], num_rows: 25000 })
# })

# Inspect a single example
print(dataset["train"][0]["text"][:200])
print(f"Label: {dataset['train'][0]['label']}")  # 0 = negative, 1 = positive

# Access metadata
print(f"Features: {dataset['train'].features}")
print(f"Column names: {dataset['train'].column_names}")
DatasetDict({
    train: Dataset({ features: ['text', 'label'], num_rows: 25000 })
    test: Dataset({ features: ['text', 'label'], num_rows: 25000 })
})
I rented I AM CURIOUS-YELLOW from my video store because of all the temporary controversy that surrounded it when it was first released...
Label: 0
Features: {'text': Value(dtype='string'), 'label': ClassLabel(names=['neg', 'pos'])}
Column names: ['text', 'label']
Code Fragment 1: Loading the IMDB dataset from the Hub with load_dataset(). The returned DatasetDict contains train and test splits, each with text and label columns. The .features property reveals column types including ClassLabel with named categories.

You can also load datasets from local files in common formats. The library supports CSV, JSON, JSON Lines, Parquet, text files, and even pandas DataFrames.

from datasets import load_dataset

# From local CSV
csv_data = load_dataset("csv", data_files="reviews.csv")

# From local JSON Lines
jsonl_data = load_dataset("json", data_files="corpus.jsonl")

# From a directory of Parquet files
parquet_data = load_dataset("parquet", data_files="data/*.parquet")

# From a pandas DataFrame
import pandas as pd
df = pd.DataFrame({"text": ["Hello", "World"], "label": [1, 0]})
from datasets import Dataset
hf_dataset = Dataset.from_pandas(df)
Code Fragment 2: Loading datasets from local files in CSV, JSON Lines, Parquet, and pandas DataFrame formats. The data_files parameter supports glob patterns for loading multiple files from a directory.

2. Streaming Mode for Large Datasets

Many pretraining datasets are too large to fit on a single disk. Streaming mode solves this by fetching data on demand without downloading the full dataset. Streamed datasets behave like Python iterators: you consume examples one at a time, and the library handles network fetching and decompression transparently.

The following example streams a large web corpus and counts tokens without ever storing the full dataset locally.

from datasets import load_dataset

# Stream FineWeb-Edu: 1.3 trillion tokens, but we only download
# what we iterate over
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    split="train",
    streaming=True,
)

# Take the first 5 examples
for i, example in enumerate(stream):
    text_preview = example["text"][:80].replace("\n", " ")
    print(f"  [{i}] {text_preview}...")
    if i >= 4:
        break

# Streaming supports chaining operations
filtered_stream = (
    stream
    .filter(lambda x: len(x["text"]) > 500)       # Only long documents
    .shuffle(seed=42, buffer_size=1000)             # Shuffle in a buffer
    .take(100)                                       # Take 100 examples
)
examples = list(filtered_stream)
print(f"Collected {len(examples)} examples")
  [0] FineWeb-Edu is a dataset of educational web pages collected from the web...
  [1] Machine learning is a subset of artificial intelligence that focuses on...
  [2] The Python programming language was created by Guido van Rossum and first...
  [3] In statistics, a confidence interval is a range of values that is likely...
  [4] Neural networks are computing systems inspired by biological neural netw...
Collected 100 examples
Code Fragment 3: Streaming a 1.3-trillion-token dataset without downloading it. The streaming=True flag returns an iterable dataset; chained operations like .filter(), .shuffle(), and .take() execute lazily, fetching only the examples consumed by the iterator.
Streaming vs. Download: When to Use Which

Use streaming when you need to preview a dataset, train on a dataset larger than your disk, or apply a training loop that consumes data only once. Use full download when you need random access, multiple epochs over the data, or complex transformations that benefit from caching (like tokenization with .map()).

3. Transforming Data: Map, Filter, and Select

The datasets library provides functional-style operations for transforming data. The most important is .map(), which applies a function to every example (or batch of examples) and caches the result. Combined with .filter() and .select(), these operations form a complete preprocessing pipeline.

The example below demonstrates all three operations in a typical text classification workflow.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Map: tokenize all examples (batched for speed)
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )

tokenized = dataset.map(tokenize_fn, batched=True, batch_size=1000)
print(f"New columns: {tokenized.column_names}")
# ['text', 'label', 'input_ids', 'attention_mask']

# Filter: keep only reviews longer than 100 words
long_reviews = tokenized.filter(lambda x: len(x["text"].split()) > 100)
print(f"After filter: {len(long_reviews)} / {len(tokenized)}")

# Select: grab a specific subset by index
subset = tokenized.select(range(0, 1000))
print(f"Subset size: {len(subset)}")

# Set format for PyTorch (drops non-tensor columns)
subset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
print(f"Tensor shape: {subset[0]['input_ids'].shape}")
New columns: ['text', 'label', 'input_ids', 'attention_mask']
After filter: 21438 / 25000
Subset size: 1000
Tensor shape: torch.Size([256])
Code Fragment 4: Core dataset operations: .map() applies batched tokenization, .filter() selects examples by condition, .select() grabs indices, and .set_format("torch") converts columns to PyTorch tensors on access. Batched mapping with batch_size=1000 is significantly faster than per-example processing.
Batched Mapping for Performance

Always pass batched=True to .map() when possible. Batched mapping processes multiple examples at once, which is dramatically faster for operations like tokenization. You can control the batch size with batch_size and parallelize across CPU cores with num_proc. A typical call for large datasets looks like: dataset.map(fn, batched=True, num_proc=8).

4. Fast Tokenizers and the Tokenizers Library

The tokenizers library (which powers all "fast" tokenizers in Transformers) is a Rust-backed tokenization engine that is 10 to 100 times faster than pure-Python implementations. Every AutoTokenizer.from_pretrained() call returns a fast tokenizer by default when one is available. You can also use the tokenizers library directly to train custom tokenizers from scratch.
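Beyond raw speed, fast tokenizers track the character span of every token, which the label-alignment and answer-span code later in this section depends on. A minimal sketch using a toy WordLevel vocabulary (chosen here only to avoid a training step):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Toy three-word vocabulary; a real tokenizer would be trained or downloaded
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

enc = tok.encode("hello world")
print(enc.tokens)   # ['hello', 'world']
print(enc.offsets)  # [(0, 5), (6, 11)] -- character spans in the input
```

The offsets property is the same machinery exposed as return_offsets_mapping in Transformers, and it is only available from fast (Rust-backed) tokenizers.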

The following example trains a BPE tokenizer on a custom corpus.

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Configure the trainer
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
    show_progress=True,
)

# Train on text files
tokenizer.train(files=["corpus_part1.txt", "corpus_part2.txt"], trainer=trainer)

# Test the trained tokenizer
encoded = tokenizer.encode("HuggingFace makes NLP easy and accessible.")
print(f"Tokens: {encoded.tokens}")
print(f"IDs:    {encoded.ids}")

# Save for later use
tokenizer.save("my-tokenizer.json")
Tokens: ['H', 'ugging', 'Face', ' makes', ' NLP', ' easy', ' and', ' accessible', '.']
IDs:    [72, 15223, 9781, 1838, 24896, 2068, 362, 12110, 17]
Code Fragment 5: Training a custom BPE tokenizer with tokenizers. The BpeTrainer learns merge rules from a text corpus; the vocabulary and merge table are saved together in a single JSON file. The output shows subword splits where "HuggingFace" is decomposed into three tokens.

Once trained, a custom tokenizer can be loaded into the Transformers ecosystem for use with any model.

from transformers import PreTrainedTokenizerFast

# Wrap the custom tokenizer for use with Transformers
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

# Now usable with any HuggingFace model or Trainer
output = fast_tokenizer("Hello, world!", return_tensors="pt")
print(output.input_ids)
tensor([[ 1, 377, 2910, 14, 1835, 2]])
Code Fragment 6: Wrapping a custom tokenizer with PreTrainedTokenizerFast for compatibility with the Transformers ecosystem. Special tokens (BOS, EOS, PAD, UNK, MASK) must be explicitly specified. The wrapped tokenizer can then be used with any Trainer or model pipeline.
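The wrapper also accepts an in-memory tokenizer via the tokenizer_object parameter, and save_pretrained() / AutoTokenizer.from_pretrained() round-trip the result like any Hub tokenizer. A sketch using a toy WordLevel tokenizer as a stand-in for the trained BPE above:

```python
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Toy backing tokenizer; stands in for a trained BPE tokenizer
vocab = {"<unk>": 0, "hello": 1, "world": 2}
backing = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))
backing.pre_tokenizer = pre_tokenizers.Whitespace()

# tokenizer_object= is an alternative to tokenizer_file=
wrapped = PreTrainedTokenizerFast(tokenizer_object=backing, unk_token="<unk>")

# save_pretrained writes tokenizer.json plus a config that records the
# tokenizer class, so AutoTokenizer can reload it from the directory
with tempfile.TemporaryDirectory() as d:
    wrapped.save_pretrained(d)
    reloaded = AutoTokenizer.from_pretrained(d)

ids = reloaded("hello world")["input_ids"]
print(ids)  # [1, 2]
```

The same save_pretrained directory can be pushed to the Hub, making the custom tokenizer loadable by name like any published model.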

5. Preprocessing Pipelines for Specific Tasks

Different NLP tasks require different preprocessing strategies. The tokenizer must produce inputs that match the model's expected format. Below are three common patterns: sequence classification, token classification (NER), and extractive question answering.

5.1 Sequence Classification

For classification, each example maps to a single label. The preprocessing is straightforward: tokenize the text and keep the label.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("glue", "sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_classification(examples):
    return tokenizer(examples["sentence"], truncation=True, max_length=128)

processed = dataset.map(preprocess_classification, batched=True)
processed = processed.rename_column("label", "labels")
processed.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
Code Fragment 7: Preprocessing for sequence classification: tokenize with truncation, rename the label column to match the trainer's expectation, and set the output format to PyTorch tensors.
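Because this preprocessing truncates without padding, examples in a batch have different lengths; a data collator typically pads each batch dynamically to its longest member at training time. A sketch using DataCollatorWithPadding with a toy fast tokenizer (a stand-in for the distilbert tokenizer above):

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import DataCollatorWithPadding, PreTrainedTokenizerFast

# Toy fast tokenizer with an explicit pad token
vocab = {"<pad>": 0, "a": 1, "b": 2, "c": 3, "<unk>": 4}
backing = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))
backing.pre_tokenizer = pre_tokenizers.Whitespace()
tok = PreTrainedTokenizerFast(
    tokenizer_object=backing, pad_token="<pad>", unk_token="<unk>"
)

# Two variable-length examples; the collator pads the batch to length 3
collator = DataCollatorWithPadding(tokenizer=tok)
features = [tok("a b c"), tok("a")]
batch = collator(features)
print(batch["input_ids"].tolist())  # [[1, 2, 3], [1, 0, 0]]
```

Dynamic padding wastes far fewer tokens than padding="max_length", since each batch is only as wide as its longest example.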

5.2 Token Classification (NER)

Named entity recognition requires aligning token-level labels with subword tokens. When a word is split into multiple subword pieces, only the first piece receives the entity label; the rest are typically set to -100 so that the loss function ignores them.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("conll2003", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word = None
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)           # Special tokens
            elif word_id != prev_word:
                label_ids.append(labels[word_id]) # First subword
            else:
                label_ids.append(-100)           # Continuation subword
            prev_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

ner_data = dataset.map(align_labels, batched=True)
Code Fragment 8: NER preprocessing with subword label alignment. When a word splits into multiple tokens, only the first subword gets the entity label; subsequent pieces receive -100 so the cross-entropy loss ignores them during training.

5.3 Extractive Question Answering

Extractive QA models predict start and end positions within a context passage. The preprocessing must tokenize the question and context together, then map character-level answer spans to token positions.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("squad", split="train[:1000]")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def preprocess_qa(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    start_positions = []
    end_positions = []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answer = examples["answers"][tokenized["overflow_to_sample_mapping"][i]]
        if len(answer["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        # Find token positions that contain the answer span
        seq_ids = tokenized.sequence_ids(i)
        context_start = next(j for j, s in enumerate(seq_ids) if s == 1)
        context_end = len(seq_ids) - 1 - next(
            j for j, s in enumerate(reversed(seq_ids)) if s == 1
        )

        # If answer is outside the chunk, label as (0, 0)
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_tok = next(j for j in range(context_start, context_end + 1)
                             if offsets[j][0] <= start_char < offsets[j][1])
            end_tok = next(j for j in range(context_end, context_start - 1, -1)
                           if offsets[j][0] < end_char <= offsets[j][1])
            start_positions.append(start_tok)
            end_positions.append(end_tok)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

qa_data = dataset.map(
    preprocess_qa,
    batched=True,
    remove_columns=dataset.column_names,
)
Code Fragment 9: Extractive QA preprocessing maps character-level answer spans to token positions. The stride=128 parameter creates overlapping chunks for long contexts, and overflow_to_sample_mapping tracks which chunk belongs to which original example. Unanswerable chunks receive (0, 0) labels.
Long Contexts and Stride

When a context passage exceeds max_length, the tokenizer can split it into overlapping chunks using the stride parameter. Each chunk becomes a separate training example. The overflow_to_sample_mapping field tracks which original example each chunk came from. At inference time, you must aggregate predictions across chunks to find the best answer span.
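The chunking behavior is easiest to inspect with a toy tokenizer in which each word maps to a single id: successive windows hold max_length tokens and overlap by stride tokens. A minimal sketch:

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# Toy tokenizer: word "wN" maps to id N (stands in for a real subword tokenizer)
vocab = {f"w{i}": i for i in range(10)}
vocab["<unk>"] = 10
backing = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))
backing.pre_tokenizer = pre_tokenizers.Whitespace()
tok = PreTrainedTokenizerFast(tokenizer_object=backing, unk_token="<unk>")

# 8 tokens chunked into windows of 4 that overlap by 2
text = " ".join(f"w{i}" for i in range(8))
enc = tok(
    text,
    max_length=4,
    stride=2,
    truncation=True,
    return_overflowing_tokens=True,
)
for ids in enc["input_ids"]:
    print(ids)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
```

With a batch of inputs, enc["overflow_to_sample_mapping"] records the index of the original example each window came from, which is what the QA preprocessing above uses to look up the right answer annotation per chunk.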