Section 75.4: Beyond Text: LLMs as Universal Sequence Machines

"A language model doesn't care whether its tokens represent English words, amino acids, or musical notes. It just learns what comes next."
Frontier, Polymathic AI Agent

Big Picture

The transformer architecture was invented for machine translation, but its core mechanism (self-attention over sequences of tokens) is domain-agnostic. Researchers have discovered that by designing the right tokenizer for a new domain, they can repurpose the entire LLM training recipe. DNA bases become tokens. Amino acids become tokens. Musical notes, molecular structures, medical events, robot actions, and even weather grids all become token sequences. The result: a single architectural paradigm now powers breakthroughs across biology, chemistry, medicine, music, robotics, climate science, and finance.

Prerequisites

This section builds on tokenization fundamentals (BPE, vocabulary construction, subword merges) from Section 1.5, the Transformer architecture (self-attention, positional encoding, encoder-decoder) from Section 2.2, and the pretraining objectives (next-token prediction, masked language modeling) from Section 6.1.

75.4.1 The Universal Recipe

Here is the surprising part: researchers did not need to invent entirely new architectures for each of these domains. Every successful application of LLMs to non-text data follows a remarkably consistent three-step pattern:

1. Design a domain-specific tokenizer that converts raw data (nucleotide sequences, molecular graphs, audio waveforms, time series values) into a discrete or embeddable sequence of tokens.
2. Apply a standard transformer architecture, often directly repurposing T5, LLaMA, BERT, or GPT with minimal architectural changes.
3. Train with standard language modeling objectives: next-token prediction (autoregressive), masked token prediction (BERT-style), or cross-entropy over discrete targets.

The tokenizer is the creative step; everything downstream is borrowed from NLP. This section surveys the major domains, their tokenization strategies, and the key models driving each field.
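To make step 3 concrete, here is a minimal sketch that assumes the data has already been tokenized into integers. A count-based bigram model stands in for the transformer, and the helper names (`next_token_pairs`, `fit_bigram`, `cross_entropy`) and the note indices are illustrative, not taken from any library or dataset. The point is only that the training signal (predict the next token, score with cross-entropy) is identical regardless of what the integers represent.

```python
import numpy as np

def next_token_pairs(token_ids):
    """Shift a token sequence by one to get (current token, next token) pairs."""
    ids = np.asarray(token_ids)
    return ids[:-1], ids[1:]

def fit_bigram(token_ids, vocab_size, smoothing=1.0):
    """Count-based stand-in for a trained model: P(next | current), smoothed."""
    prev, nxt = next_token_pairs(token_ids)
    counts = np.full((vocab_size, vocab_size), smoothing)
    np.add.at(counts, (prev, nxt), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)

def cross_entropy(probs, token_ids):
    """Average negative log-likelihood of each next token, in nats."""
    prev, nxt = next_token_pairs(token_ids)
    return float(-np.log(probs[prev, nxt]).mean())

# The same code handles any domain once it is mapped to integers.
dna_vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
dna_ids = [dna_vocab[base] for base in "ATCGATCGATCG" * 20]
note_ids = [0, 2, 4, 5, 7, 5, 4, 2] * 20  # hypothetical note indices

dna_loss = cross_entropy(fit_bigram(dna_ids, 4), dna_ids)
note_loss = cross_entropy(fit_bigram(note_ids, 8), note_ids)
uniform_dna_loss = float(np.log(4))  # loss of a model that has learned nothing
```

Both losses fall below the uniform baseline because the toy sequences are highly predictable; a real model would be evaluated on held-out data rather than on its own training sequence.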

Key Insight

The tokenizer is the bridge between a new domain and the entire LLM ecosystem. Once you define how to convert your data into a token sequence, you inherit decades of NLP infrastructure: attention mechanisms, scaling laws, fine-tuning recipes, and evaluation frameworks.

Figure 75.4.1: The universal recipe behind LLM applications to non-text domains. Seven domain data sources (DNA, proteins, molecules, time series, audio, images, EHR) each pass through a domain-specific tokenizer that is the only creative engineering step. After tokenization, the data is just an integer sequence that the standard transformer stack (BERT, T5, LLaMA, GPT) consumes unchanged with conventional next-token or masked-LM objectives. The right column lists the NLP ecosystem (scaling laws, PEFT, vLLM, RLHF, HF tooling) that each new domain inherits for free. Concrete examples: Evo-2 uses 4 nucleotide tokens with a 1M-bp context; Chronos quantizes time-series values into 4096 bins; EnCodec produces about 75 tokens per second of audio.

75.4.1.1 Tokenization Strategies Across Domains

Fun Fact

The same transformer architecture has now been applied to text, code, images, audio, video, protein sequences, DNA, chemical molecules, time series, and chess. The recipe is almost always the same: design a tokenizer, plug in a transformer, train with next-token prediction. The tokenizer is the science; everything downstream is reuse.

The following table summarizes how different domains convert their raw data into sequences of discrete tokens that a transformer can process. For string-like domains (proteins, code, molecules, DNA), building a tokenizer generally takes four steps: (1) pass the raw domain text through a pre-tokenizer with domain-specific rules (splitting on amino acids, SMILES bonds, or codons); (2) train BPE or Unigram on the domain corpus to learn subword merges; (3) collect the resulting domain vocabulary; and (4) merge it with a base language model tokenizer so that the final tokenizer handles both natural language and domain sequences. The pre-tokenization rules are the critical creative step; everything after that is standard. Typical results: proteins use per-residue tokens, code uses keyword tokens, chemistry uses SMILES tokens, DNA uses k-mer tokens.

Table 75.4.1a: Tokenization Strategies by Domain (as of 2026).

Strategy	How It Works	Domains
Character/residue-level	Each atomic unit (nucleotide, amino acid, SMILES character) becomes one token	Protein (ESM), DNA (Evo), Molecules (MolGPT)
BPE / subword	Data-driven merging of frequent subsequences, identical to NLP subword tokenization	DNA (DNABERT-2), Code (StarCoder2), some protein models
Uniform quantization	Continuous values are scaled and binned into N discrete buckets (e.g., 256 or 4096 bins)	Time series (Chronos), Robot actions (RT-2, Gato), Game actions
Neural audio codec (RVQ)	A learned encoder compresses waveforms into multi-level discrete codebook tokens via residual vector quantization	Audio (AudioLM, SoundStorm), Music (MusicGen), Speech (VALL-E)
Visual tokenization (VQ-VAE)	Images encoded into a 2D grid of codebook indices, flattened to 1D sequence	Images (DALL-E, LlamaGen), Video (MAGVIT-2)
Spatial patching	Continuous spatial fields divided into local patches, each embedded as a token	Weather (Aurora, Pangu-Weather), Images (ViT)
Domain code vocabularies	Existing standardized codes (ICD-10, procedure codes, event types) used directly as tokens	Electronic Health Records (BEHRT), Finance (event sequences)
Row-as-token	Each data point (table row) becomes one token; the model is meta-trained on synthetic datasets and predicts in context	Tabular data (TabPFN)
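The spatial patching row of the table can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a gridded field stored as an (H, W, C) array; the function name `patchify`, the grid size, and the variable meanings are hypothetical. Real models such as ViT follow this step with a learned linear projection of each flattened patch, which is omitted here.

```python
import numpy as np

def patchify(field, patch):
    """Split an (H, W, C) array into non-overlapping patch tokens.

    Returns shape (num_patches, patch * patch * C), one row per token,
    with patches ordered left-to-right, then top-to-bottom.
    """
    h, w, c = field.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by the patch size")
    grid = field.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (patch rows, patch cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

# Hypothetical 32x64 grid with 3 variables per cell (say temperature, pressure, wind).
field = np.arange(32 * 64 * 3, dtype=np.float32).reshape(32, 64, 3)
tokens = patchify(field, patch=8)
print(tokens.shape)  # (32, 192): 4 x 8 patches, each flattened to 8 * 8 * 3 values
```

Unlike the quantization strategies in the table, these tokens stay continuous: each one is a vector that gets embedded, not an index into a vocabulary.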

Real-World Scenario: The Tokenizer Is All You Need

Who: A computational biology team at a pharmaceutical company exploring whether LLM architectures could predict protein function from amino acid sequences.

Situation: The team had access to a standard GPT-2 scale transformer and a dataset of 250 million protein sequences with functional annotations. They initially planned to design a custom architecture for protein modeling.

Problem: Building and validating a novel architecture would take months of engineering and GPU time. The team questioned whether the architecture itself was the key variable or whether the tokenization scheme mattered more.

Decision: Instead of modifying the architecture, they designed a domain-specific tokenizer that encoded amino acids, secondary structure markers, and conserved motifs as vocabulary tokens. They applied the same approach to two other internal projects: a time-series forecasting model (using quantized price bins as tokens) and a music generation prototype (using audio codec tokens).

Result: All three projects achieved competitive results using the same off-the-shelf transformer architecture with no structural modifications. The protein function predictor matched a custom architecture baseline within 2% accuracy. Total development time was 3 weeks per project instead of the 3 to 4 months estimated for custom architectures.

Lesson: For many non-text domains, the tokenizer does the heavy lifting. Investing engineering effort in a well-designed domain-specific tokenizer often yields better returns than modifying the transformer architecture itself.

75.4.2 Genomics: DNA Language Models

The genome is a sequence of four nucleotide bases (A, C, G, T), making it a natural fit for sequence modeling. DNA language models learn the "grammar" of genomes: regulatory elements, coding regions, splice sites, and evolutionary constraints.

75.4.2.1 Key Models

DNABERT (2021) pioneered the approach by treating overlapping k-mers (k=6) as tokens and applying BERT-style masked language modeling. DNABERT-2 (ICLR 2024) improved this by switching to BPE tokenization, which eliminated the information leakage problem of overlapping k-mers and reduced sequence length by approximately 5x.

The Nucleotide Transformer (Nature Methods, 2024, InstaDeep with NVIDIA and TUM) scaled to 2.5 billion parameters trained on multi-species genomes, achieving strong performance on variant effect prediction, promoter identification, and splice site detection.

The most ambitious effort is Evo-2 (Nature, 2025, Arc Institute/NVIDIA): a 40-billion parameter model trained on over 9 trillion nucleotides with a 1-million base-pair context window. Evo-2 uses the StripedHyena 2 architecture (a hybrid of attention and state-space layers) and operates at single-nucleotide resolution. It can predict BRCA1 variant pathogenicity with over 90% accuracy without fine-tuning, and has demonstrated the ability to generate functional DNA sequences.

75.4.2.2 Tokenization Deep Dive

Original DNABERT used overlapping 6-mers, producing a vocabulary of roughly 4,096 tokens. DNABERT-2 switched to Byte Pair Encoding, learning merges directly from genomic data. Evo models take the simplest approach: single-nucleotide resolution with a vocabulary of just 4 bases plus special tokens. The tradeoff is between sequence compression (BPE reduces length but loses resolution) and fine-grained modeling (single-nucleotide captures every mutation but requires very long contexts).

# Comparing DNA tokenization strategies
# Single-nucleotide vs k-mer vs BPE approaches
sequence = "ATCGATCGATCG" * 100 # 1200bp genomic fragment
# Strategy 1: Single nucleotide (Evo-style)
single_tokens = list(sequence)
print(f"Single nucleotide: {len(single_tokens)} tokens, vocab=4")
# Strategy 2: k-mer (original DNABERT, k=6)
kmers = [sequence[i:i+6] for i in range(len(sequence) - 5)]
print(f"6-mer (overlapping): {len(kmers)} tokens, vocab=~4096")
# Strategy 3: Non-overlapping k-mer
non_overlap = [sequence[i:i+6] for i in range(0, len(sequence) - 5, 6)]
print(f"6-mer (non-overlapping): {len(non_overlap)} tokens, vocab=~4096")

Output:
Single nucleotide: 1200 tokens, vocab=4
6-mer (overlapping): 1195 tokens, vocab=~4096
6-mer (non-overlapping): 200 tokens, vocab=~4096

Code Fragment 75.4.1b: Comparing DNA tokenization strategies
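The listing above covers single-nucleotide and k-mer tokenization but not the BPE approach that DNABERT-2 adopted. The following is a toy sketch of BPE merge learning on a nucleotide string, written from scratch for illustration; `learn_bpe` is a hypothetical helper, not DNABERT-2's tokenizer, and production tokenizers train on large corpora with more careful tie-breaking and pair counting.

```python
from collections import Counter

def learn_bpe(sequence, num_merges):
    """Learn BPE merges on a single string, starting from single characters."""
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            # Replace each occurrence of the chosen pair with one merged token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

sequence = "ATCGATCGATCG" * 100  # same 1200 bp toy fragment as the listing above
merges, bpe_tokens = learn_bpe(sequence, num_merges=8)
print(f"BPE: {len(bpe_tokens)} tokens after {len(merges)} merges")
```

Because this fragment is perfectly periodic, the merges compress it far more than real genomic sequence would allow; on real data the compression is more modest, in line with the roughly 5x reduction quoted for DNABERT-2.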

Key Takeaways

The transformer recipe generalizes beyond text. Tokenize any sequential data, apply the standard training pipeline, and the model learns domain-specific patterns.
Tokenization strategy is the critical design choice. Each domain (genomics, proteins, chemistry, music) requires domain-aware tokenization that preserves the right structural units.
Cross-domain transfer is an open frontier. Models trained on one modality sometimes transfer useful representations to others, but systematic approaches remain an active research area.

Key Insight

Neural audio codecs solve the fundamental dimensionality problem: they compress 24,000 samples/second of raw audio into roughly 75 tokens/second of discrete codes, making audio generation tractable for transformer-scale models. The multi-level codebook hierarchy captures different aspects of the signal: coarse codebooks encode semantic content (what is being said or played), while fine codebooks capture acoustic detail (voice timbre, recording quality).

Warning

EHR language models raise significant privacy and fairness concerns. Patient data is highly sensitive, models can encode and amplify healthcare disparities present in training data, and regulatory frameworks (HIPAA, GDPR) impose strict constraints on how these models can be trained and deployed. Federated learning and differential privacy are active areas of research for addressing these challenges.

Key Insight: The Tokenizer Is the Theory

In NLP, the tokenizer encodes linguistic assumptions (word boundaries, subword structure). In genomics, it encodes biological priors (k-mer frequencies, nucleotide resolution). In time series, it encodes statistical assumptions (quantization granularity, scaling). In every domain, the choice of tokenizer embodies a theory about what matters in the data. Getting the tokenizer right is often more important than architectural innovations.

Self-Check

1. You are working with satellite imagery time series. Design a tokenization strategy that would allow a transformer to forecast future images. What are the tradeoffs between spatial patching, temporal quantization, and neural codec approaches?

Show Answer

Three tokenization strategies and their trade-offs: spatial patching (each 16×16 pixel patch = one token) preserves local visual structure and matches Vision Transformer pretraining, but sees no motion. Temporal patching (a sequence of full frames sampled at intervals, one token per frame) captures seasonal and motion patterns but loses spatial detail. 3D tubelets (spatial patches × temporal stride) capture both at the cost of token count exploding quadratically with image size. For multi-year forecasting at coarse spatial resolution, 3D tubelets with aggressive downsampling typically win. For high-resolution change detection within a season, spatial patching with a separate temporal embedding wins.

2. A pharmaceutical company asks you to build a model that generates novel drug candidates. Would you use SMILES or SELFIES tokenization? What are the pros and cons of each?

Show Answer

Both encode molecules as strings, but they make different validity guarantees. SMILES is the chemistry standard (CCO for ethanol). Generative models often produce invalid SMILES (mismatched parentheses, unfilled valences) that must be filtered out at decoding time, which wastes compute and biases RL fine-tuning. SELFIES is designed so that every string over its alphabet decodes to a valid molecule; valence constraints are baked into the grammar. SMILES pros: more pretraining data exists, the format is human-readable to medicinal chemists, and tokenizer tooling is mature. SELFIES pros: a 100% validity rate at generation time, which is decisive for RL-based drug optimization; the cost is a notation that is harder for humans to read. Many generative drug-design papers adopt SELFIES specifically because the validity-by-construction property eliminates the validation rejection step.

3. Why might BPE tokenization work better than single-nucleotide tokenization for some genomics tasks, and worse for others?

Show Answer

Single-nucleotide tokenization treats each A/C/G/T as a token. It offers maximum granularity and never collapses information, which makes it ideal for variant calling and any task where a single base change matters clinically. BPE tokenization merges frequently co-occurring nucleotides into shared tokens (common k-mers in coding regions become single tokens). BPE works better when the task involves recognizing motifs, regulatory elements, or codon-level patterns: fewer tokens means longer effective context, so the model can attend across longer genomic regions. BPE works worse when the task involves rare variants or single-nucleotide effects, because a BPE merge could mask a single-base difference that matters. The choice mirrors the word-vs-character debate in NLP: granularity vs context length is a real trade-off, and the right answer is task-dependent.

Exercises

Exercise 32.10.1: LLMs as Universal Sequence Models (Conceptual)

Transformer-style architectures have been applied to DNA, proteins, music, time series, and electronic health records. (a) What property of these data types makes them amenable to LLM-style modeling? (b) Why is "just use GPT-4 with the raw bytes" not the right approach? (c) For each domain, what is the tokenizer-equivalent that has to be designed?

Answer Sketch

(a) All these data types are sequences with strong local and long-range structure that's hard to capture with hand-crafted features. The Transformer's self-attention is a general-purpose sequence learner that excels when the right "alphabet" is provided. (b) Frontier LLMs are trained mostly on natural-language text; their tokenizers waste capacity on non-text data (DNA bytes tokenize poorly into cl100k_base), their pretraining objective doesn't match the domain's structure, and the model's weights encode mostly irrelevant priors. (c) Tokenizer-equivalents: DNA = k-mers or byte-level nucleotides; proteins = amino-acid alphabet (20 letters); music = MIDI events or codec tokens (EnCodec); time series = quantized value bins or learned codes; EHR = medical event vocabularies (ICD-10, SNOMED).

Exercise 32.10.2: Predict the Transfer Gap (Predictive)

You take a 1B-parameter protein language model trained on 200M sequences and ask it to predict the structure of a brand-new viral protein. Predict: (a) where most of its predictive power comes from; (b) which kinds of mutations it will struggle with; (c) what additional ingredient brought AlphaFold2 to state-of-the-art that pure-LM approaches lacked.

Answer Sketch

(a) Most predictive power comes from learning evolutionary constraints implicitly: residues that co-vary across the protein family are likely in contact, and the masked-LM objective recovers this. (b) Struggle with mutations that are functional but rare in evolutionary data (designed proteins, lab-evolved variants, single-residue substitutions outside the trained distribution); also with intrinsically disordered regions where the structure isn't well-defined. (c) AlphaFold2 added: (i) explicit multiple-sequence-alignment features as input, (ii) a geometry-aware attention mechanism (invariant point attention) that respects 3D coordinates, (iii) a structural-recycling iterative refinement. Pure-LM approaches like ESM are weaker on structure prediction but stronger on function prediction and faster at inference.

Exercise 32.10.3: Sketch a Time-Series Tokenizer (Code Tweak)

Sketch a 10-line Python function that turns a 1D time series of floats into discrete tokens suitable for a Transformer, using simple binning. State the two parameters you must tune.

Answer Sketch

import numpy as np
def tokenize_ts(series, n_bins=256, robust=True):
  x = np.asarray(series)
  if robust:
    lo, hi = np.percentile(x, [1, 99])
  else:
    lo, hi = x.min(), x.max()
  edges = np.linspace(lo, hi, n_bins + 1)
  tokens = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
  return tokens.tolist(), edges

Code Fragment 75.4.2: Sketch a 10-line Python function that turns a 1D time series of floats into discrete tokens suitable for a Transformer.

The two parameters: (1) n_bins trades resolution for vocabulary size (values from a few hundred to a few thousand are typical; Chronos uses 4,096); (2) the binning strategy (linear, log, quantile). Quantile binning gives roughly uniform token frequencies, which is better for the LM training objective; linear is simpler but wastes bins on rare extreme values, which the robust flag above mitigates by clipping the range to the 1st and 99th percentiles. Learned tokenizers (vector quantization, residual quantization à la SoundStream) can compress sequences further, but simple schemes go a long way: Chronos itself uses mean scaling plus uniform binning (Section 75.4.5).
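For comparison with the linear binning in the answer sketch, here is a minimal sketch of the quantile variant. The function names and the synthetic heavy-tailed series are assumptions made for illustration; the edges would normally be fitted on training data only and then reused at inference time.

```python
import numpy as np

def fit_quantile_edges(series, n_bins=256):
    """Pick interior bin edges so each bin holds about the same number of points."""
    x = np.asarray(series, dtype=float)
    # n_bins - 1 interior cut points at evenly spaced quantiles.
    return np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])

def tokenize_quantile(series, edges):
    """Map each value to the index of its quantile bin (0 .. n_bins - 1)."""
    return np.searchsorted(edges, np.asarray(series, dtype=float), side="right")

rng = np.random.default_rng(0)
heavy_tailed = rng.standard_t(df=3, size=10_000)  # synthetic data with rare extremes
edges = fit_quantile_edges(heavy_tailed, n_bins=16)
tokens = tokenize_quantile(heavy_tailed, edges)
counts = np.bincount(tokens, minlength=16)  # how often each token ID occurs
```

Every token ID ends up with close to 10,000 / 16 = 625 occurrences, whereas linear bins over the same data would put most points in a handful of central bins. The price is that bin widths are no longer uniform, so decoding a token back to a value needs the stored edges.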

Exercise 32.10.4: Cross-Domain Failure Modes (Failure Mode)

List four ways the "LLMs are universal sequence machines" hypothesis breaks down in practice, and the implication for choosing the right architecture for your domain.

Answer Sketch

(1) Long sequences with sparse signal: a 1M-base genome sequence dwarfs typical LLM context windows, and most positions carry no functional signal. Implication: state-space models (Mamba), strided attention, or domain-specific sparse architectures often beat plain Transformers. (2) Continuous output structure: protein structure is 3D coordinates, not discrete tokens; predicting them as text loses geometric inductive bias. Implication: hybrid architectures with structural decoders. (3) Strict invariances: chemistry molecules are graphs; sequence order is arbitrary. Implication: graph neural networks or SMILES-canonicalization plus careful training to respect the invariance. (4) Tiny data regimes: a rare disease may have 200 patients in the world; LM-style training is data-hungry. Implication: heavy use of transfer learning, retrieval, and few-shot adaptation, sometimes with non-LLM Bayesian approaches as a backbone. The recurring lesson: Transformers are powerful but not free; pick the architecture by data structure, not by hype.

Library Shortcut: BioNeMo for Genomic Language Models

The Nucleotide Transformer and DNABERT-2 are available on Hugging Face. NVIDIA's BioNeMo platform provides pretrained genomic models with fine-tuning APIs. Evo-2 weights are released through the Arc Institute's GitHub.

Show code

# Embed a DNA sequence with Nucleotide Transformer via Hugging Face.
from transformers import AutoTokenizer, AutoModel
import torch

ckpt = "InstaDeepAI/nucleotide-transformer-2.5b-multi-species"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

dna = "ATGCGTACGTTAGCTAGCTAGCTAGCATCGATCG"
inputs = tokenizer(dna, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
# Mean-pool the last hidden state to get a fixed-size genomic embedding.
embedding = out.hidden_states[-1].mean(dim=1)

Code Fragment 75.4.6: Embed a DNA sequence with Nucleotide Transformer via Hugging Face.

75.4.3 Protein Language Models

Proteins are sequences of amino acids drawn from a 20-letter alphabet, making them one of the most natural non-text applications of language modeling. Protein language models learn evolutionary constraints, structural preferences, and functional signatures directly from sequence data.

75.4.3.1 The ESM Family

Meta's ESM-2 (2023) scaled masked language modeling on protein sequences to 15 billion parameters. The key insight: the internal representations learned by ESM-2 encode 3D structural information so accurately that ESMFold can predict protein structure from a single sequence, approaching AlphaFold2 accuracy without requiring multiple sequence alignments.

ESM-3 (2024, EvolutionaryScale) extended the paradigm to 98 billion parameters with multimodal conditioning on sequence, structure, and function simultaneously. In a widely cited result, ESM-3 designed a novel green fluorescent protein (esmGFP) with only about 58% sequence identity to the closest known fluorescent protein. The result is suggestive of a new kind of design capability, though the GFP example is a relatively well-characterized target where structural and functional priors are unusually rich; whether ESM-3-style generation transfers to therapeutically novel folds (de novo enzymes for new substrates, for instance) is still being established in independent labs.

75.4.3.2 Tokenization

Protein tokenization is straightforward: each of the 20 canonical amino acids maps to one token, with special tokens for masking, padding, and non-standard residues (total vocabulary of approximately 33 tokens). Some newer work explores BPE-style tokenization that merges frequent amino acid subsequences into multi-residue tokens, but single-residue tokenization remains dominant due to the biological significance of individual amino acid positions.

# Protein sequence embedding with ESM-2
# Each amino acid residue becomes a single token
# Example: insulin B-chain (30 residues)
sequence = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
print(f"Sequence length: {len(sequence)} residues")
print(f"Amino acid vocabulary: {sorted(set(sequence))}")
print(f"Unique residues used: {len(set(sequence))}/20 canonical")
# With ESM-2, each residue position gets a 5120-dim embedding (15B model)
# that encodes evolutionary, structural, and functional information.
embedding_dim = 5120 # ESM-2 (15B) hidden dimension
print(f"Embedding shape: ({len(sequence)}, {embedding_dim})")

Output:
Sequence length: 30 residues
Amino acid vocabulary: ['A', 'C', 'E', 'F', 'G', 'H', 'K', 'L', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'Y']
Unique residues used: 16/20 canonical
Embedding shape: (30, 5120)

Code Fragment 75.4.3: Protein sequence embedding with ESM-2
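The per-residue mapping itself is small enough to write by hand. The sketch below is illustrative only: the special-token names and ID assignments are hypothetical and do not reproduce the vocabulary file of ESM or any other released model, which should always be loaded from the model's own tokenizer.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues
SPECIALS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>"]

# Hypothetical ID assignment for illustration; real models ship their own vocabulary.
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode_protein(sequence):
    """One token per residue, wrapped in <cls> ... <eos>; unknown letters map to <unk>."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(residue, VOCAB["<unk>"]) for residue in sequence.upper()]
    ids.append(VOCAB["<eos>"])
    return ids

insulin_b = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
ids = encode_protein(insulin_b)
print(len(ids))  # 32: 30 residues plus the two boundary tokens
```

The two boundary tokens are why per-residue embeddings from such models typically have length L + 2 for a sequence of L residues.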

Library Shortcut: ESM for Protein Language Models

ESM models are available via Hugging Face Transformers (facebook/esm2_t36_3B_UR50D) and Meta's fair-esm library. EvolutionaryScale provides an API for ESM-3. Google DeepMind's AlphaFold3 (2024) extends structure prediction to protein-DNA-RNA-ligand complexes.

Show code

# Per-residue ESM-2 embeddings via fair-esm (the canonical entry point).
import torch, esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("insulin_b", "FVNQHLCGSHLVEALYLVCGERGFFYTPKT")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
residue_repr = out["representations"][33]  # (B, L+2, 1280)

Code Fragment 75.4.7: Per-residue ESM-2 embeddings via fair-esm (the canonical entry point).

75.4.4 Molecular Design: Chemistry as Language

Molecules can be represented as text strings using SMILES (Simplified Molecular Input Line Entry System), a linear encoding of molecular graphs. This insight lets researchers apply standard language models to drug discovery and molecular design.

For example, aspirin is represented as CC(=O)OC1=CC=CC=C1C(=O)O in SMILES notation. Each character or multi-character symbol becomes a token, and an autoregressive model can generate novel molecules by sampling token sequences.

75.4.4.1 Key Models

MolGPT (2021) applied GPT-style next-token prediction to SMILES strings for drug-like molecule generation. ChemBERTa-2 (2022, DeepChem) pretrained a RoBERTa model on 77 million SMILES samples for molecular property prediction. MoLFormer (2021, IBM Research) trained an efficient transformer with rotary positional embeddings on 1.1 billion unlabeled SMILES sequences.

A key challenge is that SMILES can produce syntactically invalid molecules. The SELFIES (Self-Referencing Embedded Strings) notation guarantees syntactic validity by construction, making it attractive for generative models where every sampled sequence must decode to a valid molecule.

# SMILES tokenization for molecular language models
# Each molecule becomes a sequence of character-level tokens
molecules = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "penicillin_g": "CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C",
    }
for name, smiles in molecules.items():
    # Character-level tokenization (simplest approach)
    tokens = list(smiles)
    print(f"{name:15s}: {len(smiles):3d} chars, "
        f"unique tokens: {len(set(tokens))}")

Output:
aspirin        :  24 chars, unique tokens: 6
caffeine       :  28 chars, unique tokens: 8
penicillin_g   :  47 chars, unique tokens: 10

Code Fragment 75.4.4: SMILES tokenization for molecular language models
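Character-level splitting breaks multi-character symbols such as Cl, Br, or bracket atoms like [nH] into meaningless pieces. A common remedy is a regular expression that keeps those symbols whole. The sketch below uses an atom-level pattern of the kind widely used in the molecular-transformer literature; the exact pattern, the `tokenize_smiles` helper, and the example molecules are assumptions for illustration and do not cover every SMILES extension.

```python
import re

# Atom-level SMILES pattern: bracket atoms and two-letter halogens stay whole
# instead of being split into individual characters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens

examples = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "chloroform": "ClC(Cl)Cl",
    "pyrrole": "c1cc[nH]c1",
}
tokenized = {name: tokenize_smiles(s) for name, s in examples.items()}
print(tokenized["chloroform"])  # ['Cl', 'C', '(', 'Cl', ')', 'Cl']
```

For aspirin the result matches character-level splitting, since it contains no multi-character symbols; the difference only shows up for molecules like chloroform, where a character tokenizer would otherwise confuse chlorine with carbon.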

Library Shortcut: RDKit for Molecular Language Models

RDKit handles SMILES parsing and molecular graph operations. DeepChem provides ChemBERTa models and molecular featurization. Hugging Face hosts multiple molecular language models. TorchDrug offers a PyTorch framework for drug discovery tasks.

Show code

# Canonicalize SMILES with RDKit, then embed with ChemBERTa-2.
from rdkit import Chem
from transformers import AutoTokenizer, AutoModel
import torch

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"           # aspirin
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

tok = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MTR")
model = AutoModel.from_pretrained("DeepChem/ChemBERTa-77M-MTR").eval()
with torch.no_grad():
    out = model(**tok(canonical, return_tensors="pt"))
mol_embedding = out.last_hidden_state.mean(dim=1)  # (1, 384)

Code Fragment 75.4.8: Canonicalize SMILES with RDKit, then embed with ChemBERTa-2.

75.4.5 Time Series Forecasting

Time series data (sensor readings, stock prices, energy consumption, weather observations) consists of continuous numerical values sampled at regular intervals. The key insight that enabled LLM-based forecasting was treating time series prediction as a classification problem over quantized bins.

75.4.5.1 Chronos: Regression via Classification

Amazon's Chronos (March 2024) pioneered a remarkably simple approach: scale the time series by its absolute mean, then quantize values into 4,096 uniformly spaced bins between -15 and +15. Each bin index becomes a discrete token. A T5-based encoder-decoder model is then trained with standard cross-entropy loss to predict the next token (bin) in the sequence.

Chronos-2 (October 2025) extended this approach to multivariate time series with group attention and covariate-informed forecasting.

75.4.5.2 Other Approaches

TimeGPT (2023, Nixtla) was trained on over 100 billion data points across finance, healthcare, weather, and IoT domains. Lag-Llama (2024) uses a decoder-only transformer with lagged features for probabilistic forecasting. TimesFM (2024, Google) patches consecutive time points into groups and processes them as tokens. Time-LLM (ICLR 2024) reprograms general LLMs like LLaMA for time series without full retraining.

# Chronos-style time series tokenization
# Continuous values -> scaled -> quantized into discrete bins
import numpy as np
# Simulated daily temperature readings (Celsius)
temperatures = np.array([18.2, 19.1, 17.8, 20.5, 22.1, 21.3, 19.7,
    18.5, 23.0, 24.2, 22.8, 21.0, 19.5, 18.0])
# Step 1: Scale by absolute mean (Chronos normalization)
abs_mean = np.abs(temperatures).mean()
scaled = temperatures / abs_mean
print(f"Abs mean: {abs_mean:.2f}, Scaled range: [{scaled.min():.3f}, {scaled.max():.3f}]")
# Step 2: Quantize into N bins between [-15, +15]
n_bins = 4096
bin_edges = np.linspace(-15, 15, n_bins + 1)
tokens = np.digitize(scaled, bin_edges) - 1 # 0-indexed bin IDs
tokens = np.clip(tokens, 0, n_bins - 1)
print(f"Token IDs (first 7): {tokens[:7]}")
print(f"Token range: [{tokens.min()}, {tokens.max()}] out of {n_bins} bins")

Output:
Abs mean: 20.41, Scaled range: [0.872, 1.186]
Token IDs (first 7): [2169 2175 2167 2185 2195 2190 2179]
Token range: [2167, 2209] out of 4096 bins

Code Fragment 75.4.5: Chronos-style time series tokenization
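Forecasting also needs the reverse direction: predicted bin indices must be mapped back to real values. A minimal sketch of that round trip follows, assuming each bin is represented by its center and that the scale computed during encoding is stored alongside the tokens. The helper names are hypothetical, and the released Chronos implementation differs in details such as special tokens and exact bin placement.

```python
import numpy as np

N_BINS, LOW, HIGH = 4096, -15.0, 15.0
EDGES = np.linspace(LOW, HIGH, N_BINS + 1)
CENTERS = (EDGES[:-1] + EDGES[1:]) / 2  # representative value for each bin

def encode(values):
    """Mean-scale, then map each value to a uniform bin index."""
    values = np.asarray(values, dtype=float)
    scale = np.abs(values).mean()
    scale = scale if scale > 0 else 1.0  # guard against an all-zero series
    tokens = np.clip(np.digitize(values / scale, EDGES) - 1, 0, N_BINS - 1)
    return tokens, scale

def decode(tokens, scale):
    """Map bin indices back to real values via bin centers and the saved scale."""
    return CENTERS[np.asarray(tokens)] * scale

temperatures = np.array([18.2, 19.1, 17.8, 20.5, 22.1, 21.3, 19.7])
tokens, scale = encode(temperatures)
recovered = decode(tokens, scale)
max_error = np.abs(recovered - temperatures).max()
```

The round-trip error is bounded by half a bin width times the scale, here well under 0.1 °C, which is the resolution price paid for turning regression into classification.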

Library Shortcut: Chronos for Time Series Forecasting

Amazon's chronos-forecasting package provides pretrained models on Hugging Face. Nixtla offers TimeGPT through its hosted API and nixtla SDK, and its open-source neuralforecast library covers other neural forecasting models. Lag-Llama is available on Hugging Face for probabilistic forecasting.

Show code

# Zero-shot forecasting with Amazon Chronos (T5 backbone, discrete bin tokens).
import torch
from chronos import ChronosPipeline

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map=device,
    torch_dtype=torch.bfloat16,
)
history = torch.tensor([18.2, 19.1, 17.8, 20.5, 22.1, 21.3, 19.7,
                        18.5, 23.0, 24.2, 22.8, 21.0, 19.5, 18.0])
# The context is passed positionally; predict returns sampled trajectories.
forecast = pipeline.predict(history, prediction_length=7)
# forecast shape: (num_series, num_samples, prediction_length).
# Select the single series, then take the median over samples.
median = forecast[0].median(dim=0).values

Code Fragment 75.4.9: Zero-shot forecasting with Amazon Chronos (T5 backbone, discrete bin tokens).

75.4.6 Audio and Music: Neural Codec Tokens

Raw audio is a continuous waveform sampled at 16,000 to 48,000 Hz, far too high-dimensional for direct token prediction. The breakthrough was neural audio codecs (SoundStream, EnCodec) that compress waveforms into discrete token sequences using residual vector quantization (RVQ).
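The core of residual vector quantization fits in a few lines. The sketch below is a toy under stated assumptions: the codebooks are random rather than learned, a single 8-dimensional vector stands in for one encoder output frame, and row 0 of each codebook is set to zero so that, in this toy, adding a level can never make the reconstruction worse. Real codecs such as SoundStream and EnCodec learn the encoder, decoder, and codebooks jointly.

```python
import numpy as np

def rvq_encode(vector, codebooks):
    """Quantize `vector` with a stack of codebooks, each coding the previous residual."""
    residual = np.asarray(vector, dtype=float)
    indices = []
    for codebook in codebooks:  # each codebook has shape (num_codes, dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        best = int(distances.argmin())
        indices.append(best)
        residual = residual - codebook[best]  # what the next level must explain
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected code vectors from every level."""
    return sum(codebook[i] for i, codebook in zip(indices, codebooks))

rng = np.random.default_rng(0)
dim, num_codes, num_levels = 8, 64, 4
codebooks = []
for level in range(num_levels):
    # Random codes with shrinking scale stand in for learned ones (illustration only).
    codebook = rng.normal(scale=0.5 ** level, size=(num_codes, dim))
    codebook[0] = 0.0  # toy safeguard: "add nothing" is always an option
    codebooks.append(codebook)

frame = rng.normal(size=dim)  # one hypothetical encoder output frame
codes = rvq_encode(frame, codebooks)
errors = [
    float(np.linalg.norm(frame - rvq_decode(codes[:k], codebooks[:k])))
    for k in range(1, num_levels + 1)
]
```

Each frame yields one index per level, so `codes` has four entries here, and `errors` shows the reconstruction error after using only the first k levels. This is the sense in which early codebooks carry the coarse signal and later ones refine it.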

Google's AudioLM (2023) demonstrated that a language model trained on these codec tokens can generate coherent speech continuations (maintaining speaker identity and prosody) and even piano music, all without any text transcripts. SoundStorm (2023) improved generation speed by two orders of magnitude using non-autoregressive parallel decoding.

Meta's MusicGen (2023) generates music conditioned on text descriptions using EnCodec tokens with 4 codebooks at 32kHz. The model (available in 300M, 1.5B, and 3.3B parameter sizes) handles the hierarchical structure of music (melody, harmony, rhythm, timbre) through a single autoregressive transformer over interleaved codebook tokens.
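One way to feed several codebooks to a single autoregressive model is a delay pattern, in which codebook k is shifted right by k steps so that at every step the model emits one token per codebook while finer levels still see the coarser tokens of the same frame. The sketch below illustrates that idea under this assumption; the `PAD` value and function names are hypothetical and not taken from the MusicGen code.

```python
import numpy as np

PAD = -1  # hypothetical placeholder ID for positions with no token yet

def apply_delay_pattern(codes):
    """Shift codebook k right by k steps: (K, T) codes -> (K, T + K - 1) grid."""
    num_codebooks, num_frames = codes.shape
    delayed = np.full(
        (num_codebooks, num_frames + num_codebooks - 1), PAD, dtype=codes.dtype
    )
    for k in range(num_codebooks):
        delayed[k, k:k + num_frames] = codes[k]
    return delayed

def undo_delay_pattern(delayed, num_frames):
    """Inverse of apply_delay_pattern: realign every codebook to frame 0."""
    return np.stack([row[k:k + num_frames] for k, row in enumerate(delayed)])

codes = np.arange(4 * 6).reshape(4, 6)  # 4 codebooks, 6 frames of toy codec indices
delayed = apply_delay_pattern(codes)
print(delayed)
```

With 4 codebooks the sequence grows by only 3 steps, compared with a 4x increase if all codebook tokens were flattened into one long stream.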

Microsoft's VALL-E (2023) demonstrated zero-shot voice cloning from just 3 seconds of reference audio by treating text-to-speech as a codec language modeling problem.

75.4.7 Electronic Health Records: Medical Events as Tokens

A patient's medical history is a sequence of clinical events: diagnoses (coded as ICD-10), procedures, medications, lab results, and visits. BEHRT (2020, University of Oxford) pioneered treating this sequence as a "document" in which each clinical code is a "word," each visit is a "sentence," and the full patient history is the document itself.

BEHRT applies BERT-style masked prediction to learn which diagnoses, procedures, and medications tend to co-occur and follow each other. It predicts 301 conditions simultaneously, with an 8 to 13% improvement in average precision over prior deep EHR models.
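
The "visits as sentences" framing and the masked-prediction objective can be sketched as follows. This is a simplified illustration rather than the BEHRT code: the patient history is made up, the special-token names follow BERT conventions, and the additional inputs the published model uses (such as age) are omitted.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def flatten_visits(visits):
    """Turn a list of visits (each a list of codes) into one token sequence."""
    tokens, visit_ids = [CLS], [0]
    for i, visit in enumerate(visits):
        for code in visit:
            tokens.append(code)
            visit_ids.append(i)
        tokens.append(SEP)          # a separator closes each visit "sentence"
        visit_ids.append(i)
    return tokens, visit_ids

def mask_tokens(tokens, mask_prob, rng):
    """Hide a random subset of clinical codes; labels keep the hidden originals."""
    masked, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)      # the model is trained to recover this code
        else:
            masked.append(tok)
            labels.append(None)     # no loss at unmasked positions
    return masked, labels

# Made-up patient history with ICD-10-style codes, for illustration only.
visits = [["I10", "E11.9"], ["I25.10"], ["I50.9", "N18.3"]]
tokens, visit_ids = flatten_visits(visits)
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

The visit_ids list plays the role of a segment or visit-position signal, telling the model which codes were recorded together.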

Hi-BEHRT (2023) added hierarchical attention to handle multimodal longitudinal data (lab values, free-text notes, procedure codes). Multimodal BEHRT (2025) extends the framework to include clinical features, lab results, procedures, and free-text reports, with applications in cancer prognosis.

75.4.8 Robotics: Actions as Tokens

A robot's action space (joint angles, gripper commands, navigation waypoints) can be discretized into bins and expressed as token sequences, enabling vision-language models to control physical robots through language-style generation.

Google DeepMind's RT-2 (2023) fine-tuned a large vision-language model on robot demonstration data. Each action dimension (x, y, z translation, rotation, gripper state) is discretized into 256 bins, converted to integer strings, and appended to the model's token stream. The model generates actions by predicting the next tokens, just as it would generate text.
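
The discretization step can be sketched as below. The 256 bins come from the description above; the seven-dimensional action layout, the per-dimension bounds, and the function names are assumptions made for the example, not details of the RT-2 implementation.

```python
import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low, high):
    """Map each continuous action dimension to an integer bin, joined as a string."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)            # each dimension now in [0, 1]
    bins = np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)
    return " ".join(str(b) for b in bins)

def tokens_to_action(text, low, high):
    """Decode the integer string back to continuous values at the bin centers."""
    bins = np.array([int(t) for t in text.split()])
    return low + (bins + 0.5) / NUM_BINS * (high - low)

# Hypothetical bounds: six motion dimensions in [-1, 1], gripper in [0, 1].
low = np.array([-1.0] * 6 + [0.0])
high = np.array([1.0] * 7)
action = np.array([0.1, -0.25, 0.0, 0.5, -1.0, 1.0, 1.0])

text = action_to_tokens(action, low, high)   # string appended to the token stream
decoded = tokens_to_action(text, low, high)  # what the robot controller receives
```

Decoding to bin centers bounds the round-trip error by half a bin width per dimension, which is the precision given up in exchange for a text-like action vocabulary.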

OpenVLA (June 2024, Stanford) released a 7-billion parameter open-source vision-language-action model trained on the Open X-Embodiment dataset (over 1 million episodes across 22 different robot embodiments). DeepMind's Gato (2022) demonstrated the ultimate generalist approach: a single 1.2B transformer that plays 46 Atari games, controls robots, captions images, and chats, all by serializing every modality into a single flat token stream.

For deeper coverage of vision-language-action models and robotic planning, see Section 32.1: Embodied Multimodal Agents and Section 32.3: LLM-Powered Robotics.

75.4.9 Other Frontiers

Biology, chemistry, healthcare, and robotics each have well-developed tokenization stories. Several other domains have begun adopting the same template, often with surprising speed. We close the survey with a rapid tour of weather, theorem proving, tabular data, finance, and game playing. In several of these areas, models built by reframing the data as tokens are now competitive with, or surpassing, decades of domain-specific machinery.

75.4.9.1 Weather and Climate

Weather prediction models tokenize gridded atmospheric fields into spatial patches (analogous to Vision Transformer patches). Aurora (Nature, 2025, Microsoft Research) uses Perceiver-style cross-attention to compress pressure-level data into latent tokens processed by a 3D Swin Transformer. GraphCast (Science, 2023, Google DeepMind) instead uses a graph neural network that passes messages over a multi-scale icosahedral mesh. These models now match or exceed the accuracy of traditional numerical weather prediction systems at a fraction of the computational cost.
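
Patch tokenization of a gridded field can be sketched with plain array reshaping. The grid size, number of variables, and patch size below are illustrative assumptions and do not correspond to any particular model's configuration.

```python
import numpy as np

def patchify(field, patch):
    """Split a (channels, lat, lon) grid into flattened non-overlapping patches."""
    c, h, w = field.shape
    if h % patch or w % patch:
        raise ValueError("grid dimensions must be divisible by the patch size")
    x = field.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)             # (lat_patches, lon_patches, c, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

# Toy input: 3 atmospheric variables on a 32 x 64 latitude-longitude grid.
field = np.arange(3 * 32 * 64, dtype=np.float32).reshape(3, 32, 64)
tokens = patchify(field, patch=8)              # one row per spatial patch
```

Each row would then be projected to the model dimension by a learned linear layer, exactly as in a Vision Transformer.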

75.4.9.2 Mathematical Theorem Proving

Formal proof languages (Lean 4, Isabelle, Coq) are tokenized with standard BPE, and LLMs generate proof tactics autoregressively. AlphaProof (Google DeepMind, Nature 2025) achieved a silver medal at the International Mathematical Olympiad 2024 by generating formally verified Lean proofs. DeepSeek-Prover-V2 (2025) and Harmonic Aristotle (2025) have pushed further, with Aristotle achieving gold-medal level performance on 2025 IMO problems.

75.4.9.3 Tabular Data

TabPFN (University of Freiburg) treats tabular prediction as in-context learning: the labeled training rows and the unlabeled test rows are passed to the model together, each table cell receives its own representation, and the model applies two-way attention (across features within a row, and across rows for the same feature). TabPFN-2.5 (2025) scales to approximately 50,000 data points and 2,000 features, achieving performance competitive with gradient-boosted trees (XGBoost, LightGBM) in a single forward pass with no hyperparameter tuning.

75.4.9.4 Financial Sequences

Kronos (2025) developed a specialized tokenizer for financial candlestick (OHLCV) data, pretrained autoregressively on price/volume sequences. FinGPT (open-source) takes a data-centric approach to financial LLMs. The challenge in financial applications is that market microstructure creates complex temporal dependencies that differ fundamentally from the statistical patterns in natural language.

75.4.9.5 Game Playing and Decision Making

The Decision Transformer (2021, Google Brain/UC Berkeley) reframed reinforcement learning as sequence modeling: condition on desired return, past states, and past actions, then predict the next action. The Multi-Game Decision Transformer (2022) trained a single model that achieves 126% of human-level performance across 41 Atari games, with performance scaling predictably with model size.
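
The conditioning signal is usually called the return-to-go: the sum of rewards still to be collected from a given timestep. A minimal sketch of how a trajectory might be serialized is shown below; the toy trajectory and the tuple-based representation are assumptions for illustration, since a real model embeds each element rather than storing tagged tuples.

```python
import numpy as np

def returns_to_go(rewards):
    """Undiscounted sum of rewards from each timestep to the end of the episode."""
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(rewards, states, actions):
    """Interleave (return-to-go, state, action) triples into one flat sequence."""
    rtg = returns_to_go(np.asarray(rewards, dtype=float))
    seq = []
    for r, s, a in zip(rtg, states, actions):
        seq += [("return", float(r)), ("state", s), ("action", a)]
    return seq

# Toy four-step trajectory, made up for the example.
rewards = [0.0, 1.0, 0.0, 2.0]
states = ["s0", "s1", "s2", "s3"]
actions = ["left", "right", "right", "jump"]
sequence = build_sequence(rewards, states, actions)
```

At inference time the first return token is set to the desired total return, and the model is asked to predict the action that follows each state.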

75.4.10 The Unifying Framework

Across many of these domains, a consistent pattern emerges. The headline innovation is often the tokenization strategy that bridges domain-specific data to the general-purpose sequence modeling framework, more than any change to the transformer block itself. The "transformers work well everywhere" framing should be read as an empirical observation on the domains where the recipe has been tried, not as a proof that no other architecture would do better. AlphaFold's geometry-aware attention, GraphCast's mesh encoding, and Mamba-style state-space variants for million-base genomes all hint that the "universal transformer" framing is partially a survivorship effect: we mostly hear about the cases where it worked.

This has profound implications for practitioners:

If you have sequential data in any domain, consider whether a tokenizer + transformer approach could work. The barrier to entry has never been lower.
Domain expertise matters more than ML expertise for the tokenization step. A biologist who understands k-mer frequencies will design a better genomics tokenizer than an ML researcher who does not.
Transfer learning across domains is emerging. Models like Gato demonstrate that a single transformer can process tokens from multiple domains simultaneously, suggesting that shared representations across modalities may be learnable.
Scaling laws appear to transfer. The power-law relationship among compute, data, and loss (a straight line on log-log axes) that the Chinchilla study characterized for text LLMs has also been reported for protein models (ESM scaling), genomics models (Evo scaling), and time series models (Chronos scaling).
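
A scaling relationship of this kind is typically checked by fitting a straight line in log-log space. The sketch below uses synthetic, noise-free points generated from a made-up power law; the constants are not measurements from any model, and the simple form omits the irreducible-loss term that published fits usually include.

```python
import numpy as np

# Synthetic data only: losses generated from an assumed law L = a * C^(-b).
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOPs (made up)
true_a, true_b = 20.0, 0.05                          # hypothetical constants
loss = true_a * compute ** (-true_b)

# A power law is linear in log-log space: log L = log a - b * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
b_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate the fitted line to a larger compute budget.
predicted_loss = a_hat * 1e23 ** (-b_hat)
```

With real training runs the points are noisy and the fit is only trustworthy within, or slightly beyond, the range of compute budgets actually measured.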

Research Frontier

Whole-genome foundation models

Whole-genome foundation models like Evo-2 can now process sequences of over 1 million base pairs, approaching chromosome-scale context. Open questions include whether these models can learn long-range regulatory interactions (enhancer-promoter loops spanning 100kb+), and whether they can be fine-tuned for clinical variant interpretation at scale. The Caduceus model (2024) explores bidirectional DNA modeling using the Mamba architecture, suggesting that state-space models may be better suited than transformers for the ultra-long sequences typical in genomics.

Cross-domain foundation models and learned tokenizers

Cross-domain foundation models are the next frontier. Can a single model trained on text, protein sequences, molecular SMILES, and genomic data develop shared representations that transfer between domains? Early evidence from models like Gato and multimodal ESM-3 suggests yes, but the optimal architecture for multi-domain sequence modeling remains an open question. Another active area is tokenizer learning: instead of hand-designing domain tokenizers, can we learn optimal tokenization end-to-end as part of the training objective?

What's Next?

Having surveyed how transformers have become universal sequence machines across domains from genomics to robotics, we turn to the broader societal implications. In Chapter 47: Safety, Ethics & Regulation, we examine how these capabilities reshape workforce dynamics, governance challenges, and the long-term trajectory of artificial intelligence.

Further Reading

Genomics

Zhou, Z., et al. (2024). "DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome." ICLR 2024. Introduces byte-pair encoding tokenization for DNA sequences, replacing fixed k-mer approaches and achieving state-of-the-art on genomic benchmarks across species. Shows how NLP tokenization innovations transfer directly to biology.

Dalla-Torre, H., et al. (2024). "The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics." Nature Methods. Trains transformer models on 3,202 human genomes and 850 genomes from diverse species, achieving strong performance on downstream tasks. Demonstrates that the pretraining paradigm scales to genomic data at the multi-species level.

Brixi, G., et al. (2025). "Genome modeling and design across all domains of life with Evo 2." Nature. Models of 7 and 40 billion parameters, the larger trained on 9.3 trillion nucleotide tokens spanning all domains of life. Represents the current frontier of genomic foundation models, capable of generating functional DNA sequences.

Proteins

Lin, Z., et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science. Shows that protein language models trained on evolutionary sequences alone can predict 3D structure, rivaling specialized structure prediction methods. Demonstrates that self-supervised pretraining captures deep biological knowledge.

Hayes, T., et al. (2024). "Simulating 500 million years of evolution with a language model." bioRxiv (ESM-3). Scales protein language models to jointly reason over sequence, structure, and function, generating novel fluorescent proteins not found in nature. The most ambitious protein generation result at the time of writing.

Molecules

Bagal, V., et al. (2021). "MolGPT: Molecular Generation Using a Transformer-Decoder Model." J. Chem. Inf. Model. Applies GPT-style autoregressive generation to SMILES molecular strings, producing valid and drug-like molecules. An early and influential demonstration that text generation architectures work for molecular design.

Ahmad, W., et al. (2022). "ChemBERTa-2: Towards Chemical Foundation Models." arXiv:2209.01712. Applies masked language modeling to chemical SMILES representations, creating foundation models for molecular property prediction. Extends the BERT pretraining paradigm to chemistry.

Time Series

Ansari, A. F., et al. (2024). "Chronos: Learning the Language of Time Series." arXiv:2403.07815. Tokenizes real-valued time series into discrete bins and trains a language model on them, achieving strong zero-shot forecasting. A clean demonstration of the "tokenize anything" philosophy discussed in this section.

Rasul, K., et al. (2024). "Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting." arXiv:2310.08278. Builds a foundation model for time series using a decoder-only transformer with lagged features. Shows that the LLM pretraining recipe transfers to probabilistic forecasting.

Jin, M., et al. (2024). "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." ICLR 2024. Reprograms frozen text LLMs to process time series through learned input transformations, avoiding expensive retraining. Demonstrates that existing language model knowledge can be leveraged for entirely non-linguistic domains.

Audio and Music

Borsos, Z., et al. (2023). "AudioLM: a Language Modeling Approach to Audio Generation." IEEE/ACM TASLP. Treats audio generation as language modeling over discrete audio tokens, producing speech and music with remarkable coherence. The foundational work that established the codec language model paradigm for audio.

Copet, J., et al. (2023). "Simple and Controllable Music Generation." NeurIPS 2023 (MusicGen). Generates high-quality music from text descriptions using a single-stage transformer over compressed audio tokens. Demonstrates that text-conditioned generation works across creative audio domains.

Wang, C., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv:2301.02111 (VALL-E). Achieves zero-shot voice cloning from a 3-second sample by treating speech synthesis as conditional language modeling over codec tokens. A breakthrough in applying the in-context learning paradigm to speech.

EHR and Medical

Li, Y., et al. (2020). "BEHRT: Transformer for Electronic Health Records." Scientific Reports. Applies BERT-style pretraining to sequences of medical codes from electronic health records, predicting future diagnoses. An early and influential example of treating patient histories as a "language" for transformers.

Robotics and Actions

Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818. Directly outputs robot actions as text tokens from a vision-language model, transferring web-scale knowledge to physical manipulation. Demonstrates the strongest evidence that language model pretraining aids robotic reasoning.

Kim, M., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246. Provides an open-source 7B vision-language-action model for robotic manipulation, making frontier robotic AI accessible to the research community. The leading open alternative to proprietary robotic foundation models.

Reed, S., et al. (2022). "A Generalist Agent." arXiv:2205.06175 (Gato). The first model to handle text, images, and robot actions within a single transformer, playing games, chatting, and stacking blocks. A proof of concept for the "one model, many modalities" vision.

Tabular, Weather, and Mathematics

Hollmann, N., et al. (2025). "Accurate predictions on small data with a tabular foundation model." Nature (TabPFN-2). A transformer foundation model for tabular prediction that achieves strong performance with no gradient-based training on new datasets. Challenges the assumption that tabular data requires specialized algorithms like gradient boosting.

Bodnar, C., et al. (2025). "Aurora: A Foundation Model of the Atmosphere." Nature. Applies transformer-based foundation models to weather prediction, matching or exceeding specialized numerical weather models. Demonstrates that the foundation model paradigm works for complex physical simulations.

Hubert, T., et al. (2025). "Olympiad-level formal mathematical reasoning with reinforcement learning." Nature (AlphaProof). Combines language models with formal proof verification to solve International Mathematical Olympiad problems. Represents the frontier of AI mathematical reasoning, connecting natural language and formal logic.