Part X: Frontiers
Chapter 34: Emerging Architectures

Beyond Text: LLMs as Universal Sequence Machines

"A language model doesn't care whether its tokens represent English words, amino acids, or musical notes. It just learns what comes next."

Frontier, Polymathic AI Agent
Big Picture

The transformer architecture was invented for machine translation, but its core mechanism (self-attention over sequences of tokens) is domain-agnostic. Researchers have discovered that by designing the right tokenizer for a new domain, they can repurpose the entire LLM training recipe. DNA bases become tokens. Amino acids become tokens. Musical notes, molecular structures, medical events, robot actions, and even weather grids all become token sequences. The result: a single architectural paradigm now powers breakthroughs across biology, chemistry, medicine, music, robotics, climate science, and finance.

Prerequisites

This section builds on tokenization fundamentals (BPE, vocabulary construction, subword merges) from Section 2.1, the transformer architecture (self-attention, positional encoding, encoder-decoder) from Section 4.1, and pretraining objectives (next-token prediction, masked language modeling) from Section 6.1.

1. The Universal Recipe

Here is the surprising part: researchers did not need to invent entirely new architectures for each of these domains. Every successful application of LLMs to non-text data follows a remarkably consistent three-step pattern:

  1. Design a domain-specific tokenizer that converts raw data (nucleotide sequences, molecular graphs, audio waveforms, time series values) into a discrete or embeddable sequence of tokens.
  2. Apply a standard transformer architecture, often directly repurposing T5, LLaMA, BERT, or GPT with minimal architectural changes.
  3. Train with standard language modeling objectives: next-token prediction (autoregressive), masked token prediction (BERT-style), or cross-entropy over discrete targets.

The tokenizer is the creative step; everything downstream is borrowed from NLP. This section surveys the major domains, their tokenization strategies, and the key models driving each field.
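The three-step recipe can be sketched in a few lines. The toy "domain" below is a protein sequence with one token per amino acid; the vocabulary and example sequence are illustrative, not taken from any particular model.

```python
# Step 1: domain-specific tokenizer (here, one token per amino-acid residue).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence):
    return [vocab[aa] for aa in sequence]

# Steps 2-3 are domain-agnostic: a decoder-only transformer would be trained
# with ordinary cross-entropy on (context, next-token) pairs like these.
ids = tokenize("MKTAYIAK")
pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]

print(ids)          # token IDs for the sequence
print(pairs[0])     # first training pair: predict token ids[1] from ids[:1]
```

Nothing after `tokenize` knows (or needs to know) that the tokens are amino acids rather than English subwords.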

Key Insight

The tokenizer is the bridge between a new domain and the entire LLM ecosystem. Once you define how to convert your data into a token sequence, you inherit decades of NLP infrastructure: attention mechanisms, scaling laws, fine-tuning recipes, and evaluation frameworks.

2. Tokenization Strategies Across Domains

The following table summarizes how different domains convert their raw data into sequences of discrete tokens that a transformer can process.

Tokenization Strategies by Domain
| Strategy | How It Works | Domains |
| --- | --- | --- |
| Character/residue-level | Each atomic unit (nucleotide, amino acid, SMILES character) becomes one token | Protein (ESM), DNA (Evo), Molecules (MolGPT) |
| BPE / subword | Data-driven merging of frequent subsequences, identical to NLP subword tokenization | DNA (DNABERT-2), Code (StarCoder2), some protein models |
| Uniform quantization | Continuous values are scaled and binned into N discrete buckets (e.g., 256 or 4096 bins) | Time series (Chronos), Robot actions (RT-2, Gato), Game actions |
| Neural audio codec (RVQ) | A learned encoder compresses waveforms into multi-level discrete codebook tokens via residual vector quantization | Audio (AudioLM, SoundStorm), Music (MusicGen), Speech (VALL-E) |
| Visual tokenization (VQ-VAE) | Images encoded into a 2D grid of codebook indices, flattened to a 1D sequence | Images (DALL-E, LlamaGen), Video (MAGVIT-2) |
| Spatial patching | Continuous spatial fields divided into local patches, each embedded as a token | Weather (Aurora, Pangu-Weather), Images (ViT) |
| Domain code vocabularies | Existing standardized codes (ICD-10, procedure codes, event types) used directly as tokens | Electronic health records (BEHRT), Finance (event sequences) |
| Row-as-token | Each data point (table row) becomes a token via meta-learning over synthetic datasets | Tabular data (TabPFN) |
Real-World Scenario: The Tokenizer Is All You Need

Who: A computational biology team at a pharmaceutical company exploring whether LLM architectures could predict protein function from amino acid sequences.

Situation: The team had access to a standard GPT-2 scale transformer and a dataset of 250 million protein sequences with functional annotations. They initially planned to design a custom architecture for protein modeling.

Problem: Building and validating a novel architecture would take months of engineering and GPU time. The team questioned whether the architecture itself was the key variable or whether the tokenization scheme mattered more.

Decision: Instead of modifying the architecture, they designed a domain-specific tokenizer that encoded amino acids, secondary structure markers, and conserved motifs as vocabulary tokens. They applied the same approach to two other internal projects: a time-series forecasting model (using quantized price bins as tokens) and a music generation prototype (using audio codec tokens).

Result: All three projects achieved competitive results using the same off-the-shelf transformer architecture with no structural modifications. The protein function predictor matched a custom architecture baseline within 2% accuracy. Total development time was 3 weeks per project instead of the 3 to 4 months estimated for custom architectures.

Lesson: For many non-text domains, the tokenizer does the heavy lifting. Investing engineering effort in a well-designed domain-specific tokenizer often yields better returns than modifying the transformer architecture itself.

3. Genomics: DNA Language Models

The genome is a sequence of four nucleotide bases (A, C, G, T), making it a natural fit for sequence modeling. DNA language models learn the "grammar" of genomes: regulatory elements, coding regions, splice sites, and evolutionary constraints.

3.1 Key Models

DNABERT (2021) pioneered the approach by treating overlapping k-mers (k=6) as tokens and applying BERT-style masked language modeling. DNABERT-2 (ICLR 2024) improved this by switching to BPE tokenization, which eliminated the information leakage problem of overlapping k-mers and reduced sequence length by approximately 5x.

The Nucleotide Transformer (Nature Methods, 2024, InstaDeep/Google DeepMind) scaled to 2.5 billion parameters trained on multi-species genomes, achieving strong performance on variant effect prediction, promoter identification, and splice site detection.

The most ambitious effort is Evo-2 (Nature, 2025, Arc Institute/NVIDIA): a 40-billion parameter model trained on over 9 trillion nucleotides with a 1-million base-pair context window. Evo-2 uses the StripedHyena 2 architecture (a hybrid of attention and state-space layers) and operates at single-nucleotide resolution. It can predict BRCA1 variant pathogenicity with over 90% accuracy without fine-tuning, and has demonstrated the ability to generate functional DNA sequences.

3.2 Tokenization Deep Dive

Original DNABERT used overlapping 6-mers, producing a vocabulary of roughly 4,096 tokens. DNABERT-2 switched to Byte Pair Encoding, learning merges directly from genomic data. Evo models take the simplest approach: single-nucleotide resolution with a vocabulary of just 4 bases plus special tokens. The tradeoff is between sequence compression (BPE reduces length but loses resolution) and fine-grained modeling (single-nucleotide captures every mutation but requires very long contexts).

# Comparing DNA tokenization strategies
# Single-nucleotide vs k-mer vs BPE approaches

sequence = "ATCGATCGATCG" * 100 # 1200bp genomic fragment

# Strategy 1: Single nucleotide (Evo-style)
single_tokens = list(sequence)
print(f"Single nucleotide: {len(single_tokens)} tokens, vocab=4")

# Strategy 2: k-mer (original DNABERT, k=6)
kmers = [sequence[i:i+6] for i in range(len(sequence) - 5)]
print(f"6-mer (overlapping): {len(kmers)} tokens, vocab=~4096")

# Strategy 3: Non-overlapping k-mer
non_overlap = [sequence[i:i+6] for i in range(0, len(sequence) - 5, 6)]
print(f"6-mer (non-overlapping): {len(non_overlap)} tokens, vocab=~4096")
Single nucleotide: 1200 tokens, vocab=4
6-mer (overlapping): 1195 tokens, vocab=~4096
6-mer (non-overlapping): 200 tokens, vocab=~4096
Code Fragment 34.10.1: Comparing DNA tokenization strategies
Key Takeaways
Key Insight

Neural audio codecs solve the fundamental dimensionality problem: they compress 24,000 samples/second of raw audio into roughly 75 tokens/second of discrete codes, making audio generation tractable for transformer-scale models. The multi-level codebook hierarchy captures different aspects of the signal: coarse codebooks encode semantic content (what is being said or played), while fine codebooks capture acoustic detail (voice timbre, recording quality).

Warning

EHR language models raise significant privacy and fairness concerns. Patient data is highly sensitive, models can encode and amplify healthcare disparities present in training data, and regulatory frameworks (HIPAA, GDPR) impose strict constraints on how these models can be trained and deployed. Federated learning and differential privacy are active areas of research for addressing these challenges.

Key Insight: The Tokenizer Is the Theory

In NLP, the tokenizer encodes linguistic assumptions (word boundaries, subword structure). In genomics, it encodes biological priors (k-mer frequencies, nucleotide resolution). In time series, it encodes statistical assumptions (quantization granularity, scaling). In every domain, the choice of tokenizer embodies a theory about what matters in the data. Getting the tokenizer right is often more important than architectural innovations.

Research Frontier

Whole-genome foundation models like Evo-2 can now process sequences of over 1 million base pairs, approaching chromosome-scale context.

Open questions include whether these models can learn long-range regulatory interactions (enhancer-promoter loops spanning 100kb+), and whether they can be fine-tuned for clinical variant interpretation at scale. The Caduceus model (2024) explores bidirectional DNA modeling using the Mamba architecture, suggesting that state-space models may be better suited than transformers for the ultra-long sequences typical in genomics.

Library Shortcut: BioNeMo for Genomic Language Models

The Nucleotide Transformer and DNABERT-2 are available on HuggingFace. NVIDIA's BioNeMo platform provides pretrained genomic models with fine-tuning APIs. Evo-2 weights are released through the Arc Institute's GitHub.

4. Protein Language Models

Proteins are sequences of amino acids drawn from a 20-letter alphabet, making them one of the most natural non-text applications of language modeling. Protein language models learn evolutionary constraints, structural preferences, and functional signatures directly from sequence data.

4.1 The ESM Family

Meta's ESM-2 (2023) scaled masked language modeling on protein sequences to 15 billion parameters. The key insight: the internal representations learned by ESM-2 encode 3D structural information so accurately that ESMFold can predict protein structure from a single sequence, approaching AlphaFold2 accuracy without requiring multiple sequence alignments.

ESM-3 (2024, EvolutionaryScale) extended the paradigm to 98 billion parameters with multimodal conditioning on sequence, structure, and function simultaneously. In a landmark result, ESM-3 designed a novel green fluorescent protein (esmGFP) with only about 58% sequence identity to the closest known fluorescent protein, a sequence distance comparable to hundreds of millions of years of natural divergence, demonstrating genuine protein design capability.

4.2 Tokenization

Protein tokenization is straightforward: each of the 20 canonical amino acids maps to one token, with special tokens for masking, padding, and non-standard residues (total vocabulary of approximately 33 tokens). Some newer work explores BPE-style sub-residue tokenization for amino acid subsequences, but single-residue tokenization remains dominant due to the biological significance of individual amino acid positions.

# Protein sequence embedding with ESM-2
# Each amino acid residue becomes a single token
import torch

# Example: insulin B-chain (30 residues)
sequence = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
print(f"Sequence length: {len(sequence)} residues")
print(f"Amino acid vocabulary: {sorted(set(sequence))}")
print(f"Unique residues used: {len(set(sequence))}/20 canonical")

# With ESM-2 3B (esm2_t36_3B_UR50D), each residue position gets a 2560-dim
# embedding that encodes evolutionary, structural, and functional information.
embedding_dim = 2560 # ESM-2 (3B) hidden dimension
print(f"Embedding shape: ({len(sequence)}, {embedding_dim})")
Sequence length: 30 residues
Amino acid vocabulary: ['A', 'C', 'E', 'F', 'G', 'H', 'K', 'L', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'Y']
Unique residues used: 16/20 canonical
Embedding shape: (30, 2560)
Code Fragment 34.10.2: Protein sequence embedding with ESM-2
Library Shortcut: ESM and AlphaFold3 for Protein Language Models

ESM models are available via HuggingFace Transformers (facebook/esm2_t36_3B_UR50D) and Meta's fair-esm library. EvolutionaryScale provides an API for ESM-3. Google DeepMind's AlphaFold3 (2024) extends structure prediction to protein-DNA-RNA-ligand complexes.

5. Molecular Design: Chemistry as Language

Molecules can be represented as text strings using SMILES (Simplified Molecular Input Line Entry System), a linear encoding of molecular graphs. This insight lets researchers apply standard language models to drug discovery and molecular design.

For example, aspirin is represented as CC(=O)OC1=CC=CC=C1C(=O)O in SMILES notation. Each character or multi-character symbol becomes a token, and an autoregressive model can generate novel molecules by sampling token sequences.

5.1 Key Models

MolGPT (2021) applied GPT-style next-token prediction to SMILES strings for drug-like molecule generation. ChemBERTa-2 (2022, DeepChem) pretrained a RoBERTa model on 77 million SMILES samples for molecular property prediction. MoLFormer (2021, IBM Research) trained an efficient transformer with rotary positional embeddings on 1.1 billion unlabeled SMILES sequences.

A key challenge is that SMILES can produce syntactically invalid molecules. The SELFIES (Self-Referencing Embedded Strings) notation guarantees syntactic validity by construction, making it attractive for generative models where every sampled sequence must decode to a valid molecule.

# SMILES tokenization for molecular language models
# Each molecule becomes a sequence of character-level tokens

molecules = {
 "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
 "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
 "penicillin_g": "CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C",
}

for name, smiles in molecules.items():
 # Character-level tokenization (simplest approach)
 tokens = list(smiles)
 print(f"{name:15s}: {len(smiles):3d} chars, "
 f"unique tokens: {len(set(tokens))}")
aspirin        :  24 chars, unique tokens: 6
caffeine       :  28 chars, unique tokens: 8
penicillin_g   :  47 chars, unique tokens: 10
Code Fragment 34.10.3: SMILES tokenization for molecular language models
Library Shortcut: RDKit for Molecular Language Models

RDKit handles SMILES parsing and molecular graph operations. DeepChem provides ChemBERTa models and molecular featurization. HuggingFace hosts multiple molecular language models. TorchDrug offers a PyTorch framework for drug discovery tasks.

6. Time Series Forecasting

Time series data (sensor readings, stock prices, energy consumption, weather observations) consists of continuous numerical values sampled at regular intervals. The key insight that enabled LLM-based forecasting was treating time series prediction as a classification problem over quantized bins.

6.1 Chronos: Regression via Classification

Amazon's Chronos (March 2024) pioneered a remarkably simple approach: scale the time series by its absolute mean, then quantize values into 4,096 uniformly spaced bins between -15 and +15. Each bin index becomes a discrete token. A T5-based encoder-decoder model is then trained with standard cross-entropy loss to predict the next token (bin) in the sequence.

Chronos-2 (October 2025) extended this approach to multivariate time series with group attention and covariate-informed forecasting.

6.2 Other Approaches

TimeGPT (2023, Nixtla) was trained on over 100 billion data points across finance, healthcare, weather, and IoT domains. Lag-Llama (2024) uses a decoder-only transformer with lagged features for probabilistic forecasting. TimesFM (2024, Google) patches consecutive time points into groups and processes them as tokens. Time-LLM (ICLR 2024) reprograms general LLMs like LLaMA for time series without full retraining.
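TimesFM's patching step is easy to visualize: consecutive values are grouped into fixed-length patches, and each patch (not each point) becomes one transformer token via a learned projection. A minimal sketch, with the patch length chosen for illustration:

```python
import numpy as np

series = np.arange(64, dtype=float)   # toy time series of 64 points
patch_len = 8                          # assumed patch length

# Group consecutive points: the transformer sees 8 patch-tokens, not 64 points.
patches = series.reshape(-1, patch_len)
print(patches.shape)   # (8, 8): sequence length drops by the patch length
```

In the real model a linear layer maps each patch to a d_model-dimensional embedding; the point here is only that patching trades sequence length for token dimensionality.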

# Chronos-style time series tokenization
# Continuous values -> scaled -> quantized into discrete bins
import numpy as np

# Simulated daily temperature readings (Celsius)
temperatures = np.array([18.2, 19.1, 17.8, 20.5, 22.1, 21.3, 19.7,
 18.5, 23.0, 24.2, 22.8, 21.0, 19.5, 18.0])

# Step 1: Scale by absolute mean (Chronos normalization)
abs_mean = np.abs(temperatures).mean()
scaled = temperatures / abs_mean
print(f"Abs mean: {abs_mean:.2f}, Scaled range: [{scaled.min():.3f}, {scaled.max():.3f}]")

# Step 2: Quantize into N bins between [-15, +15]
n_bins = 4096
bin_edges = np.linspace(-15, 15, n_bins + 1)
tokens = np.digitize(scaled, bin_edges) - 1 # 0-indexed bin IDs
tokens = np.clip(tokens, 0, n_bins - 1)

print(f"Token IDs (first 7): {tokens[:7]}")
print(f"Token range: [{tokens.min()}, {tokens.max()}] out of {n_bins} bins")
Abs mean: 20.41, Scaled range: [0.872, 1.186]
Token IDs (first 7): [2169 2175 2167 2185 2195 2190 2179]
Token range: [2167, 2209] out of 4096 bins
Code Fragment 34.10.4: Chronos-style time series tokenization
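Decoding a Chronos-style forecast reverses the two tokenization steps: map each sampled bin ID back to its bin center, then rescale by the mean stored at encoding time. A minimal sketch (the token IDs and scale below are illustrative values, not model output):

```python
import numpy as np

# Same quantization grid as the encoder: 4096 bins spanning [-15, +15].
n_bins = 4096
bin_edges = np.linspace(-15, 15, n_bins + 1)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

abs_mean = 20.41                                   # scale saved at tokenization time
predicted_tokens = np.array([2169, 2175, 2185])    # bin IDs sampled from the model

# Bin ID -> bin center -> original units.
forecast = bin_centers[predicted_tokens] * abs_mean
print(np.round(forecast, 2))
```

Sampling many token sequences and decoding each one yields a full predictive distribution, which is how Chronos produces probabilistic forecasts from a purely discrete model.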
Library Shortcut: Chronos for Time Series Forecasting

Amazon's chronos-forecasting package provides pretrained models on HuggingFace. Nixtla's neuralforecast offers TimeGPT and other neural forecasting models. Lag-Llama is available on HuggingFace for probabilistic forecasting.

7. Audio and Music: Neural Codec Tokens

Raw audio is a continuous waveform sampled at 16,000 to 48,000 Hz, far too high-dimensional for direct token prediction. The breakthrough was neural audio codecs (SoundStream, EnCodec) that compress waveforms into discrete token sequences using residual vector quantization (RVQ).
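The compression ratio falls out of the codec's downsampling factor. A back-of-envelope sketch, assuming EnCodec-style settings (24 kHz audio, a total encoder stride of 320, 4 RVQ codebooks; these numbers are typical, not prescribed):

```python
sample_rate = 24_000       # raw audio samples per second
hop = 320                  # total encoder downsampling factor (assumed)
codebooks = 4              # RVQ levels per frame (assumed)

frames_per_s = sample_rate // hop          # codec frames per second
tokens_per_s = frames_per_s * codebooks    # discrete tokens per second
compression = sample_rate / frames_per_s   # positions saved vs. raw samples

print(frames_per_s, tokens_per_s, compression)
```

At 75 frames per second, a 30-second clip becomes a few thousand tokens, squarely within ordinary transformer context lengths.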

Google's AudioLM (2023) demonstrated that a language model trained on these codec tokens can generate coherent speech continuations (maintaining speaker identity and prosody) and even piano music, all without any text transcripts. SoundStorm (2023) improved generation speed by two orders of magnitude using non-autoregressive parallel decoding.

Meta's MusicGen (2023) generates music conditioned on text descriptions using EnCodec tokens with 4 codebooks at 32kHz. The model (available in 300M, 1.5B, and 3.3B parameter sizes) handles the hierarchical structure of music (melody, harmony, rhythm, timbre) through a single autoregressive transformer over interleaved codebook tokens.

Microsoft's VALL-E (2023) demonstrated zero-shot voice cloning from just 3 seconds of reference audio by treating text-to-speech as a codec language modeling problem.
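The residual quantization idea behind these codecs can be sketched in a few lines. The codebooks and input frame below are random stand-ins; real codecs learn the codebooks and quantize encoder features, not raw numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
# 3 quantization levels, each an 8-entry codebook of 4-dim vectors (toy sizes).
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]

def rvq_encode(frame, codebooks):
    """Quantize `frame` level by level; each level codes the remaining residual."""
    residual, tokens = frame.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]   # next level refines what this level missed
    return tokens, residual

frame = rng.normal(size=4)
tokens, residual = rvq_encode(frame, codebooks)
print(tokens)                                   # one discrete token per level
print(round(float(np.linalg.norm(residual)), 3))  # final quantization error
```

The per-level structure is why coarse codebooks carry semantic content and fine codebooks carry acoustic detail: each level only sees what the previous levels failed to explain.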

8. Electronic Health Records: Medical Events as Tokens

A patient's medical history is a sequence of clinical events: diagnoses (coded as ICD-10), procedures, medications, lab results, and visits. BEHRT (2020, University College London) pioneered treating this record like text: each clinical code is a "word," each visit is a "sentence," and the full patient history is the "document."

BEHRT applies BERT-style masked prediction to learn which diagnoses, procedures, and medications tend to co-occur and follow each other. It predicts 301 conditions simultaneously with 8 to 13% improvement over prior deep EHR models.

Hi-BEHRT (2023) added hierarchical attention to handle multimodal longitudinal data (lab values, free-text notes, procedure codes). Multimodal BEHRT (2025) extends the framework to include clinical features, lab results, procedures, and free-text reports, with applications in cancer prognosis.
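BEHRT-style serialization can be sketched directly (the visits and codes below are hypothetical; the real model also adds age and visit-position embeddings alongside each token):

```python
# Each clinical code is a token; visits are delimited like sentences.
visits = [
    ["ICD10:E11.9", "ICD10:I10"],    # visit 1: type 2 diabetes, hypertension
    ["ICD10:I10", "PROC:93000"],     # visit 2: hypertension, ECG procedure
]

tokens = ["[CLS]"]
for visit in visits:
    tokens += visit + ["[SEP]"]

print(tokens)   # one flat sequence, ready for BERT-style masked pretraining
```

Masking random codes in sequences like this and predicting them is exactly the BERT objective, applied to a "language" of diagnoses and procedures.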

9. Robotics: Actions as Tokens

A robot's action space (joint angles, gripper commands, navigation waypoints) can be discretized into bins and expressed as token sequences, enabling vision-language models to control physical robots through language-style generation.

Google DeepMind's RT-2 (2023) fine-tuned a large vision-language model on robot demonstration data. Each action dimension (x, y, z translation, rotation, gripper state) is discretized into 256 bins, converted to integer strings, and appended to the model's token stream. The model generates actions by predicting the next tokens, just as it would generate text.
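The discretization step can be sketched as follows. The 256-bin count comes from RT-2; the action values and the uniform [-1, 1] range are illustrative assumptions:

```python
import numpy as np

# A 7-DoF action: x, y, z translation, roll, pitch, yaw, gripper (made-up values).
action = np.array([0.12, -0.45, 0.80, 0.0, 0.33, -0.1, 1.0])
low, high, n_bins = -1.0, 1.0, 256

# Uniformly bin each dimension, then render the bin IDs as a text string.
bins = np.clip(((action - low) / (high - low) * n_bins).astype(int), 0, n_bins - 1)
action_string = " ".join(str(b) for b in bins)
print(action_string)   # appended to the VLM's output stream like ordinary text
```

Because the action is just a short run of integer tokens, the same sampling loop that generates captions can generate motor commands.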

OpenVLA (June 2024, Stanford) released a 7-billion parameter open-source vision-language-action model trained on the Open X-Embodiment dataset (over 1 million episodes across 22 different robot embodiments). DeepMind's Gato (2022) demonstrated the ultimate generalist approach: a single 1.2B transformer that plays 46 Atari games, controls robots, captions images, and chats, all by serializing every modality into a single flat token stream.

For deeper coverage of vision-language-action models and robotic planning, see Section 27.5: Embodied Multimodal Agents and Section 27.6: LLM-Powered Robotics.

10. Other Frontiers

10.1 Weather and Climate

Weather prediction models tokenize gridded atmospheric fields into spatial patches (analogous to Vision Transformer patches). Aurora (Nature, 2025, Microsoft Research) uses Perceiver-style cross-attention to compress pressure-level data into latent tokens processed by a 3D Swin Transformer. GraphCast (Nature, 2023, Google DeepMind) uses graph-based tokenization on icosahedral grids. These models now match or exceed the accuracy of traditional numerical weather prediction systems at a fraction of the computational cost.
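Spatial patching is the same reshape trick used by Vision Transformers. A sketch on a toy gridded field (grid and patch sizes are assumptions):

```python
import numpy as np

field = np.arange(64 * 64, dtype=float).reshape(64, 64)   # toy 64x64 pressure field
p = 16                                                     # patch size

# Cut the grid into non-overlapping p x p patches, one token vector per patch.
patches = field.reshape(64 // p, p, 64 // p, p).swapaxes(1, 2).reshape(-1, p * p)
print(patches.shape)   # 16 tokens of dimension 256, instead of 4096 grid cells
```

Real weather models do this per atmospheric variable and pressure level, then let attention mix information across patches, levels, and time steps.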

10.2 Mathematical Theorem Proving

Formal proof languages (Lean 4, Isabelle, Coq) are tokenized with standard BPE, and LLMs generate proof tactics autoregressively. AlphaProof (Google DeepMind, Nature 2025) achieved a silver medal at the International Mathematical Olympiad 2024 by generating formally verified Lean proofs. DeepSeek-Prover-V2 (2025) and Harmonic Aristotle (2025) have pushed further, with Aristotle achieving gold-medal level performance on 2025 IMO problems.

10.3 Tabular Data

TabPFN (University of Freiburg) treats tabular prediction as in-context learning: each row becomes a token, and the model uses two-way attention (across features within a row, and across rows for the same feature). TabPFN-2.5 (2025) scales to approximately 100,000 data points and 2,000 features, achieving competitive performance with gradient-boosted trees (XGBoost, LightGBM) with zero hyperparameter tuning in a single forward pass.

10.4 Financial Sequences

Kronos (2025) developed a specialized tokenizer for financial candlestick (OHLCV) data, pre-trained autoregressively on price/volume sequences. FinGPT (open-source) takes a data-centric approach to financial LLMs. The challenge in financial applications is that market microstructure creates complex temporal dependencies that differ fundamentally from the statistical patterns in natural language.

10.5 Game Playing and Decision Making

The Decision Transformer (2021, Google Brain/UC Berkeley) reframed reinforcement learning as sequence modeling: condition on desired return, past states, and past actions, then predict the next action. The Multi-Game Decision Transformer (2022) trained a single model that achieves 126% of human-level performance across 41 Atari games, with performance scaling predictably with model size.
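The Decision Transformer's input layout can be sketched directly (the returns, states, and actions below are placeholders): the trajectory is serialized as (return-to-go, state, action) triples, and the model learns to predict each action from everything before it.

```python
# A toy trajectory: return-to-go shrinks as reward is collected.
trajectory = [
    (90, "s0", "a0"),
    (75, "s1", "a1"),
    (50, "s2", "a2"),
]

stream = []
for rtg, state, action in trajectory:
    stream += [f"R={rtg}", state, action]

print(stream)
# At inference time, condition on a high desired return-to-go and let the
# model decode the actions it predicts would achieve it.
```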

11. The Unifying Framework

Across all these domains, a consistent pattern emerges. The core innovation is never the model architecture (transformers work well everywhere); it is always the tokenization strategy that bridges domain-specific data to the general-purpose sequence modeling framework.

This has a practical implication: when bringing LLMs to a new domain, invest engineering effort in the tokenizer first, because everything downstream (architecture, training objectives, scaling recipes, evaluation tooling) can be borrowed from NLP.

Self-Check
  1. You are working with satellite imagery time series. Design a tokenization strategy that would allow a transformer to forecast future images. What are the tradeoffs between spatial patching, temporal quantization, and neural codec approaches?
  2. A pharmaceutical company asks you to build a model that generates novel drug candidates. Would you use SMILES or SELFIES tokenization? What are the pros and cons of each?
  3. Why might BPE tokenization work better than single-nucleotide tokenization for some genomics tasks, and worse for others?
Research Frontier

Cross-domain foundation models are the next frontier. Can a single model trained on text, protein sequences, molecular SMILES, and genomic data develop shared representations that transfer between domains?

Early evidence from models like Gato and multimodal ESM-3 suggests yes, but the optimal architecture for multi-domain sequence modeling remains an open question.

Another active area is tokenizer learning: instead of hand-designing domain tokenizers, can we learn optimal tokenization end-to-end as part of the training objective?

What's Next?

Having surveyed how transformers have become universal sequence machines across domains from genomics to robotics, we turn to the broader societal implications. In Chapter 35: AI and Society, we examine how these capabilities reshape workforce dynamics, governance challenges, and the long-term trajectory of artificial intelligence.

References & Further Reading
Genomics

Zhou, Z., et al. (2024). "DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome." ICLR 2024.

Introduces byte-pair encoding tokenization for DNA sequences, replacing fixed k-mer approaches and achieving state-of-the-art on genomic benchmarks across species. Shows how NLP tokenization innovations transfer directly to biology.


Dalla-Torre, H., et al. (2024). "The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics." Nature Methods.

Trains transformer models on 3,200 diverse genomes, achieving strong zero-shot performance on downstream tasks. Demonstrates that the pretraining paradigm scales to genomic data at the multi-species level.


Brixi, G., et al. (2025). "Genome modeling and design across all domains of life with Evo 2." Nature.

Trains models at 7 billion and 40 billion parameters on 9.3 trillion nucleotide tokens spanning all domains of life. Represents the current frontier of genomic foundation models, capable of generating functional DNA sequences.

Proteins

Lin, Z., et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science.

Shows that protein language models trained on evolutionary sequences alone can predict 3D structure, rivaling specialized structure prediction methods. Demonstrates that self-supervised pretraining captures deep biological knowledge.


Hayes, T., et al. (2024). "Simulating 500 million years of evolution with a language model." bioRxiv (ESM-3).

Scales protein language models to jointly reason over sequence, structure, and function, generating novel fluorescent proteins not found in nature. The most ambitious protein generation result at the time of writing.

Molecules

Bagal, V., et al. (2021). "MolGPT: Molecular Generation Using a Transformer-Decoder Model." J. Chem. Inf. Model.

Applies GPT-style autoregressive generation to SMILES molecular strings, producing valid and drug-like molecules. An early and influential demonstration that text generation architectures work for molecular design.


Ahmad, W., et al. (2022). "ChemBERTa-2: Towards Chemical Foundation Models." arXiv:2209.01712.

Applies masked language modeling to chemical SMILES representations, creating foundation models for molecular property prediction. Extends the BERT pretraining paradigm to chemistry.

Time Series

Ansari, A. F., et al. (2024). "Chronos: Learning the Language of Time Series." arXiv:2403.07815.

Tokenizes real-valued time series into discrete bins and trains a language model on them, achieving strong zero-shot forecasting. A clean demonstration of the "tokenize anything" philosophy discussed in this section.


Rasul, K., et al. (2024). "Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting." arXiv:2310.08278.

Builds a foundation model for time series using a decoder-only transformer with lagged features. Shows that the LLM pretraining recipe transfers to probabilistic forecasting.


Jin, M., et al. (2024). "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." ICLR 2024.

Reprograms frozen text LLMs to process time series through learned input transformations, avoiding expensive retraining. Demonstrates that existing language model knowledge can be leveraged for entirely non-linguistic domains.

Audio and Music

Borsos, Z., et al. (2023). "AudioLM: a Language Modeling Approach to Audio Generation." IEEE/ACM TASLP.

Treats audio generation as language modeling over discrete audio tokens, producing speech and music with remarkable coherence. The foundational work that established the codec language model paradigm for audio.


Copet, J., et al. (2023). "Simple and Controllable Music Generation." NeurIPS 2023 (MusicGen).

Generates high-quality music from text descriptions using a single-stage transformer over compressed audio tokens. Demonstrates that text-conditioned generation works across creative audio domains.


Wang, C., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv:2301.02111 (VALL-E).

Achieves zero-shot voice cloning from a 3-second sample by treating speech synthesis as conditional language modeling over codec tokens. A breakthrough in applying the in-context learning paradigm to speech.

EHR and Medical

Li, Y., et al. (2020). "BEHRT: Transformer for Electronic Health Records." Scientific Reports.

Applies BERT-style pretraining to sequences of medical codes from electronic health records, predicting future diagnoses. An early and influential example of treating patient histories as a "language" for transformers.

Robotics and Actions

Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818.

Directly outputs robot actions as text tokens from a vision-language model, transferring web-scale knowledge to physical manipulation. Demonstrates the strongest evidence that language model pretraining aids robotic reasoning.


Kim, M., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246.

Provides an open-source 7B vision-language-action model for robotic manipulation, making frontier robotic AI accessible to the research community. The leading open alternative to proprietary robotic foundation models.


Reed, S., et al. (2022). "A Generalist Agent." arXiv:2205.06175 (Gato).

The first model to handle text, images, and robot actions within a single transformer, playing games, chatting, and stacking blocks. A proof of concept for the "one model, many modalities" vision.

Tabular, Weather, and Mathematics

Hollmann, N., et al. (2024). "Accurate predictions on small data with a tabular foundation model." Nature (TabPFN-2).

A transformer foundation model for tabular prediction that achieves strong performance with no gradient-based training on new datasets. Challenges the assumption that tabular data requires specialized algorithms like gradient boosting.


Bodnar, C., et al. (2025). "Aurora: A Foundation Model of the Atmosphere." Nature.

Applies transformer-based foundation models to weather prediction, matching or exceeding specialized numerical weather models. Demonstrates that the foundation model paradigm works for complex physical simulations.


AlphaProof team (2025). "Formal Mathematical Reasoning: A New Frontier in AI." Nature.

Combines language models with formal proof verification to solve International Mathematical Olympiad problems. Represents the frontier of AI mathematical reasoning, connecting natural language and formal logic.
