"First, do no harm. Second, do no hallucination. The stakes here are measured in lives, not loss functions."
Deploy, Hippocratic AI Agent
Healthcare represents both the highest-stakes and highest-potential domain for LLM applications. Medical LLMs can assist with clinical documentation, diagnostic reasoning, patient communication, drug discovery, and literature synthesis. However, the consequences of errors are severe: incorrect medical information can directly harm patients. This creates a unique tension between the transformative potential of AI in healthcare and the stringent safety, privacy, and regulatory requirements that govern medical practice. The fine-tuning techniques from Chapter 14 are frequently used to adapt general models to medical domains.
Prerequisites
This section assumes understanding of LLM application patterns from Section 28.1 and prompt engineering from Section 11.1. Familiarity with RAG from Section 20.1 provides context for the retrieval-augmented application patterns covered here.
1. Medical LLMs
General-purpose LLMs perform surprisingly well on medical benchmarks. GPT-4 answered roughly 90% of United States Medical Licensing Examination (USMLE) style questions correctly, far above the passing threshold.
GPT-4 scored higher on the USMLE than most medical students on their first attempt. It also cannot take your blood pressure, which keeps human doctors employed for now.
However, medical LLMs fine-tuned on clinical data offer advantages in understanding medical terminology, following clinical reasoning patterns, and generating responses appropriate for healthcare contexts.
| Model | Base | Training Focus | Notable Result |
|---|---|---|---|
| Med-PaLM 2 | PaLM 2 | Medical QA, clinical reasoning | 86.5% on MedQA (expert level) |
| PMC-LLaMA | LLaMA | PubMed Central papers | Open-source biomedical LLM |
| BioMistral | Mistral | Biomedical literature | Strong on clinical NLP tasks |
| Meditron | LLaMA 2 | Medical guidelines, PubMed | Clinical guideline adherence |
High benchmark scores on medical exams do not translate directly to safe clinical deployment. GPT-4 scoring 90% on the USMLE is impressive but misleading as a measure of clinical readiness. The USMLE tests factual recall and reasoning on well-defined multiple-choice questions. Real clinical practice involves ambiguous symptoms, incomplete patient histories, time pressure, emotional context, and the need to know when to say "I do not know." The gap between benchmark performance and clinical safety is where the real engineering challenge lies.

Production medical AI systems require: RAG over verified medical databases (not training data recall) as covered in Chapter 20, mandatory clinician review for all patient-facing outputs, comprehensive audit trails, and explicit uncertainty quantification. The healthcare agent patterns from Section 25.5 provide architectural guidance for building these safeguards into the system.
When evaluating medical LLMs, never rely solely on benchmark accuracy. Run a "harm audit" with your clinical team: sample 100 outputs on real patient scenarios and flag any response that could cause harm if a busy clinician acted on it without double-checking. A model with 90% accuracy and 1% harmful-error rate is far more dangerous than a model with 85% accuracy and 0.1% harmful-error rate. The denominator that matters in healthcare is not "questions answered correctly" but "patients not harmed."
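The arithmetic behind that comparison is worth making explicit. The sketch below uses the two hypothetical models from the text and computes expected harmful outputs per 10,000 queries; the figures are illustrative, not measurements of any real system.

```python
# Expected harmful outputs per 10,000 queries for two hypothetical models.
# Model A: higher accuracy, but a 10x higher harmful-error rate than model B.
QUERIES = 10_000

models = {
    "A (90% acc, 1.0% harmful)": {"accuracy": 0.90, "harmful_error_rate": 0.010},
    "B (85% acc, 0.1% harmful)": {"accuracy": 0.85, "harmful_error_rate": 0.001},
}

for name, m in models.items():
    expected_harms = QUERIES * m["harmful_error_rate"]
    print(f"{name}: ~{expected_harms:.0f} potentially harmful outputs "
          f"per {QUERIES} queries")
```

Model A produces roughly 100 potentially harmful outputs per 10,000 queries versus model B's 10, despite A's higher headline accuracy: the harm-audit denominator, not benchmark accuracy, is what should drive the deployment decision.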
2. Clinical NLP Applications
Clinical NLP processes the vast amount of unstructured text in electronic health records (EHRs). Progress notes, discharge summaries, radiology reports, and pathology findings contain critical clinical information that is difficult to query or analyze in text form. The NLP text representation techniques from Chapter 01 provide the foundations for understanding how these models process clinical language. LLMs can extract structured data from these notes, identify patients matching clinical trial criteria, detect adverse drug events, and summarize patient histories. Figure 28.3.1 shows the clinical NLP pipeline for processing EHR text. The code fragment below puts this into practice.
# Clinical named-entity recognition (NER) on an EHR note
# Key operations: entity extraction, results display
from transformers import pipeline

# Clinical NER using a biomedical model
clinical_ner = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple",
)

clinical_note = """Patient presents with persistent cough and shortness of breath
for 2 weeks. History of Type 2 diabetes managed with metformin 500mg.
Chest X-ray shows bilateral infiltrates. Started on azithromycin
and referred for pulmonary function testing."""

entities = clinical_ner(clinical_note)
for ent in entities:
    print(f"{ent['entity_group']:>20}: {ent['word']} ({ent['score']:.3f})")
3. Medical Question Answering
This snippet queries a general chat model with a safety-focused system prompt. Note that it relies on the model's parametric knowledge; a production system would ground the same prompt with retrieved guideline text (RAG) before generating an answer.
# Medical QA with safety guardrails in the system prompt
# Key operations: API interaction, results display
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a medical information assistant for clinicians.
Provide evidence-based answers citing relevant guidelines and studies.
Always note the level of evidence. Flag when a question requires
specialist consultation. Never provide direct patient treatment
recommendations without specifying they need clinical validation."""},
        {"role": "user", "content": """What are the current first-line treatments for
newly diagnosed Type 2 diabetes in adults with HbA1c between 7-8%?"""},
    ],
)
print(response.choices[0].message.content)
4. Drug Discovery and Molecular Generation
LLMs trained on chemical and molecular data can generate novel drug candidates, predict molecular properties, and optimize lead compounds. These models treat molecules as sequences using SMILES (Simplified Molecular Input Line Entry System) notation and apply the same autoregressive generation techniques used for text. For example, aspirin is represented as CC(=O)Oc1ccccc1C(=O)O; each character becomes a token in the model's vocabulary.
Key models in this space include ChemBERTa-2 (2022, DeepChem), a RoBERTa model pretrained on 77 million SMILES samples for property prediction; MolGPT (2021), which applies GPT-style next-token prediction for de novo molecule generation; and MoLFormer (IBM Research), trained on 1.1 billion SMILES sequences with rotary positional embeddings. Code Fragment 28.3.2 shows property prediction in practice.
# Molecular property prediction with a chemistry LLM
# ChemBERTa encodes SMILES strings. Caveat: the sequence-classification head
# on this base checkpoint is randomly initialized, so the printed
# probabilities are only meaningful after fine-tuning on a property dataset.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# SMILES representations of common drug molecules
molecules = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
}

for name, smiles in molecules.items():
    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.softmax(dim=-1)
    print(f"{name:12s}: {prediction[0].tolist()}")
A notable challenge with SMILES is that random token generation can produce syntactically invalid molecules. SELFIES (Self-Referencing Embedded Strings) is an alternative notation that guarantees every generated string decodes to a valid molecule, making it attractive for generative models. Libraries like RDKit handle SMILES parsing and molecular graph operations, while DeepChem provides pretrained chemistry models and featurization tools.
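The validity problem can be illustrated without any chemistry library. The toy checker below tests only two syntactic constraints, balanced parentheses and paired ring-closure digits; real chemical validity (valence, aromaticity, stereochemistry) requires a full parser such as RDKit's, and `plausible_smiles` is an illustrative name, not a standard API.

```python
from collections import Counter

def plausible_smiles(s: str) -> bool:
    """Toy syntactic check: balanced parentheses and paired ring-closure
    digits. Real chemical validity needs a parser like RDKit."""
    depth = 0
    ring_digits = Counter()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_digits[ch] += 1
    # Every ring-closure digit must open and close, i.e. appear an even
    # number of times, and all parentheses must be closed.
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

print(plausible_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: True
print(plausible_smiles("CC(=O)Oc1ccccc1C(=O"))    # truncated: False
```

Even this crude filter rejects many randomly generated strings, which is exactly the failure mode SELFIES eliminates by construction.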
Beyond property prediction, molecular LLMs enable retrosynthesis prediction (working backward from a target molecule to identify synthesis routes) and property-constrained generation (generating molecules that satisfy multiple pharmacological constraints simultaneously, such as binding affinity, solubility, and low toxicity). For a deeper treatment of tokenization strategies and cross-domain comparisons, see Section 34.10: Beyond Text.
5. Protein Structure and Genomics
5.1 Protein Language Models
Proteins are sequences of amino acids drawn from a 20-letter alphabet, making them one of the most natural non-text applications of language modeling. ESM-2 (2023, Meta FAIR) scaled masked language modeling on protein sequences to 15 billion parameters. The internal representations learned by ESM-2 encode 3D structural information so accurately that ESMFold can predict protein structure from a single sequence, approaching AlphaFold2 accuracy without requiring multiple sequence alignments.
ESM-3 (2024, EvolutionaryScale) extended the paradigm to 98 billion parameters with multimodal conditioning on sequence, structure, and function simultaneously. In a landmark result, ESM-3 designed a novel green fluorescent protein (esmGFP) with only about 58% sequence identity to the closest known natural fluorescent protein, demonstrating genuine protein design capability. AlphaFold 3 (2024, Google DeepMind) uses a diffusion-based architecture to predict 3D structures of proteins, nucleic acids, ligands, and their complexes.
# Protein embedding with ESM-2
# Each amino acid residue becomes one token (vocab ~33)
from transformers import AutoTokenizer, AutoModel
import torch

model_id = "facebook/esm2_t6_8M_UR50D"  # smallest ESM-2 for demo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Insulin B-chain (30 amino acid residues)
sequence = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings encode structural and functional information
embeddings = outputs.last_hidden_state
print(f"Sequence length: {len(sequence)} residues")
print(f"Embedding shape: {embeddings.shape}")  # (1, 32, 320) with CLS/EOS
print(f"Unique amino acids: {len(set(sequence))}/20 canonical")
5.2 Genomics: DNA Language Models
The genome is a sequence of four nucleotide bases (A, C, G, T), making DNA another natural fit for language modeling. DNABERT-2 (ICLR 2024) applies BPE tokenization to genomic sequences, learning merges directly from nucleotide data. Evo-2 (Nature, 2025, Arc Institute/NVIDIA) scaled to 40 billion parameters trained on over 9 trillion nucleotides with a 1-million base-pair context window, predicting variant pathogenicity and generating functional DNA sequences.
The Nucleotide Transformer (Nature Methods, 2024, InstaDeep/Google DeepMind) reached 2.5 billion parameters for multi-species genome modeling. These models learn regulatory grammar (promoters, enhancers, splice sites) directly from sequence data, enabling applications in variant effect prediction, gene expression modeling, and synthetic biology. For detailed tokenization strategies and model comparisons, see Section 34.10: Beyond Text.
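To make the tokenization question concrete: the original DNABERT split genomes into overlapping fixed-length k-mers, while DNABERT-2's BPE learns variable-length merges from data. A minimal k-mer tokenizer (an illustrative helper, not part of any library):

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers (DNABERT v1 style).
    With stride=1 adjacent tokens overlap by k-1 bases;
    stride=k yields non-overlapping tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

dna = "ACGTACGTGGCTA"
print(kmer_tokenize(dna, k=6))            # 8 overlapping 6-mers
print(kmer_tokenize(dna, k=6, stride=6))  # ['ACGTAC', 'GTGGCT']
```

Overlapping k-mers leak information between adjacent tokens during masked pretraining, one motivation for DNABERT-2's switch to learned BPE merges.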
Healthcare LLM applications must comply with HIPAA (Health Insurance Portability and Accountability Act), which governs the use of Protected Health Information (PHI). This means: no PHI in prompts sent to cloud APIs without a Business Associate Agreement (BAA), data must be encrypted in transit and at rest, access must be logged and auditable, and minimum necessary data should be used. For clinical decision support, FDA clearance may be required depending on the intended use. Software that provides diagnostic recommendations can be regulated as a medical device (software as a medical device, SaMD), subjecting it to FDA quality-system requirements such as 21 CFR Part 820.
The regulatory pathway for medical AI is becoming clearer but remains complex. The FDA's "predetermined change control plan" allows AI systems to be updated after approval if the update process was pre-specified. This is critical for LLM-based systems that benefit from continuous improvement. The key distinction is between "AI as tool" (clinician uses AI output as one input to their decision) and "AI as autonomous decision-maker" (AI directly determines treatment). Current regulations strongly favor the former, where the human clinician retains decision authority.
HIPAA-compliant LLM deployment for healthcare (2024/2025). Running LLMs on protected health information (PHI) requires careful infrastructure choices. For cloud APIs: Azure OpenAI offers HIPAA-eligible GPT-4o deployments with a signed BAA (Business Associate Agreement); AWS Bedrock provides HIPAA-eligible access to Claude and other models; Google Vertex AI offers Gemini with HIPAA compliance under their BAA. For on-premises deployment, Llama 3 (70B, 405B) and Mistral Large can run locally using vLLM or TGI on GPU servers within the hospital's data center, so no PHI leaves the network.

Key compliance patterns: (1) apply de-identification (remove the 18 HIPAA identifiers) before sending text to external APIs whenever possible; (2) use Microsoft Presidio or philter for automated PHI redaction; (3) maintain audit logs of every LLM interaction involving patient data; (4) implement role-based access controls so only authorized clinicians can query patient-specific information.

Separately, the MedPrompt technique (Microsoft, 2023) improves medical reasoning accuracy by combining chain-of-thought prompting with self-consistency, achieving near-specialist performance on medical QA benchmarks without fine-tuning.
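To illustrate pattern-based redaction (a toy sketch only: regexes catch structured identifiers like dates and record numbers but miss names and free-text identifiers, which is why production systems should use a maintained tool such as Presidio; the pattern set and `redact_phi` helper below are illustrative, not exhaustive):

```python
import re

# Toy PHI redaction covering only a few of the 18 HIPAA identifier types.
PHI_PATTERNS = {
    "DATE":  r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "MRN":   r"\bMRN[:\s]*\d{6,10}\b",
    "SSN":   r"\b\d{3}-\d{2}-\d{4}\b",
}

def redact_phi(text: str) -> str:
    """Replace matched identifiers with bracketed type labels."""
    for label, pattern in PHI_PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

note = "Seen on 03/14/2025, MRN: 00123456, callback 555-867-5309."
print(redact_phi(note))
# -> Seen on [DATE], [MRN], callback [PHONE].
```

Keeping the type label in the redacted text (rather than deleting the span) preserves enough context for the downstream LLM to reason about the note.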
Who: Clinical informatics team at a large academic medical center
Situation: The center ran 400+ active clinical trials but only 3% of eligible patients were enrolled, largely because matching patients to trials required manual review of eligibility criteria against patient records.
Problem: Trial eligibility criteria were written in complex medical prose with implicit requirements. Patient data was scattered across EHR notes, lab results, and imaging reports in unstructured text.
Dilemma: Rule-based matching systems achieved high precision but low recall (missing eligible patients due to rigid criteria parsing). A general-purpose LLM improved recall but occasionally matched patients who had disqualifying conditions buried deep in their records.
Decision: The team built a pipeline using a fine-tuned BioMistral model for eligibility criteria extraction, combined with a clinical NER model for patient record parsing, and a final LLM verification step.
How: BioMistral extracted structured inclusion/exclusion criteria from trial protocols. A clinical NER model extracted patient conditions, medications, and lab values from EHR notes. A verification LLM cross-checked matches against exclusion criteria, with all positive matches flagged for physician review before patient contact.
Result: Trial enrollment rates increased from 3% to 11% of eligible patients. The system identified 340 additional eligible patients per month. Physician review took 2 minutes per match (versus 45 minutes for manual chart review), and zero ineligible patients were contacted.
Lesson: Healthcare LLM applications require pipeline architectures with explicit verification steps and mandatory human review, because the cost of errors in medical contexts is fundamentally different from other domains.
When debugging a RAG application, log retrieved documents separately from the generated response. This lets you quickly determine whether a bad answer came from poor retrieval (wrong documents) or poor generation (right documents, wrong interpretation).
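A minimal sketch of that logging discipline, with a stand-in retriever and generator (both hypothetical placeholders for a real vector store and LLM call):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("rag")

def answer_with_rag(question: str, retriever, generator) -> str:
    docs = retriever(question)
    # Log retrieval and generation as separate structured events, so a bad
    # answer can be traced to the stage that produced it.
    log.info(json.dumps({"stage": "retrieval", "question": question,
                         "doc_ids": [d["id"] for d in docs]}))
    answer = generator(question, docs)
    log.info(json.dumps({"stage": "generation", "answer": answer}))
    return answer

# Stand-ins for demonstration only:
fake_retriever = lambda q: [{"id": "guideline-042",
                             "text": "Metformin is first-line."}]
fake_generator = lambda q, docs: f"Per {docs[0]['id']}: {docs[0]['text']}"
print(answer_with_rag("First-line therapy for T2DM?",
                      fake_retriever, fake_generator))
```

With the two events logged separately, a reviewer can check the `doc_ids` first: if the right guideline sections were retrieved but the answer is wrong, the fault lies in generation, not retrieval.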
- Medical LLMs (Med-PaLM 2, BioMistral, Meditron) achieve expert-level performance on medical QA benchmarks but require careful deployment with safety guardrails.
- Clinical NLP extracts structured data from EHR text, enabling clinical trial matching, adverse event detection, and patient history summarization.
- Drug discovery LLMs treat molecules as SMILES/SELFIES sequences, enabling property prediction, de novo generation, and retrosynthesis planning.
- Protein language models (ESM-2, ESM-3) learn structural and functional properties from amino acid sequences; ESM-3 demonstrated novel protein design.
- DNA language models (DNABERT-2, Evo-2, Nucleotide Transformer) learn genomic grammar for variant effect prediction, gene regulation, and synthetic biology.
- HIPAA compliance requires BAAs for cloud APIs, PHI encryption, access logging, and minimum necessary data principles.
- FDA regulation distinguishes between AI as a clinical tool (lighter oversight) and AI as an autonomous decision-maker (medical device regulation).
Medical foundation models are converging toward multimodal systems that combine text, imaging, and genomic data. Google's Med-Gemini (2024) processes clinical notes, radiology images, and pathology slides in a single model, enabling diagnostic reasoning across modalities. Research into federated learning for medical LLMs addresses the privacy challenge by training across hospital networks without centralizing patient data.
Meanwhile, AI-designed clinical trials (using LLMs to optimize trial protocols and patient selection) are reducing the cost and duration of drug development, with several pharmaceutical companies reporting 30 to 40% faster enrollment. For a broader survey of how transformers are being applied to non-text sequential data across genomics, time series, audio, and other domains, see Section 34.10: Beyond Text.
Exercises
Why do medical LLMs (Med-PaLM, BioGPT) need separate evaluation from general-purpose LLMs? What medical benchmarks exist, and what do they measure?
Answer Sketch
Medical LLMs need domain-specific evaluation because general benchmarks do not test clinical knowledge, reasoning with medical terminology, or safety in healthcare contexts. Key benchmarks: MedQA (medical licensing exam questions), PubMedQA (biomedical literature questions), MMLU medical subsets. They measure: factual medical knowledge, clinical reasoning, ability to handle uncertainty, and appropriate use of medical terminology.
Design a clinical NLP pipeline that extracts medications, diagnoses, and procedures from a clinical note. Specify the model choices, preprocessing steps, and output format.
Answer Sketch
Preprocessing: de-identify PHI (names, dates, MRNs). Use a medical NER model (scispaCy or a fine-tuned BioBERT) to extract entities. Classify entities into categories: medication (drug name, dose, frequency), diagnosis (ICD-10 code mapping), procedure (CPT code mapping). Output as FHIR-compatible JSON resources. Validate against a medical terminology service (UMLS, SNOMED CT).
Describe three critical safety considerations for deploying LLMs in healthcare settings. For each, explain the potential harm and the required safeguard.
Answer Sketch
(1) Hallucinated medical advice: the model generates plausible but incorrect treatment recommendations. Safeguard: never present LLM output as medical advice; always require physician review. (2) PHI leakage: patient data appears in model outputs. Safeguard: de-identify all inputs, use on-premise models, and audit outputs. (3) Bias in training data: model performs worse for underrepresented populations. Safeguard: evaluate performance across demographic groups and validate with diverse clinical data.
Write a function that uses an LLM to check for potential drug interactions given a list of medications. Cross-reference with a structured drug database and highlight discrepancies between the LLM's output and the database.
Answer Sketch
Input: list of medication names. Step 1: query the LLM for potential interactions between all pairs. Step 2: query a drug interaction database (e.g., DrugBank API or OpenFDA). Step 3: compare results. Flag cases where the LLM identifies an interaction the database misses (potential hallucination) and where the database identifies one the LLM misses (potential gap). The database is the authority; the LLM supplements.
Design a medical question-answering system that uses RAG to answer questions from clinical guidelines. What retrieval strategy should you use, and how should you handle questions where the guidelines are ambiguous?
Answer Sketch
Use a retrieval system over a curated index of clinical guidelines (e.g., UpToDate, WHO guidelines). Chunk by section with overlap. Retrieve top-k relevant sections using hybrid search (BM25 + embeddings). For ambiguous questions: present multiple guideline perspectives with citations, explicitly note where guidelines disagree or where evidence is limited, and recommend consulting a specialist. Never present a definitive answer for ambiguous clinical questions.
What Comes Next
In the next section, Section 28.4: LLM-Powered Recommendation & Search, we cover LLM-powered recommendation and search systems, which are reshaping how users discover information and products.
Bibliography
Singhal, K., Azizi, S., Tu, T., et al. (2023). "Large Language Models Encode Clinical Knowledge (Med-PaLM)." arXiv:2212.13138
Singhal, K., Tu, T., Gottweis, J., et al. (2023). "Towards Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2)." arXiv:2305.09617
Chen, Z., Hernandez-Cano, A., Romanou, A., et al. (2023). "Meditron-70B: Scaling Medical Pretraining for Large Language Models." arXiv:2311.16079
Labrak, Y., Bazoge, A., Morin, E., et al. (2024). "BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains." arXiv:2402.10373
Lin, Z., Akin, H., Rao, R., et al. (2023). "Evolutionary-scale Prediction of Atomic-Level Protein Structure with a Language Model (ESM-2)." Science, 379(6637)
Hayes, T., Rao, R., Akin, H., et al. (2024). "Simulating 500 Million Years of Evolution with a Language Model (ESM-3)." bioRxiv
Brixi, G., Durairaj, J., et al. (2025). "Genome Modeling and Design Across All Domains of Life with Evo 2." Nature
Bagal, V., Aggarwal, R., Vinod, P.K., et al. (2021). "MolGPT: Molecular Generation Using a Transformer-Decoder Model." J. Chem. Inf. Model.
