"First, do no harm. Second, do no hallucination. The stakes here are measured in lives, not loss functions."
Deploy, Hippocratic AI Agent
Healthcare represents both the highest-stakes and highest-potential domain for LLM applications. Medical LLMs can assist with clinical documentation, diagnostic reasoning, patient communication, drug discovery, and literature synthesis. However, the consequences of errors are severe: incorrect medical information can directly harm patients. This creates a unique tension between the transformative potential of AI in healthcare and the stringent safety, privacy, and regulatory requirements that govern medical practice. The fine-tuning techniques from Chapter 14 are frequently used to adapt general models to medical domains.
Prerequisites
This section assumes understanding of LLM application patterns from Section 28.1 and prompt engineering from Section 11.1. Familiarity with RAG from Section 20.1 provides context for the retrieval-augmented application patterns covered here.
1. Medical LLMs
General-purpose LLMs perform surprisingly well on medical benchmarks. GPT-4 answered roughly 90% of United States Medical Licensing Examination (USMLE) style questions correctly, far above the passing threshold.
GPT-4 scored higher on the USMLE than most medical students on their first attempt. It also cannot take your blood pressure, which keeps human doctors employed for now.
However, medical LLMs fine-tuned on clinical data offer advantages in understanding medical terminology, following clinical reasoning patterns, and generating responses appropriate for healthcare contexts.
| Model | Base | Training Focus | Notable Result |
|---|---|---|---|
| Med-PaLM 2 | PaLM 2 | Medical QA, clinical reasoning | 86.5% on MedQA (expert level) |
| PMC-LLaMA | LLaMA | PubMed Central papers | Open-source biomedical LLM |
| BioMistral | Mistral | Biomedical literature | Strong on clinical NLP tasks |
| Meditron | LLaMA 2 | Medical guidelines, PubMed | Clinical guideline adherence |
High benchmark scores on medical exams do not translate directly to safe clinical deployment. GPT-4 scoring 90% on the USMLE is impressive but misleading as a measure of clinical readiness. The USMLE tests factual recall and reasoning on well-defined multiple-choice questions. Real clinical practice involves ambiguous symptoms, incomplete patient histories, time pressure, emotional context, and the need to know when to say "I do not know." The gap between benchmark performance and clinical safety is where the real engineering challenge lies.

Production medical AI systems require: RAG over verified medical databases (not training data recall) as covered in Chapter 20, mandatory clinician review for all patient-facing outputs, comprehensive audit trails, and explicit uncertainty quantification. The healthcare agent patterns from Section 25.5 provide architectural guidance for building these safeguards into the system.
When evaluating medical LLMs, never rely solely on benchmark accuracy. Run a "harm audit" with your clinical team: sample 100 outputs on real patient scenarios and flag any response that could cause harm if a busy clinician acted on it without double-checking. A model with 90% accuracy and 1% harmful-error rate is far more dangerous than a model with 85% accuracy and 0.1% harmful-error rate. The denominator that matters in healthcare is not "questions answered correctly" but "patients not harmed."
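The arithmetic behind that comparison is worth making explicit. The sketch below uses the two hypothetical models from the text and computes expected harmful outputs per 10,000 queries; the figures are illustrative, not measurements of any real system.

```python
# Expected harmful outputs per 10,000 queries for two hypothetical models.
# Model A: higher accuracy, but a 10x higher harmful-error rate than model B.
QUERIES = 10_000

models = {
    "A (90% acc, 1.0% harmful)": {"accuracy": 0.90, "harmful_error_rate": 0.010},
    "B (85% acc, 0.1% harmful)": {"accuracy": 0.85, "harmful_error_rate": 0.001},
}

for name, m in models.items():
    expected_harms = QUERIES * m["harmful_error_rate"]
    print(f"{name}: ~{expected_harms:.0f} potentially harmful outputs "
          f"per {QUERIES} queries")
```

Model A produces roughly 100 potentially harmful outputs per 10,000 queries versus model B's 10, despite A's higher headline accuracy: the harm-audit denominator, not benchmark accuracy, is what should drive the deployment decision.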
2. Clinical NLP Applications
Clinical NLP processes the vast amount of unstructured text in electronic health records (EHRs). Progress notes, discharge summaries, radiology reports, and pathology findings contain critical clinical information that is difficult to query or analyze in text form. The NLP text representation techniques from Chapter 01 provide the foundations for understanding how these models process clinical language. LLMs can extract structured data from these notes, identify patients matching clinical trial criteria, detect adverse drug events, and summarize patient histories. Figure 28.3.1 shows the clinical NLP pipeline for processing EHR text. The code fragment below puts this into practice.
# Clinical named-entity recognition (NER) on an EHR note
# Key operations: entity extraction, results display
from transformers import pipeline

# Clinical NER using a biomedical model
clinical_ner = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple",
)

clinical_note = """Patient presents with persistent cough and shortness of breath
for 2 weeks. History of Type 2 diabetes managed with metformin 500mg.
Chest X-ray shows bilateral infiltrates. Started on azithromycin
and referred for pulmonary function testing."""

entities = clinical_ner(clinical_note)
for ent in entities:
    print(f"{ent['entity_group']:>20}: {ent['word']} ({ent['score']:.3f})")
3. Medical Question Answering
This snippet queries a general chat model with a safety-focused system prompt. Note that it relies on the model's parametric knowledge; a production system would ground the same prompt with retrieved guideline text (RAG) before generating an answer.
# Medical QA with safety guardrails in the system prompt
# Key operations: API interaction, results display
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a medical information assistant for clinicians.
Provide evidence-based answers citing relevant guidelines and studies.
Always note the level of evidence. Flag when a question requires
specialist consultation. Never provide direct patient treatment
recommendations without specifying they need clinical validation."""},
        {"role": "user", "content": """What are the current first-line treatments for
newly diagnosed Type 2 diabetes in adults with HbA1c between 7-8%?"""},
    ],
)
print(response.choices[0].message.content)
4. Drug Discovery and Molecular Generation
LLMs trained on chemical and molecular data can generate novel drug candidates, predict molecular properties, and optimize lead compounds. These models treat molecules as sequences using SMILES (Simplified Molecular Input Line Entry System) notation and apply the same autoregressive generation techniques used for text. For example, aspirin is represented as CC(=O)Oc1ccccc1C(=O)O; each character becomes a token in the model's vocabulary.
Key models in this space include ChemBERTa-2 (2022, DeepChem), a RoBERTa model pretrained on 77 million SMILES samples for property prediction; MolGPT (2021), which applies GPT-style next-token prediction for de novo molecule generation; and MoLFormer (IBM Research), trained on 1.1 billion SMILES sequences with rotary positional embeddings. Code Fragment 28.3.2 shows property prediction in practice.
# Molecular property prediction with a chemistry LLM
# ChemBERTa encodes SMILES strings. Caveat: the sequence-classification head
# on this base checkpoint is randomly initialized, so the printed
# probabilities are only meaningful after fine-tuning on a property dataset.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# SMILES representations of common drug molecules
molecules = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
}

for name, smiles in molecules.items():
    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.softmax(dim=-1)
    print(f"{name:12s}: {prediction[0].tolist()}")
A notable challenge with SMILES is that random token generation can produce syntactically invalid molecules. SELFIES (Self-Referencing Embedded Strings) is an alternative notation that guarantees every generated string decodes to a valid molecule, making it attractive for generative models. Libraries like RDKit handle SMILES parsing and molecular graph operations, while DeepChem provides pretrained chemistry models and featurization tools.
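The validity problem can be illustrated without any chemistry library. The toy checker below tests only two syntactic constraints, balanced parentheses and paired ring-closure digits; real chemical validity (valence, aromaticity, stereochemistry) requires a full parser such as RDKit's, and `plausible_smiles` is an illustrative name, not a standard API.

```python
from collections import Counter

def plausible_smiles(s: str) -> bool:
    """Toy syntactic check: balanced parentheses and paired ring-closure
    digits. Real chemical validity needs a parser like RDKit."""
    depth = 0
    ring_digits = Counter()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_digits[ch] += 1
    # Every ring-closure digit must open and close, i.e. appear an even
    # number of times, and all parentheses must be closed.
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

print(plausible_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: True
print(plausible_smiles("CC(=O)Oc1ccccc1C(=O"))    # truncated: False
```

Even this crude filter rejects many randomly generated strings, which is exactly the failure mode SELFIES eliminates by construction.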
Beyond property prediction, molecular LLMs enable retrosynthesis prediction (working backward from a target molecule to identify synthesis routes) and property-constrained generation (generating molecules that satisfy multiple pharmacological constraints simultaneously, such as binding affinity, solubility, and low toxicity). For a deeper treatment of tokenization strategies and cross-domain comparisons, see Section 34.10: Beyond Text.
5. Protein Structure and Genomics
5.1 Protein Language Models
Proteins are sequences of amino acids drawn from a 20-letter alphabet, making them one of the most natural non-text applications of language modeling. ESM-2 (2023, Meta FAIR) scaled masked language modeling on protein sequences to 15 billion parameters. The internal representations learned by ESM-2 encode 3D structural information so accurately that ESMFold can predict protein structure from a single sequence, approaching AlphaFold2 accuracy without requiring multiple sequence alignments.
ESM-3 (2024, EvolutionaryScale) extended the paradigm to 98 billion parameters with multimodal conditioning on sequence, structure, and function simultaneously. In a landmark result, ESM-3 designed a novel green fluorescent protein (esmGFP) with only about 58% sequence identity to the closest known natural fluorescent protein, demonstrating genuine protein design capability. AlphaFold 3 (2024, Google DeepMind) uses a diffusion-based architecture to predict 3D structures of proteins, nucleic acids, ligands, and their complexes.
# Protein embedding with ESM-2
# Each amino acid residue becomes one token (vocab ~33)
from transformers import AutoTokenizer, AutoModel
import torch

model_id = "facebook/esm2_t6_8M_UR50D"  # smallest ESM-2 for demo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Insulin B-chain (30 amino acid residues)
sequence = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings encode structural and functional information
embeddings = outputs.last_hidden_state
print(f"Sequence length: {len(sequence)} residues")
print(f"Embedding shape: {embeddings.shape}")  # (1, 32, 320) with CLS/EOS
print(f"Unique amino acids: {len(set(sequence))}/20 canonical")
5.2 Genomics: DNA Language Models
The genome is a sequence of four nucleotide bases (A, C, G, T), making DNA another natural fit for language modeling. DNABERT-2 (ICLR 2024) applies BPE tokenization to genomic sequences, learning merges directly from nucleotide data. Evo-2 (Nature, 2025, Arc Institute/NVIDIA) scaled to 40 billion parameters trained on over 9 trillion nucleotides with a 1-million base-pair context window, predicting variant pathogenicity and generating functional DNA sequences.
The Nucleotide Transformer (Nature Methods, 2024, InstaDeep/Google DeepMind) reached 2.5 billion parameters for multi-species genome modeling. These models learn regulatory grammar (promoters, enhancers, splice sites) directly from sequence data, enabling applications in variant effect prediction, gene expression modeling, and synthetic biology. For detailed tokenization strategies and model comparisons, see Section 34.10: Beyond Text.
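To make the tokenization question concrete: the original DNABERT split genomes into overlapping fixed-length k-mers, while DNABERT-2's BPE learns variable-length merges from data. A minimal k-mer tokenizer (an illustrative helper, not part of any library):

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers (DNABERT v1 style).
    With stride=1 adjacent tokens overlap by k-1 bases;
    stride=k yields non-overlapping tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

dna = "ACGTACGTGGCTA"
print(kmer_tokenize(dna, k=6))            # 8 overlapping 6-mers
print(kmer_tokenize(dna, k=6, stride=6))  # ['ACGTAC', 'GTGGCT']
```

Overlapping k-mers leak information between adjacent tokens during masked pretraining, one motivation for DNABERT-2's switch to learned BPE merges.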
Healthcare LLM applications must comply with HIPAA (Health Insurance Portability and Accountability Act), which governs the use of Protected Health Information (PHI). This means: no PHI in prompts sent to cloud APIs without a Business Associate Agreement (BAA), data must be encrypted in transit and at rest, access must be logged and auditable, and minimum necessary data should be used. For clinical decision support, FDA clearance may be required depending on the intended use. Software that provides diagnostic recommendations can be regulated as a medical device (software as a medical device, SaMD), subjecting it to FDA quality-system requirements such as 21 CFR Part 820.
The regulatory pathway for medical AI is becoming clearer but remains complex. The FDA's "predetermined change control plan" allows AI systems to be updated after approval if the update process was pre-specified. This is critical for LLM-based systems that benefit from continuous improvement. The key distinction is between "AI as tool" (clinician uses AI output as one input to their decision) and "AI as autonomous decision-maker" (AI directly determines treatment). Current regulations strongly favor the former, where the human clinician retains decision authority.
HIPAA-compliant LLM deployment for healthcare (2024/2025). Running LLMs on protected health information (PHI) requires careful infrastructure choices. For cloud APIs: Azure OpenAI offers HIPAA-eligible GPT-4o deployments with a signed BAA (Business Associate Agreement); AWS Bedrock provides HIPAA-eligible access to Claude and other models; Google Vertex AI offers Gemini with HIPAA compliance under their BAA. For on-premises deployment, Llama 3 (70B, 405B) and Mistral Large can run locally using vLLM or TGI on GPU servers within the hospital's data center, so no PHI leaves the network.

Key compliance patterns: (1) apply de-identification (remove the 18 HIPAA identifiers) before sending text to external APIs whenever possible; (2) use Microsoft Presidio or philter for automated PHI redaction; (3) maintain audit logs of every LLM interaction involving patient data; (4) implement role-based access controls so only authorized clinicians can query patient-specific information.

Separately, the MedPrompt technique (Microsoft, 2023) improves medical reasoning accuracy by combining chain-of-thought prompting with self-consistency, achieving near-specialist performance on medical QA benchmarks without fine-tuning.
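To illustrate pattern-based redaction (a toy sketch only: regexes catch structured identifiers like dates and record numbers but miss names and free-text identifiers, which is why production systems should use a maintained tool such as Presidio; the pattern set and `redact_phi` helper below are illustrative, not exhaustive):

```python
import re

# Toy PHI redaction covering only a few of the 18 HIPAA identifier types.
PHI_PATTERNS = {
    "DATE":  r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "MRN":   r"\bMRN[:\s]*\d{6,10}\b",
    "SSN":   r"\b\d{3}-\d{2}-\d{4}\b",
}

def redact_phi(text: str) -> str:
    """Replace matched identifiers with bracketed type labels."""
    for label, pattern in PHI_PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

note = "Seen on 03/14/2025, MRN: 00123456, callback 555-867-5309."
print(redact_phi(note))
# -> Seen on [DATE], [MRN], callback [PHONE].
```

Keeping the type label in the redacted text (rather than deleting the span) preserves enough context for the downstream LLM to reason about the note.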
Who: Clinical informatics team at a large academic medical center
Situation: The center ran 400+ active clinical trials but only 3% of eligible patients were enrolled, largely because matching patients to trials required manual review of eligibility criteria against patient records.
Problem: Trial eligibility criteria were written in complex medical prose with implicit requirements. Patient data was scattered across EHR notes, lab results, and imaging reports in unstructured text.
Dilemma: Rule-based matching systems achieved high precision but low recall (missing eligible patients due to rigid criteria parsing). A general-purpose LLM improved recall but occasionally matched patients who had disqualifying conditions buried deep in their records.
Decision: The team built a pipeline using a fine-tuned BioMistral model for eligibility criteria extraction, combined with a clinical NER model for patient record parsing, and a final LLM verification step.
How: BioMistral extracted structured inclusion/exclusion criteria from trial protocols. A clinical NER model extracted patient conditions, medications, and lab values from EHR notes. A verification LLM cross-checked matches against exclusion criteria, with all positive matches flagged for physician review before patient contact.
Result: Trial enrollment rates increased from 3% to 11% of eligible patients. The system identified 340 additional eligible patients per month. Physician review took 2 minutes per match (versus 45 minutes for manual chart review), and zero ineligible patients were contacted.
Lesson: Healthcare LLM applications require pipeline architectures with explicit verification steps and mandatory human review, because the cost of errors in medical contexts is fundamentally different from other domains.
When debugging a RAG application, log retrieved documents separately from the generated response. This lets you quickly determine whether a bad answer came from poor retrieval (wrong documents) or poor generation (right documents, wrong interpretation).
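A minimal sketch of that logging discipline, with a stand-in retriever and generator (both hypothetical placeholders for a real vector store and LLM call):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("rag")

def answer_with_rag(question: str, retriever, generator) -> str:
    docs = retriever(question)
    # Log retrieval and generation as separate structured events, so a bad
    # answer can be traced to the stage that produced it.
    log.info(json.dumps({"stage": "retrieval", "question": question,
                         "doc_ids": [d["id"] for d in docs]}))
    answer = generator(question, docs)
    log.info(json.dumps({"stage": "generation", "answer": answer}))
    return answer

# Stand-ins for demonstration only:
fake_retriever = lambda q: [{"id": "guideline-042",
                             "text": "Metformin is first-line."}]
fake_generator = lambda q, docs: f"Per {docs[0]['id']}: {docs[0]['text']}"
print(answer_with_rag("First-line therapy for T2DM?",
                      fake_retriever, fake_generator))
```

With the two events logged separately, a reviewer can check the `doc_ids` first: if the right guideline sections were retrieved but the answer is wrong, the fault lies in generation, not retrieval.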
- Medical LLMs (Med-PaLM 2, BioMistral, Meditron) achieve expert-level performance on medical QA benchmarks but require careful deployment with safety guardrails.
- Clinical NLP extracts structured data from EHR text, enabling clinical trial matching, adverse event detection, and patient history summarization.
- Drug discovery LLMs treat molecules as SMILES/SELFIES sequences, enabling property prediction, de novo generation, and retrosynthesis planning.
- Protein language models (ESM-2, ESM-3) learn structural and functional properties from amino acid sequences; ESM-3 demonstrated novel protein design.
- DNA language models (DNABERT-2, Evo-2, Nucleotide Transformer) learn genomic grammar for variant effect prediction, gene regulation, and synthetic biology.
- HIPAA compliance requires BAAs for cloud APIs, PHI encryption, access logging, and minimum necessary data principles.
- FDA regulation distinguishes between AI as a clinical tool (lighter oversight) and AI as an autonomous decision-maker (medical device regulation).
Medical foundation models are converging toward multimodal systems that combine text, imaging, and genomic data. Google's Med-Gemini (2024) processes clinical notes, radiology images, and pathology slides in a single model, enabling diagnostic reasoning across modalities. Research into federated learning for medical LLMs addresses the privacy challenge by training across hospital networks without centralizing patient data.
Meanwhile, AI-designed clinical trials (using LLMs to optimize trial protocols and patient selection) are reducing the cost and duration of drug development, with several pharmaceutical companies reporting 30 to 40% faster enrollment. For a broader survey of how transformers are being applied to non-text sequential data across genomics, time series, audio, and other domains, see Section 34.10: Beyond Text.
Exercises
Why do medical LLMs (Med-PaLM, BioGPT) need separate evaluation from general-purpose LLMs? What medical benchmarks exist, and what do they measure?
Answer Sketch
Medical LLMs need domain-specific evaluation because general benchmarks do not test clinical knowledge, reasoning with medical terminology, or safety in healthcare contexts. Key benchmarks: MedQA (medical licensing exam questions), PubMedQA (biomedical literature questions), MMLU medical subsets. They measure: factual medical knowledge, clinical reasoning, ability to handle uncertainty, and appropriate use of medical terminology.
Design a clinical NLP pipeline that extracts medications, diagnoses, and procedures from a clinical note. Specify the model choices, preprocessing steps, and output format.
Answer Sketch
Preprocessing: de-identify PHI (names, dates, MRNs). Use a medical NER model (scispaCy or a fine-tuned BioBERT) to extract entities. Classify entities into categories: medication (drug name, dose, frequency), diagnosis (ICD-10 code mapping), procedure (CPT code mapping). Output as FHIR-compatible JSON resources. Validate against a medical terminology service (UMLS, SNOMED CT).
Describe three critical safety considerations for deploying LLMs in healthcare settings. For each, explain the potential harm and the required safeguard.
Answer Sketch
(1) Hallucinated medical advice: the model generates plausible but incorrect treatment recommendations. Safeguard: never present LLM output as medical advice; always require physician review. (2) PHI leakage: patient data appears in model outputs. Safeguard: de-identify all inputs, use on-premise models, and audit outputs. (3) Bias in training data: model performs worse for underrepresented populations. Safeguard: evaluate performance across demographic groups and validate with diverse clinical data.
Write a function that uses an LLM to check for potential drug interactions given a list of medications. Cross-reference with a structured drug database and highlight discrepancies between the LLM's output and the database.
Answer Sketch
Input: list of medication names. Step 1: query the LLM for potential interactions between all pairs. Step 2: query a drug interaction database (e.g., DrugBank API or OpenFDA). Step 3: compare results. Flag cases where the LLM identifies an interaction the database misses (potential hallucination) and where the database identifies one the LLM misses (potential gap). The database is the authority; the LLM supplements.
Design a medical question-answering system that uses RAG to answer questions from clinical guidelines. What retrieval strategy should you use, and how should you handle questions where the guidelines are ambiguous?
Answer Sketch
Use a retrieval system over a curated index of clinical guidelines (e.g., UpToDate, WHO guidelines). Chunk by section with overlap. Retrieve top-k relevant sections using hybrid search (BM25 + embeddings). For ambiguous questions: present multiple guideline perspectives with citations, explicitly note where guidelines disagree or where evidence is limited, and recommend consulting a specialist. Never present a definitive answer for ambiguous clinical questions.
What Comes Next
In the next section, Section 28.4: LLM-Powered Recommendation & Search, we cover LLM-powered recommendation and search systems, which are reshaping how users discover information and products.
Bibliography
Singhal, K., Azizi, S., Tu, T., et al. (2023). "Large Language Models Encode Clinical Knowledge (Med-PaLM)." arXiv:2212.13138
Singhal, K., Tu, T., Gottweis, J., et al. (2023). "Towards Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2)." arXiv:2305.09617
Chen, Z., Hernandez-Cano, A., Romanou, A., et al. (2023). "Meditron-70B: Scaling Medical Pretraining for Large Language Models." arXiv:2311.16079
Labrak, Y., Bazoge, A., Morin, E., et al. (2024). "BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains." arXiv:2402.10373
Lin, Z., Akin, H., Rao, R., et al. (2023). "Evolutionary-scale Prediction of Atomic-Level Protein Structure with a Language Model (ESM-2)." Science, 379(6637)
Hayes, T., Rao, R., Akin, H., et al. (2024). "Simulating 500 Million Years of Evolution with a Language Model (ESM-3)." bioRxiv
Brixi, G., Durairaj, J., et al. (2025). "Genome Modeling and Design Across All Domains of Life with Evo 2." Nature
Bagal, V., Aggarwal, R., Vinod, P.K., et al. (2021). "MolGPT: Molecular Generation Using a Transformer-Decoder Model." J. Chem. Inf. Model.
