Every vertical LLM application (healthcare, finance, legal, education) needs a small layer of glue between the model and the industry's existing data formats, standards, and domain corpora. This section catalogs that glue layer for each major vertical: FHIR clients and biomedical NLP toolkits in healthcare, the SEC EDGAR corpus and FinBERT family in finance, CourtListener and Legal-BERT in law, and learning-platform integrations in education. These are not LLM libraries themselves; they are the connector libraries and Software Development Kits (SDKs) that let an LLM application talk to the systems and corpora the industry already runs on. Pick from this list when you need to read electronic health records, ingest 10-K filings, retrieve case law, or wire into an LMS without writing the parser yourself.
Each industry has its own connector / SDK (Software Development Kit, a packaged set of code, samples, and documentation that lets you call a service from your application) ecosystem. The libraries below are the most commonly used integration points for building LLM applications in each vertical.
74.2.1 Healthcare libraries
- FHIR standard (Fast Healthcare Interoperability Resources, HL7, 2011+) + smart-on-fhir client is the standard for EHR (Electronic Health Record) data exchange, replacing the older HL7 v2 messages with a REST + JSON model. Its objective is to provide a single API contract every EHR vendor implements, which matters because before FHIR you had to write a custom adapter for every hospital. The core concept is resource-based: Patient, Encounter, Observation, etc., each with a stable schema. Pick FHIR as the default for any EHR integration in 2026; most hospitals offer FHIR endpoints.
- spaCy 3.7+ (Explosion AI, 2015) is the foundational Python NLP (Natural Language Processing) library with first-class transformers integration, plus the medical and legal scispaCy / scispaCy-large variants. Its objective is to be a production-grade NLP pipeline (tokenization, NER (Named Entity Recognition, the task of finding mentions of people, places, drugs, etc. in text), dependency parsing) at sub-millisecond speeds, which matters for high-volume document processing. Pick spaCy for general-purpose NLP pipelines; pair with scispaCy or medspaCy for medical-text processing.
- scispaCy (AllenAI, 2019) is the biomedical-NLP spaCy model trained on PubMed abstracts, with NER for genes, diseases, chemicals, and species. Its objective is to provide biomedical-text-aware tokenization and entity recognition out of the box, which matters because general spaCy models miss biomedical entities. Pick scispaCy for PubMed-style scientific-medical text; for clinical notes, MedSpaCy is more clinical-focused.
- MedSpaCy (community, 2020) and BioMedICUS (U Minnesota, 2017) are additional medical NLP libraries focused on clinical notes (negation, hypothetical, family-history detection). Their objective is to handle the messy realities of clinical-note language (abbreviations, dictation artifacts, mixed structured/unstructured), which matters when scispaCy alone misses these patterns. Pick MedSpaCy for clinical-note processing.
- BiomedBERT (Microsoft, 2020-2024) is the BERT family pretrained on biomedical text, the canonical biomedical embedding model. Its objective is to provide biomedical-text-tuned representations for downstream classification, NER, or retrieval. Pick BiomedBERT for biomedical classification and embedding tasks where general models underperform.
- PyHealth, FHIR.resources, and fhirpy are the Python FHIR client and modeling libraries. PyHealth focuses on healthcare ML workflows (cohort building, predictions); FHIR.resources and fhirpy are lower-level FHIR clients. Pick PyHealth for ML pipelines, fhirpy for raw FHIR API access.
74.2.2 Finance libraries
- SEC EDGAR (SEC, 1984) + sec-edgar-api is the SEC's electronic filings database covering every public US filing, plus the Python client for programmatic access. Its objective is to provide the foundational corpus for fundamental-analysis AI (10-Ks, 10-Qs, S-1s, proxy statements), which matters because EDGAR is free, comprehensive, and the source of truth. Pick EDGAR as the default source for US public-company filings.
- FinBERT (Yang et al., 2020) is the BERT model pretrained on financial text for sentiment classification. Its objective is to provide financial-sentiment-aware embeddings, which matters because general sentiment models mis-classify financial language (a "decline in earnings" is not the same emotional valence as "decline in violence"). Pick FinBERT for financial sentiment baselines; for richer analysis, frontier LLMs typically outperform.
- FinGPT (AI4Finance, 2023) is the open finance LLM family, fine-tuned on financial datasets. Its objective is to provide an open finance-specific LLM as an alternative to BloombergGPT, which matters when proprietary financial LLMs are not accessible. Pick FinGPT for research and open-source projects; for production quality, frontier general LLMs with finance prompts typically win.
- openbb (OpenBB, 2020) is the open-source investment research platform, integrating dozens of data sources behind a unified Python API. Its objective is to be the open-source "Bloomberg terminal" for retail and quant traders, which matters when professional terminal subscriptions are out of reach. Pick openbb for open-source quantitative research.
- yfinance (Ran Aroussi, 2017) is the unofficial Yahoo Finance Python client. Its objective is to provide free historical price and fundamental data for prototypes, which matters when production market-data subscriptions are not justified for early experimentation. Pick yfinance for prototypes and tutorials; for production, Yahoo's terms forbid commercial use and polygon or databento are the right answer.
- polygon (Polygon, 2017) and databento (Databento, 2022) are production-grade market data APIs covering equities, options, and crypto with proper licensing. Their objective is to provide commercially-licensed market data with millisecond latency, which matters for production trading and analytics products. Pick polygon for retail-friendly pricing; databento for high-frequency professional use.
74.2.3 Legal libraries
- CourtListener (Free Law Project, 2010) + courtlistener API is the largest free repository of US case law and federal-court documents. Its objective is to make case law publicly accessible without LexisNexis or Westlaw subscriptions, which matters for academic and access-to-justice projects. Pick CourtListener as the open-data baseline for US case law; for professional research, commercial databases have better coverage and tooling.
- Legal-BERT (community, 2020) is the BERT family pretrained on legal text. Its objective is to provide legal-text-tuned embeddings for downstream classification and retrieval. Pick Legal-BERT for legal-domain classification baselines; for retrieval, modern embedding models (Voyage-law, BGE) typically outperform.
74.2.4 Education libraries
- LTI standard (IMS Global, 2010+) is the Learning Tools Interoperability standard, the canonical protocol for integrating external tools into LMS platforms. Its objective is to make a tool deployable to any LMS (Canvas, Moodle, Blackboard, Schoology) via a single integration, which matters because per-LMS custom integrations do not scale. Pick LTI for any tool that must work across multiple LMS platforms.
- Common Cartridge / IMS Global standards are the broader interoperability layer for educational content packaging. Its objective is to make educational content portable across LMS platforms, which matters for publishers and content creators. Pick Common Cartridge when distributing educational content across multiple LMS platforms.
- Canvas LMS API (Instructure, 2008+) is the Canvas LMS REST API, the most-used LMS API in US higher education. Its objective is to provide programmatic access to Canvas courses, assignments, and grades, which matters when building AI tutoring or grading products. Pick Canvas API for higher-education AI tooling in North America; for K-12 or other regions, the LMS landscape varies.
A 2024-25 trend in vertical AI worth flagging: open-source domain-specific LLMs (OpenMed, BioMedLM, MedFound for medicine; FinGPT for finance) are useful but rarely beat the frontier general models on the same vertical tasks. The standard recipe in 2026 is continued pretraining on a domain corpus plus a normal general post-training, not from-scratch domain models.
What's Next?
In the next section, Section 74.3: Datasets & Benchmarks, we build on the material covered here.