Section 7.4a: Multilingual Evaluation, Adaptation & Model Families

The first 80 percent of a multilingual model comes from data. The last 20 percent, the part that decides whether a Burmese teenager actually trusts your chatbot, comes from caring enough to evaluate it in Burmese.
Bert, Polyglot AI Agent

Big Picture

This section continues from Section 7.4, which covered multilingual pretraining objectives, the encoder lineage from mBERT to XLM-R, low-resource language challenges, and cultural bias. Here we turn from the descriptive (how multilingual models work and where they fail) to the prescriptive: how to measure multilingual capability rigorously, how to adapt a base model to a new target language with the three-stage pipeline (vocabulary extension, continued pretraining, cross-lingual instruction tuning), and which production-grade multilingual model families are worth shortlisting in 2026.

Prerequisites

This section continues from Section 7.4. Familiarity with the curse of multilinguality, tokenization fertility, and the cultural-bias discussion is assumed.

Fun Fact: The Race to Forget

Open-weights models like Llama 3, Mixtral, and Qwen 2.5 caught up to closed frontier models so quickly that several Big Lab roadmaps quietly stopped trumpeting raw capability gains and started emphasizing safety, agents, and tools instead. The community sometimes calls this the 'Llama Effect': the day the open model gets within 5% of GPT-4 is the day the trillion-parameter race becomes a moat-defense exercise instead of a benchmark sprint.

7.4.4 Multilingual Evaluation Benchmarks

Evaluating multilingual LLMs requires benchmarks that span multiple languages and tasks. Several important benchmarks have emerged:

Table 7.4.1: Multilingual Evaluation Benchmarks Comparison (as of 2026).

Benchmark	Languages	Tasks	Key Feature
XTREME	40	Classification, QA, retrieval, structured prediction	Broad task coverage; includes low-resource languages
XTREME-R	50	Extended XTREME with retrieval tasks	Adds cross-lingual retrieval; harder evaluation
MEGA	70+	NLU, generation, reasoning	Specifically designed for generative LLMs
FLORES-200	200	Machine translation	Covers 200 languages with professional translations
Belebele	122	Reading comprehension	Parallel QA across 122 languages; isolates language vs. knowledge
Global MMLU	42	Multitask knowledge	Culturally adapted MMLU; not just translated

Multilingual QA performance gap: bar chart of representative high, medium, and low-resource languages with English as the baseline; the lowest-resource language trails English by 51 percentage points

Figure 7.4.4b: Illustrative performance gap on multilingual QA benchmarks. Low-resource languages can trail English by 40+ percentage points on the same task.

7.4.5 Adapting English-Centric Models to New Languages

Given the dominance of English in LLM training data, a practical question arises: how can we adapt an existing English-centric model to serve a new target language well? Several techniques have proven effective.

7.4.5.1 Vocabulary Extension

The first bottleneck is often the tokenizer. A tokenizer trained primarily on English will fragment text in other scripts into many small, semantically meaningless tokens. Vocabulary extension adds new tokens specific to the target language:

# Vocabulary extension for a new language
from transformers import AutoTokenizer, AutoModelForCausalLM
from tokenizers import trainers, Tokenizer, models
# Load base English-centric model
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Train a new tokenizer on target language data (e.g., Hindi)
# target_corpus = load_dataset(...) # large Hindi text corpus
# new_vocab = train_new_vocab(target_corpus, vocab_size=20000)
# Identify tokens in new vocab not in base vocab
# to_add = [t for t in new_vocab if t not in base_tokenizer.vocab]
# base_tokenizer.add_tokens(to_add)
# Resize model embeddings to accommodate new tokens
# base_model.resize_token_embeddings(len(base_tokenizer))
# Initialize new embeddings (e.g., as mean of related existing embeddings)
# or with small random values

Code Fragment 7.4.5a: Vocabulary extension for a new language.

7.4.5.2 Continued Pretraining

After vocabulary extension, the model needs exposure to text in the target language. Continued pretraining (also called domain-adaptive pretraining or language-adaptive pretraining) trains the model on a large corpus of target language text, with several considerations:

Data mixing: Mix target language data (e.g., 70%) with English data (e.g., 30%) to prevent catastrophic forgetting of English capabilities
Learning rate: Use a smaller learning rate than initial pretraining (typically 10x to 100x smaller) to avoid disrupting existing knowledge
Duration: Continued pretraining typically lasts for 1B to 100B tokens, depending on the target language's distance from the base model's training data
Evaluation: Monitor both target language performance and English performance throughout training to detect catastrophic forgetting early

Warning: Catastrophic Forgetting

Continued pretraining on a new language can rapidly degrade English performance if not carefully managed. The model essentially "forgets" English knowledge as it adapts to the new language. Mixing in English data and using conservative learning rates help, but some forgetting is often unavoidable. For applications requiring strong multilingual performance, consider using a model that was multilingual from the start, rather than adapting an English-centric model.

7.4.5.3 Cross-Lingual Instruction Tuning

After language-adaptive pretraining, the model understands the target language but may not follow instructions well in it. Cross-lingual instruction tuning fine-tunes the model on instruction-following examples in the target language. These can be:

Translated: Machine-translate existing English instruction datasets (cheap but may introduce translation artifacts)
Native: Collect instruction-response pairs written natively in the target language (expensive but higher quality)
Mixed: Combine translated and native examples, with native examples for culturally sensitive topics

Key Insight

The full pipeline for language adaptation follows three stages: (1) vocabulary extension to improve tokenization efficiency, (2) continued pretraining on target language data to build language understanding, and (3) cross-lingual instruction tuning to enable instruction following. Each stage addresses a different aspect of the problem. Skipping vocabulary extension leads to high per-token costs; skipping continued pretraining leads to poor fluency; skipping instruction tuning leads to a model that understands the language but cannot follow user requests.

7.4.6 Multilingual Model Families

Several model families have been designed with multilingual capability as a core goal rather than an afterthought:

Table 7.4.2b: Multilingual Model Families Comparison (as of 2026).

Model	Languages	Strength	Approach
Qwen 2.5	29+	CJK languages, English	Large-scale multilingual pretraining with balanced sampling
Aya (Cohere)	101	Broad coverage, instruction-tuned	Community-sourced multilingual instruction data from native speakers
BLOOM	46+	African and Southeast Asian languages	Deliberate inclusion of underrepresented languages in pretraining
Gemma 3	35+	European and Asian languages	Curated multilingual data with quality filtering per language
Llama-3.1	8	Major world languages	High quality on supported languages; limited coverage

Note: The Aya Initiative

Cohere's Aya project is notable for its community-driven approach to multilingual AI. Rather than relying solely on web-scraped data or machine translation, Aya recruited native speakers from over 100 countries to create instruction-following examples in their own languages. This produces more natural, culturally appropriate training data than translation-based approaches. The resulting Aya 101 model covers more languages than any other open instruction-tuned model, though it remains smaller than frontier models in raw capability.

Key Insight: The Sapir-Whorf Hypothesis in Artificial Minds

The Sapir-Whorf hypothesis in linguistics proposes that the structure of a language shapes the thoughts and perceptions of its speakers. LLMs provide an unexpected testing ground for this idea. When a model trained predominantly on English encounters questions about concepts that are linguistically encoded differently across languages (such as evidentiality in Turkish or absolute vs. relative spatial reference frames in Guugu Yimithirr), it defaults to English-encoded conceptualizations. This is not merely a bias problem; it suggests that the model's "worldview" is genuinely shaped by its training language distribution, much as Sapir and Whorf proposed for human cognition. The tokenization tax compounds this effect: languages that require more tokens per concept are literally given less computational attention per idea. Multilingual LLMs thus sit at the intersection of computational linguistics, philosophy of mind, and social justice, a combination that makes this one of the most consequential open problems in AI.

Research Frontier

Culturally-aware language models. Beyond multilingual capability, recent work focuses on cultural alignment: ensuring models behave appropriately across cultural contexts, not just linguistic ones. CulturalBench (2024) evaluates models on culture-specific knowledge across 45 countries. Meanwhile, research on "conceptual blind spots" reveals that models often lack basic knowledge about non-Western cultural practices, historical events, and social norms. Community-driven initiatives like Aya 23 and the Masakhane collective continue to close the data gap for African and other underrepresented languages. The integration of culturally-aware evaluation with safety frameworks (Section 37.3) remains an active and important area of development.

Key Takeaways

Multilingual benchmarks like XTREME, FLORES-200, Belebele, and Global MMLU enable systematic measurement of multilingual capabilities, revealing performance gaps of 40+ percentage points between high and low-resource languages.
Language adaptation follows a three-stage pipeline: vocabulary extension (tokenization efficiency), continued pretraining (language understanding), and cross-lingual instruction tuning (instruction following). Each stage is necessary for a complete solution.
Catastrophic forgetting is a real risk when adapting an English-centric model: conservative learning rates and English-data mixing reduce, but rarely eliminate, the loss of English capability.
Community-driven initiatives like Aya and Masakhane demonstrate that high-quality multilingual AI requires engagement with native speaker communities, not just machine translation of English resources.
Model family choice matters more than benchmark numbers. Qwen for CJK, Aya for broad coverage, BLOOM for African and Southeast Asian languages, Gemma for curated European/Asian quality, Llama for narrow but deep major-language coverage.

Self-Check

1. Why is tokenization efficiency a critical issue for multilingual LLMs, and what are its practical consequences?

Show Answer

Tokenizers trained primarily on English fragment text in other scripts (Thai, Khmer, Amharic, etc.) into many small tokens. A single word that takes 1 to 2 tokens in English may require 5 to 10 tokens in a low-resource language. The practical consequences are significant: API costs scale with token count (so users of these languages pay more), the effective context window shrinks proportionally (less information fits in the model's context), and generation speed decreases (more tokens must be generated). This "tokenization tax" compounds existing disadvantages for underserved languages.

2. Describe the three-stage pipeline for adapting an English-centric model to a new target language.

Show Answer

Stage 1, Vocabulary Extension: Train a new tokenizer on target language data, identify tokens not present in the base vocabulary, and add the most frequent new tokens. Resize the model's embedding matrix and initialize new embeddings. Stage 2, Continued Pretraining: Train the model on a mix of target language data (roughly 70%) and English data (roughly 30%) using a conservative learning rate to prevent catastrophic forgetting. Stage 3, Cross-lingual Instruction Tuning: Fine-tune on instruction-following examples in the target language, using a combination of translated and natively written examples. Each stage addresses a different limitation: tokenization efficiency, language understanding, and instruction following.

Exercises

Exercise 7.4a.1: Predict the Tokenizer Tax Predictive

The same paragraph in English tokenizes to 100 tokens with cl100k_base. Predict the token count for the same content in: (a) German; (b) Japanese (kanji-heavy); (c) Hindi (Devanagari script). What is the operational implication for non-English users?

Answer Sketch

(a) German: ~120-140 tokens. Long compound words split into more subwords than English. (b) Japanese: ~150-200 tokens. CJK characters often get one or two BPE tokens per character, much worse than English's 1 token per ~4 characters. (c) Hindi: ~250-400 tokens. Devanagari coverage in English-centric tokenizers is poor; many characters fall back to byte-level UTF-8 encoding (3 bytes/char). Implication: per-message API costs are 2-4x higher for non-English users, and the effective context window shrinks proportionally. Fixes include language-specific tokenizers (e.g., Sarvam for Indic) or models trained with rebalanced tokenizer training data.

Exercise 7.4a.2: Add Language Detection Routing Code Tweak

Sketch a 6-line dispatcher that detects the language of an incoming message and routes to a language-specific model when one is available, else falls back to GPT-4o. Use fasttext or langdetect for the detection. What edge case must you handle?

Answer Sketch

from langdetect import detect
specialists = {"hi": "sarvam-2b", "ja": "rakuten-7b", "ar": "jais-13b"}
def route(msg):
  try: lang = detect(msg)
  except: lang = "en"
  return call_model(specialists.get(lang, "gpt-4o"), msg)

Code Fragment 7.4a.1: Sketch a 6-line dispatcher that detects the language of an incoming message and routes to a language-specific model when one is available.

The edge case: code-mixed messages (e.g., Hinglish: Hindi sentences with English nouns embedded). langdetect picks one label, which routes to the wrong specialist and produces poor output. Mitigation: detect language confidence; if low (<0.7), default to the multilingual frontier model. For known code-switched user bases, train or fine-tune specifically on the mixed register.

What's Next?

In the next chapter, Section 8.1: Trading FLOPs for IQ: The Test-Time Compute Bet, we turn to inference optimization, the engineering techniques that make LLM deployment fast and cost-effective.

Further Reading

Evaluation & Analysis

Ahuja, K. et al. (2023). "MEGA: Multilingual Evaluation of Generative AI." EMNLP 2023. Comprehensive benchmark evaluating generative AI across languages, revealing significant performance gaps between high-resource and low-resource languages. Essential for understanding where current multilingual models fall short.

Wendler, C. et al. (2024). "Do Llamas Work in English? On the Latent Language of Multilingual Transformers." ACL 2024. Investigates whether multilingual transformers process non-English inputs through English-centric internal representations. Reveals that models often "think in English" even when generating other languages, with implications for cross-lingual transfer and bias.

Community-Driven Multilingual Models

Ustun, A. et al. (2024). "Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model." ACL 2024. Describes the Aya initiative, where contributors from 119 countries created instruction data for 101 languages. Demonstrates how community-driven data collection can address the resource gap for underrepresented languages in LLM training.

Aryabumi, V. et al. (2024). "Aya 23: Open Weight Releases to Further Multilingual Progress." arXiv preprint 2405.15032. Aya 23 (Cohere): a 23-language open-weight family at 8B and 35B parameters that pairs strong multilingual base pretraining with high-quality native instruction data per language; demonstrates that targeted depth often beats raw language count for end-user quality.

Workshop, B. et al. (2023). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint arXiv:2211.05100. Documents the collaborative effort to build a 176B-parameter multilingual model covering 46 languages. Notable for its transparent governance model, data documentation, and analysis of multilingual tokenizer design trade-offs.