Part 4: Training and Adapting
Chapter 14: Fine-Tuning Fundamentals

Fine-Tuning for Representation Learning

"In embedding space, no one can hear you scream. But a well-trained contrastive model can tell the difference between a scream of joy and a scream of terror."

Finetune Finetune, Space-Screaming AI Agent
Big Picture

Embeddings are the backbone of modern search, retrieval, and recommendation systems. While off-the-shelf embedding models work well for general text, domain-specific applications often benefit enormously from fine-tuned embeddings that understand the nuances of your particular domain. A legal search engine needs embeddings that distinguish between subtly different contract clauses; a medical retrieval system needs embeddings that capture clinical relationships. This section covers why and how to fine-tune models for better representations. The embedding foundations introduced in Section 01.2 are the basis for the dense sentence embeddings that power modern retrieval systems.

Prerequisites

This section builds on fine-tuning fundamentals from Section 14.1: When and Why to Fine-Tune and data preparation covered in Section 14.2: Data Preparation for Fine-Tuning.

1. Why Fine-Tune for Representations?

Off-the-shelf embedding models like OpenAI's text-embedding-3 or open-source models like BGE and E5 are trained on broad web data. They produce good general-purpose embeddings (as explored in Chapter 19), but they may not capture the semantic distinctions that matter in your specific domain. Fine-tuning teaches the model which texts should be similar and which should be different according to your application's needs.

1.1 When Off-the-Shelf Falls Short

General embedding models struggle in several common scenarios. Domain-specific vocabulary (medical terms, legal jargon, internal company terminology) may be poorly represented. The notion of "similarity" may differ from the general case: in a customer support system, two tickets describing the same bug should be similar even if they use completely different language. In a patent search system, documents covering the same invention should cluster together despite varying levels of technical detail.

Real-World Scenario

When Generic Embeddings Failed a Legal Search Engine

Who: A senior NLP engineer at a legal technology startup building a contract clause search tool.

Situation: They used OpenAI's text-embedding-3-large for semantic search across 2 million contract clauses.

Problem: Users searching for "indemnification" clauses were getting "limitation of liability" clauses as top results. Both involve financial risk, but lawyers consider them fundamentally different.

Dilemma: They could add keyword filters (fast but brittle) or fine-tune an embedding model (slower to build but more robust). A third option was re-ranking with a cross-encoder, which was accurate but added 200ms latency per query.

Decision: They fine-tuned a Sentence Transformers model using 5,000 lawyer-annotated clause pairs indicating "same type" or "different type."

Result: Precision@10 for clause type matching improved from 67% to 91%. The fine-tuned model ran at the same speed as the generic one, unlike the cross-encoder approach.

Lesson: When your domain has specialized similarity semantics that differ from general English, fine-tuned embeddings deliver outsized returns compared to any prompt-based workaround.

Key Insight

Mental Model: The Specialized Translator. Think of fine-tuning for representations as training a translator to specialize in legal documents. The base model already speaks the language (general text understanding), but its embeddings treat medical text and legal text with the same generic understanding. Fine-tuning reshapes the embedding space so that semantically similar documents in your domain cluster together while dissimilar ones push apart. The result is a model whose internal representations are calibrated to the distinctions that matter for your specific task.

Scenario Comparison
Scenario                      Off-the-Shelf Performance   After Fine-Tuning   Improvement
General web search            NDCG@10: 0.52               Not needed          N/A
Medical literature retrieval  NDCG@10: 0.38               NDCG@10: 0.56       +47%
Legal clause matching         NDCG@10: 0.31               NDCG@10: 0.54       +74%
Internal docs search          NDCG@10: 0.42               NDCG@10: 0.61       +45%
Customer support dedup        F1: 0.65                    F1: 0.84            +29%
Key Insight

The more specialized your domain, the more you benefit from fine-tuning. If your domain vocabulary overlaps heavily with general web text (e.g., product reviews, news articles), off-the-shelf embeddings will work reasonably well. But if your domain has specialized terminology, unusual notions of similarity, or if retrieval precision is critical to your application, fine-tuning can yield 30% to 70% improvements.

Why this matters: Fine-tuned embeddings are the foundation of production retrieval systems and RAG pipelines. Off-the-shelf embedding models capture general semantic similarity, but they often fail on domain-specific concepts where terms have specialized meanings (for example, "discharge" means very different things in medical, military, and electrical contexts). Fine-tuning an embedding model on your domain data can improve retrieval precision by 10% to 30%, which cascades into significantly better RAG answers.

Common Mistake: Fine-Tuning Embeddings Without Re-Indexing

When you fine-tune an embedding model, the entire vector space changes. All previously computed embeddings become incompatible with the new model. If you have a vector database with millions of documents indexed using the old model, you must re-embed and re-index the entire collection after fine-tuning. Forgetting this step produces silently degraded search quality because queries (embedded with the new model) are being compared against documents (embedded with the old model) in misaligned vector spaces. Budget for re-indexing time and compute cost when planning embedding fine-tuning projects.
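The re-embed-and-re-index pass can be sketched as a simple batched loop. This is an illustrative sketch, not a specific database's API: `encode_batch` and `upsert` are hypothetical stand-ins for your embedding model and vector store client.

```python
# Sketch of a batched re-indexing pass after embedding fine-tuning.
# `encode_batch` and `upsert` are hypothetical callables standing in for
# your fine-tuned model and your vector database client.
from typing import Callable, Iterable, List, Tuple

def reindex_corpus(
    documents: Iterable[Tuple[str, str]],  # (doc_id, text) pairs
    encode_batch: Callable[[List[str]], List[List[float]]],
    upsert: Callable[[List[str], List[List[float]]], None],
    batch_size: int = 256,
) -> int:
    """Re-embed every document with the new model and upsert the vectors."""
    count, ids, texts = 0, [], []
    for doc_id, text in documents:
        ids.append(doc_id)
        texts.append(text)
        if len(texts) == batch_size:
            upsert(ids, encode_batch(texts))
            count += len(texts)
            ids, texts = [], []
    if texts:  # flush the final partial batch
        upsert(ids, encode_batch(texts))
        count += len(texts)
    return count
```

In production you would run this as a blue-green job: write new vectors to a fresh index, then switch query traffic only once the whole corpus is re-embedded, so queries are never compared against stale vectors.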

2. Encoder-Only vs. Decoder-Only for Embeddings

Historically, encoder-only models (BERT, RoBERTa) dominated the embedding space because their bidirectional architecture naturally produces rich token representations. Decoder-only models (GPT, Llama) are autoregressive and were not originally designed for embeddings. However, recent work has shown that decoder-only models can produce competitive embeddings with the right training approach. Figure 14.5.1 compares the two embedding extraction strategies.

[Figure: Encoder-only models (BERT, BGE) apply bidirectional attention (all tokens attend to all other tokens) and use the [CLS] token as the embedding, a native fit for embeddings at 100M to 400M parameters. Decoder-only models (Llama, Mistral) apply causal attention (each token attends only to previous tokens) and use the last token as the embedding, requiring adaptation at 1B to 70B parameters.]
Figure 14.5.1: Encoder models use [CLS] or mean pooling; decoder models typically use the last token as the sentence embedding
Aspect Comparison
Aspect                 Encoder-Only                      Decoder-Only
Architecture           Bidirectional (BERT, RoBERTa)     Causal/autoregressive (Llama, Mistral)
Pooling strategy       [CLS] token or mean pooling       Last token or mean pooling
Typical model size     100M to 400M parameters           1B to 70B parameters
Embedding dimension    768 to 1024                       2048 to 8192
Max sequence length    512 tokens (typical)              4K to 128K tokens
Inference speed        Fast (small model)                Slower (large model)
Quality for retrieval  Excellent with fine-tuning        Competitive with fine-tuning
Best for               High-throughput retrieval         When you already have the model deployed
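The pooling strategies above can be made concrete with a short sketch over raw per-token hidden states. This is an illustrative simplification: numpy arrays stand in for framework tensors, and the function names are ours, not a library API.

```python
# Minimal sketches of the three pooling strategies, operating on
# hidden states of shape (batch, seq_len, dim). Numpy stands in for
# framework tensors; names are illustrative.
import numpy as np

def cls_pool(hidden: np.ndarray) -> np.ndarray:
    """Encoder-style: take the [CLS] (first) token's hidden state."""
    return hidden[:, 0, :]

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token states, ignoring padding via the attention mask."""
    m = mask[:, :, None].astype(hidden.dtype)  # (batch, seq, 1)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

def last_token_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Decoder-style: take the last non-padding token's hidden state."""
    last_idx = mask.sum(axis=1) - 1  # index of the final real token
    return hidden[np.arange(hidden.shape[0]), last_idx, :]

batch, seq, dim = 2, 4, 8
hidden = np.random.randn(batch, seq, dim)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])  # second sequence is padded
print(cls_pool(hidden).shape, mean_pool(hidden, mask).shape,
      last_token_pool(hidden, mask).shape)  # → (2, 8) (2, 8) (2, 8)
```

Note that correct masking matters for both mean and last-token pooling: averaging over padding tokens, or taking the literal final position of a padded sequence, silently degrades embedding quality.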

3. Contrastive Learning for Embeddings

The standard approach for fine-tuning embedding models is contrastive learning. The core idea is simple: train the model so that embeddings of semantically similar texts are close together, while embeddings of dissimilar texts are far apart. Training pairs can be constructed manually or generated using synthetic data techniques, and the model is optimized with a contrastive loss function. Code Fragment 14.5.2 shows this approach in practice.
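The in-batch contrastive objective (the idea behind MultipleNegativesRankingLoss, used later in this section) can be sketched in a few lines of numpy. This is an illustrative simplification of the math, not the library's implementation.

```python
# Sketch of the in-batch contrastive objective: each anchor is scored
# against every positive in the batch and trained to rank its own
# positive (the diagonal of the similarity matrix) highest.
import numpy as np

def in_batch_contrastive_loss(anchors: np.ndarray,
                              positives: np.ndarray,
                              scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities; targets are the diagonal."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                    # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())      # NLL of the correct positives

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
loss_aligned = in_batch_contrastive_loss(anchors, anchors)          # perfect pairs
loss_random = in_batch_contrastive_loss(anchors, rng.normal(size=(4, 8)))
print(loss_aligned, loss_random)
```

With a batch of size B, every anchor sees B - 1 implicit negatives for free, which is why larger batches generally improve contrastive training.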

3.1 Training Data: Pairs and Triplets

Code Fragment 14.5.1 shows how to represent pairs and triplets for contrastive training.

# Preparing contrastive training data
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ContrastivePair:
    """A pair of texts with a similarity label."""
    anchor: str                     # The query or reference text
    positive: str                   # Text that should be similar to anchor
    negative: Optional[str] = None  # Text that should be different (for triplet loss)
    score: float = 1.0              # Similarity score (0 to 1) for soft labels

# Example: medical retrieval training data
medical_pairs = [
    ContrastivePair(
        anchor="What are the symptoms of type 2 diabetes?",
        positive="Type 2 diabetes symptoms include increased thirst, frequent "
                 "urination, blurred vision, fatigue, and slow wound healing.",
        negative="Type 1 diabetes is an autoimmune condition where the immune "
                 "system attacks insulin-producing beta cells in the pancreas.",
    ),
    ContrastivePair(
        anchor="Treatment options for hypertension",
        positive="First-line treatments for high blood pressure include ACE "
                 "inhibitors, ARBs, calcium channel blockers, and thiazide "
                 "diuretics, often combined with lifestyle modifications.",
        negative="Hypotension, or low blood pressure, is typically treated by "
                 "increasing fluid intake and wearing compression stockings.",
    ),
]

# Convert to the format expected by Sentence Transformers
def pairs_to_dataset(pairs: List[ContrastivePair]) -> dict:
    """Convert contrastive pairs to training format."""
    anchors = [p.anchor for p in pairs]
    positives = [p.positive for p in pairs]
    negatives = [p.negative for p in pairs if p.negative is not None]

    # Use the triplet format only when every pair carries a negative;
    # otherwise the three columns would fall out of alignment.
    if len(negatives) == len(pairs):
        return {
            "anchor": anchors,
            "positive": positives,
            "negative": negatives,
        }
    return {
        "anchor": anchors,
        "positive": positives,
    }
Code Fragment 14.5.1: Preparing contrastive training data
Library Shortcut: Fine-Tuning Embeddings with sentence-transformers

The sentence-transformers library (pip install sentence-transformers) provides the most streamlined way to fine-tune embedding models with contrastive learning. It handles batching, loss computation, and in-batch negative mining automatically.

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

train_examples = [
    InputExample(texts=[pair.anchor, pair.positive, pair.negative])
    for pair in medical_pairs
]
loader = DataLoader(train_examples, batch_size=32, shuffle=True)

# MultipleNegativesRankingLoss uses in-batch negatives
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    warmup_steps=100,
    output_path="./medical-embeddings",
)
Code Fragment 14.5.2: Fine-tuning a sentence embedding model with sentence-transformers. MultipleNegativesRankingLoss pulls each anchor toward its labeled positive while treating every other positive in the batch as a negative, so similar pairs end up with high cosine similarity and all other in-batch texts are pushed apart.

Code Fragment 14.5.3 provides a decision helper that weighs baseline performance gaps, corpus size, available training pairs, and reindexing costs to recommend whether embedding fine-tuning is worth the investment.

# Practical: deciding whether to fine-tune embeddings
def should_finetune_embeddings(
    baseline_ndcg: float,
    target_ndcg: float,
    corpus_size: int,
    num_training_pairs: int,
    reindex_cost_hours: float,
) -> dict:
    """Decision helper for embedding fine-tuning."""
    gap = target_ndcg - baseline_ndcg
    has_enough_data = num_training_pairs >= 1000
    gap_is_significant = gap > 0.05

    recommendation = "off-the-shelf"
    reasons = []

    if not gap_is_significant:
        reasons.append("Gap to target is small (<5%); fine-tuning unlikely to help")
    elif not has_enough_data:
        recommendation = "generate_data_first"
        reasons.append("Need at least 1,000 training pairs; consider generating "
                       "synthetic pairs with an LLM")
    else:
        expected_improvement = min(gap * 1.5, 0.25)  # Conservative estimate
        expected_ndcg = baseline_ndcg + expected_improvement
        if expected_ndcg >= target_ndcg:
            recommendation = "fine-tune"
            reasons.append(f"Expected NDCG after fine-tuning: ~{expected_ndcg:.2f}")
        else:
            recommendation = "fine-tune + improve retrieval pipeline"
            reasons.append("Fine-tuning alone may not close the gap; "
                           "consider hybrid retrieval (BM25 + dense)")

    reasons.append(f"Reindexing will take ~{reindex_cost_hours:.1f} hours "
                   f"for {corpus_size:,} documents")

    return {
        "recommendation": recommendation,
        "baseline": baseline_ndcg,
        "target": target_ndcg,
        "gap": gap,
        "reasons": reasons,
    }

result = should_finetune_embeddings(
    baseline_ndcg=0.38,
    target_ndcg=0.55,
    corpus_size=500_000,
    num_training_pairs=5_000,
    reindex_cost_hours=3.5,
)
for k, v in result.items():
    print(f"  {k}: {v}")
Code Fragment 14.5.3: A decision helper for whether embedding fine-tuning is worth the investment
Self-Check
Q1: Why do specialized domains benefit more from fine-tuned embeddings than general-purpose domains?
Show Answer
Off-the-shelf embedding models are trained on broad web data, which means they understand general-purpose notions of text similarity well. Specialized domains, however, have unique vocabulary, abbreviations, and semantic relationships that are underrepresented in the training data. For example, in medical text, "MI" means myocardial infarction, not a state abbreviation. Fine-tuning teaches the model these domain-specific semantic distinctions, leading to much larger improvements in specialized domains than in general ones.
Q2: What is the advantage of using MultipleNegativesRankingLoss over basic triplet loss?
Show Answer
MultipleNegativesRankingLoss treats all other examples in the batch as negatives, providing many more negative examples per training step. With a batch size of 32, each anchor gets 31 negatives instead of just 1. This creates a harder and more informative training signal, leading to better embeddings with fewer training examples. It also eliminates the need to explicitly mine hard negatives in your training data.
Q3: What pooling strategy is typically used for decoder-only models when computing sentence embeddings?
Show Answer
Decoder-only models typically use last-token pooling, where the hidden state of the final token (usually the EOS token) serves as the sentence embedding. Because of the causal attention mask, only the last token has "seen" all previous tokens in the sequence, making it the most informationally rich representation of the entire input. Some approaches also use mean pooling over all token representations, which can work but requires careful handling of the attention mask.
Q4: What practical cost must you account for when switching to fine-tuned embeddings in a production system?
Show Answer
You must recompute and reindex all embeddings in your vector database. Since the fine-tuned model produces different embeddings than the original model, all previously stored vectors become stale and incompatible with queries encoded by the new model. For large corpora (millions of documents), this can require hours of compute time and careful orchestration to avoid downtime. You need a reindexing pipeline that can handle this process, ideally with the ability to run incrementally or with blue-green deployment.
Q5: A team has 200 query-document pairs for a legal search system. Is this enough to fine-tune an embedding model?
Show Answer
200 pairs is generally not enough for effective embedding fine-tuning. The minimum recommended is approximately 1,000 pairs, with 5,000 to 10,000 being ideal. With only 200 pairs, the team should first generate synthetic training pairs using an LLM (e.g., prompting GPT-4 to create query-passage pairs from their legal corpus), then combine those with the 200 real pairs for fine-tuning. They should also benchmark off-the-shelf models first to confirm that fine-tuning is actually needed for their use case.
Tip: Save Checkpoints Frequently

Save a checkpoint at least every epoch, and keep the best 3 by validation loss. Fine-tuning runs are short but the optimal stopping point is hard to predict. Having checkpoints lets you recover the best model without re-running.
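The "keep the best 3" policy can be implemented as a tiny selection helper over recorded validation losses; the helper below is hypothetical, not a library API. (sentence-transformers' fit() also accepts checkpoint_path, checkpoint_save_steps, and checkpoint_save_total_limit arguments, though those retain the most recent checkpoints rather than the best.)

```python
# Hypothetical helper illustrating the checkpoint tip: given validation
# losses recorded per checkpoint, keep the k best checkpoint names.
def best_checkpoints(val_losses: dict, k: int = 3) -> list:
    """Return the k checkpoint names with the lowest validation loss."""
    return sorted(val_losses, key=val_losses.get)[:k]

losses = {"epoch-1": 0.42, "epoch-2": 0.31, "epoch-3": 0.35, "epoch-4": 0.39}
print(best_checkpoints(losses))  # → ['epoch-2', 'epoch-3', 'epoch-4']
```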

Key Takeaways
Real-World Scenario: Fine-Tuning Embeddings for Patent Prior Art Search

Who: A search engineering team at an intellectual property analytics firm building a prior art search engine over 8 million patent documents.

Situation: Their search system used off-the-shelf BGE-large embeddings for semantic search. Patent examiners reported that the system missed relevant prior art in 35% of test queries because general-purpose embeddings did not capture patent-specific similarity (e.g., that "fastening mechanism" and "bolt assembly" describe related inventions).

Problem: Patent language is highly specialized, with domain-specific synonyms, technical jargon, and a unique notion of "similarity" based on functional equivalence rather than lexical overlap. Off-the-shelf embeddings were trained on web text and did not capture these relationships.

Dilemma: They could use BM25 keyword search as a fallback (misses semantic matches), expand queries with an LLM (adds latency, expensive at scale), or fine-tune the embedding model on patent-specific similarity pairs.

Decision: They fine-tuned BGE-large using contrastive learning on 25,000 patent similarity pairs generated from patent citation networks (if patent A cites patent B, they are a positive pair) and examiner relevance judgments from their historical search logs.

How: They used MultipleNegativesRankingLoss with in-batch negatives, training for 5 epochs on a single A100 GPU (8 hours). Hard negatives were patents from the same IPC class that were not cited by the query patent. After training, they reindexed all 8 million patents (a 36-hour batch job).

Result: Recall@20 improved from 65% to 84% on a test set of 1,000 examiner-validated queries. The fine-tuned model correctly captured patent-specific similarity: "fastening mechanism" now retrieved "bolt assembly," "adhesive bonding system," and "clip attachment device." The reindexing cost was $800 in compute, and ongoing inference cost was identical to the original model.

Lesson: Fine-tuned embeddings are essential when your domain has a specialized notion of similarity that differs from general text; citation networks and historical relevance judgments are excellent sources of training signal for domain-specific embedding models.

Research Frontier

Contrastive fine-tuning methods like GISTEmbed and instructor-based approaches are producing task-aware embeddings that outperform general-purpose embedding models on domain-specific retrieval. Research on matryoshka representation learning creates embeddings where subsets of dimensions form valid lower-dimensional representations, enabling flexible accuracy-latency tradeoffs at inference time.

The frontier is learning representations that capture both semantic similarity and structured relational knowledge (such as hierarchical or causal relationships) in a single embedding space.

Exercises

Exercise 14.5.1: Representation learning goal Conceptual

Explain the difference between fine-tuning for generation (SFT) and fine-tuning for representation learning. What is the output of a representation model?

Answer Sketch

SFT trains the model to generate text sequences; the output is generated tokens. Representation fine-tuning trains the model to produce meaningful embedding vectors; the output is a fixed-size dense vector for each input. The embedding should place semantically similar inputs near each other in vector space. Representation models are used for search, retrieval, clustering, and as features for downstream classifiers.

Exercise 14.5.2: Contrastive loss Conceptual

Describe how contrastive learning works for training embedding models. What are positive and negative pairs, and why is the choice of negatives important?

Answer Sketch

Contrastive learning pulls positive pairs (semantically similar) together and pushes negative pairs (dissimilar) apart in embedding space. Positive pairs: a question and its relevant document. Negatives: a question paired with an irrelevant document. Hard negatives (documents that are similar but not relevant) are crucial because they force the model to learn fine-grained distinctions. Easy negatives (completely unrelated documents) provide little learning signal since the model already separates them.

Exercise 14.5.3: Matryoshka embeddings Coding

Explain the Matryoshka Representation Learning (MRL) technique. Write pseudocode for a training loop that produces embeddings usable at multiple dimensionalities (64, 128, 256, 768).

Answer Sketch

MRL trains the model so that the first N dimensions of the embedding are useful on their own, for any N. Training: for each batch, compute the full embedding, then apply the contrastive loss at multiple truncation points: for dim in [64, 128, 256, 768]: loss += contrastive_loss(embeddings[:, :dim], labels). At inference, truncate to the desired dimensionality based on storage/speed requirements. Lower dims trade quality for 4 to 12x space savings.
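The pseudocode above can be fleshed out into a runnable numpy sketch of the loss computation (the model forward pass and backpropagation are omitted; the contrastive loss is a simplified in-batch objective, not a specific library's implementation):

```python
# Numpy sketch of the Matryoshka loss: apply the same in-batch
# contrastive objective at several truncation widths and sum the losses,
# so every prefix of the embedding is trained to be useful on its own.
import numpy as np

def info_nce(a: np.ndarray, p: np.ndarray, scale: float = 20.0) -> float:
    """In-batch contrastive loss; diagonal entries are the positives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = scale * (a @ p.T)
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

def matryoshka_loss(anchor_emb: np.ndarray, positive_emb: np.ndarray,
                    dims=(64, 128, 256, 768)) -> float:
    """Sum the contrastive loss over nested prefixes of the embedding."""
    return sum(info_nce(anchor_emb[:, :d], positive_emb[:, :d]) for d in dims)

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 768))
loss = matryoshka_loss(a, a + 0.1 * rng.normal(size=(8, 768)))
print(f"matryoshka loss: {loss:.3f}")
```

In a real training loop this scalar would be computed on model outputs and backpropagated; at inference you simply truncate embeddings to whichever of the trained widths fits your storage and latency budget.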

Exercise 14.5.4: Domain-adapted embeddings Analysis

A legal search engine uses a general-purpose embedding model but struggles with queries about specific contract clauses. How would you fine-tune the embedding model for this domain? What training data would you need?

Answer Sketch

Collect training pairs: (query, relevant contract clause) as positives. Generate hard negatives: clauses that mention similar terms but answer different questions. Fine-tune with contrastive loss for 1 to 3 epochs with a low learning rate (1e-5). Evaluate on a held-out set of legal queries with known relevant clauses using recall@k and NDCG. The key is hard negatives from the legal domain, which teach the model to distinguish between similar but legally distinct concepts.
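The hard-negative selection step in the sketch above can itself be automated: for each query, take the highest-scoring document that is not the labeled positive. The sketch below is a simplified illustration with hypothetical names, using raw dot-product scores.

```python
# Sketch of similarity-based hard-negative mining: for each query,
# return the most similar document that is NOT its labeled positive.
import numpy as np

def mine_hard_negatives(query_embs: np.ndarray,
                        doc_embs: np.ndarray,
                        positive_ids: list) -> list:
    """Per query, the index of the highest-scoring non-positive document."""
    sims = query_embs @ doc_embs.T               # raw similarity scores
    hard_ids = []
    for qi, pos in enumerate(positive_ids):
        ranked = np.argsort(-sims[qi])           # best-first document indices
        hard_ids.append(int(next(d for d in ranked if d != pos)))
    return hard_ids

# Toy example: 2 queries over 3 documents
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(mine_hard_negatives(queries, docs, positive_ids=[0, 2]))  # → [1, 1]
```

One caveat: mining with the model you are about to fine-tune can surface false negatives (documents that are actually relevant but unlabeled), so production pipelines often filter the top few candidates or use a cross-encoder to vet mined negatives.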

Exercise 14.5.5: Evaluation metrics Coding

Write a function that evaluates embedding quality on an information retrieval task. Compute recall@1, recall@5, recall@10, and Mean Reciprocal Rank (MRR) given queries, document embeddings, and relevance labels.

Answer Sketch

For each query: compute cosine similarity with all documents, rank by similarity. recall_at_k = 1 if any relevant doc in top-k else 0. reciprocal_rank = 1/rank_of_first_relevant_doc. Average across all queries for each metric. Use numpy: sims = query_emb @ doc_embs.T; ranked = np.argsort(-sims). Report all four metrics; MRR is the most informative single metric for ranking quality.
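The sketch above can be written out as a self-contained function; `relevant` is a hypothetical mapping from each query index to its set of relevant document indices.

```python
# Retrieval evaluation: recall@k and MRR over cosine-similarity rankings.
# `relevant` maps each query index to the set of relevant doc indices.
import numpy as np

def evaluate_retrieval(query_embs: np.ndarray,
                       doc_embs: np.ndarray,
                       relevant: dict,
                       ks=(1, 5, 10)) -> dict:
    """Compute recall@k and MRR for cosine-similarity retrieval."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                              # (num_queries, num_docs)
    ranked = np.argsort(-sims, axis=1)          # best-first document indices
    metrics = {f"recall@{k}": 0.0 for k in ks}
    metrics["mrr"] = 0.0
    for qi in range(len(q)):
        rel = relevant[qi]
        for k in ks:
            if rel & set(ranked[qi, :k].tolist()):  # any relevant doc in top-k?
                metrics[f"recall@{k}"] += 1
        for rank, di in enumerate(ranked[qi], start=1):
            if di in rel:                       # first relevant doc found
                metrics["mrr"] += 1.0 / rank
                break
    return {m: v / len(q) for m, v in metrics.items()}

# Toy check: each query's relevant doc is its own one-hot embedding
docs = np.eye(4)
res = evaluate_retrieval(docs[:2], docs, {0: {0}, 1: {1}})
print(res)  # → {'recall@1': 1.0, 'recall@5': 1.0, 'recall@10': 1.0, 'mrr': 1.0}
```

Run this on a held-out query set before and after fine-tuning; the before/after deltas on these metrics are the primary evidence that the fine-tuned embeddings are actually better for your task.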

What Comes Next

In the next section, Section 14.6: Fine-Tuning for Classification & Sequence Tasks, we cover fine-tuning for classification and sequence labeling, the most common supervised NLP applications. The contrastive training data preparation in this section also connects to the embedding training methods in Section 19.2 and the data curation pipelines in Section 06.4.

Fun Fact

Fine-tuning embeddings for a specific domain is like recalibrating a telescope for a particular region of the sky. The general-purpose instrument works fine for stargazing, but the specialized one reveals details that were previously invisible. The catch? Every time you recalibrate, you need to re-photograph the entire sky (reindex your corpus).

References and Further Reading
Contrastive Learning Foundations

Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

The foundational paper for efficient sentence embedding training using Siamese and triplet network architectures. Sentence-BERT established the training paradigm that all modern embedding fine-tuning builds upon. Required reading for understanding the contrastive learning approach covered in this section.

Paper

Gao, T. et al. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP 2021.

Introduces a remarkably simple contrastive approach that uses dropout as the only data augmentation, producing state-of-the-art embeddings. SimCSE demonstrates that effective contrastive learning does not require complex augmentation strategies. Valuable for understanding the design principles behind modern embedding training.

Paper

Wang, L. et al. (2024). Text Embeddings by Weakly-Supervised Contrastive Pre-training.

Presents E5, a family of embedding models trained with weakly-supervised contrastive learning on large-scale text pairs. The paper demonstrates how to scale embedding training beyond manually curated datasets. Relevant for teams considering large-scale embedding pre-training before domain-specific fine-tuning.

Paper
Benchmarks and Applications

Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023.

The standard benchmark for evaluating embedding model quality across retrieval, classification, clustering, and semantic similarity tasks. Use MTEB to measure the impact of your fine-tuning and compare against baselines. Indispensable for the evaluation workflow discussed in this section.

Benchmark

Xiao, S. et al. (2023). C-Pack: Packaged Resources To Advance General Chinese Embedding.

Introduces BGE (BAAI General Embedding), a family of high-performing multilingual embedding models with detailed training recipes. The BGE training methodology provides a practical blueprint for fine-tuning embeddings in non-English domains. Useful for teams working with multilingual or domain-specific embedding requirements.

Paper

Henderson, M. et al. (2017). Efficient Natural Language Response Suggestion for Smart Reply.

Describes Google's Smart Reply system, which uses learned embeddings to match incoming messages with suggested responses at scale. This is a practical case study of embedding fine-tuning for a production retrieval application. Recommended for understanding how fine-tuned embeddings power real-world systems.

Paper