"Why replace the whole engine when you can just upgrade the fuel? Embeddings are the premium octane that makes your trusty XGBoost purr."
Label, Fuel-Injected AI Agent
The best of both worlds. Instead of choosing between an LLM and a classical model, you can use the LLM as a feature extractor and feed its outputs into a traditional ML pipeline. LLM embeddings capture deep semantic meaning that TF-IDF cannot, while the downstream classical model (XGBoost, logistic regression, neural network) provides fast inference, low cost, and full interpretability. This pattern is particularly powerful when you want LLM-quality understanding at classical-ML prices, or when you need to combine text understanding with structured features that LLMs handle poorly. The word embedding concepts from Section 01.2 evolved into the dense representations that power this approach.
Prerequisites
This section builds on the decision framework from Section 12.1. Familiarity with embedding representations from Section 01.3 and the API patterns for extracting embeddings from Section 10.1 is important. The structured output techniques from Section 10.2 explain how LLM-generated features can be reliably parsed into tabular formats.
1. Embeddings as Features
This section uses embeddings extensively. If you need a refresher on how word and sentence embeddings work, see Chapter 01 (Text Representation) for foundational NLP concepts and Chapter 07 for how pretrained models learn these representations. Here we focus on using embeddings as features for downstream ML models.
Every LLM (and many smaller language models) can produce embeddings: dense vector representations that encode the semantic meaning of text. These embeddings serve as drop-in replacements for hand-crafted features like TF-IDF or bag-of-words, and they consistently outperform them on tasks that require understanding meaning rather than just matching keywords.
1.1 Why Embeddings Beat TF-IDF
Why this pattern is so powerful: The LLM-as-feature-extractor pattern gives you the best of both worlds: the deep semantic understanding of a large language model combined with the speed, cost, and interpretability of a classical classifier. The key insight is that LLM embeddings are a one-time computation. You pay the LLM cost once to generate the embedding, then the resulting vector can be used for any number of downstream tasks (classification, clustering, similarity search, anomaly detection) at essentially zero marginal cost. This amortization is why embedding-based pipelines can achieve LLM-quality understanding at 1/100th the per-prediction cost of calling an LLM for each query. The embedding approach also connects directly to the vector databases and RAG systems covered in Part 5.
TF-IDF represents text as sparse vectors based on word frequencies. It captures lexical overlap but completely misses semantic similarity. The sentences "The car is fast" and "The automobile has high velocity" share zero TF-IDF features despite being semantically identical. Embeddings from a language model map both sentences to nearby points in a dense vector space, because the model has learned that "car" and "automobile," "fast" and "high velocity" are semantically related. We explore this vector space geometry in more detail in Chapter 19, where embeddings power retrieval systems.
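To make the lexical-overlap point concrete, the short sketch below computes the TF-IDF cosine similarity of those two example sentences with scikit-learn. With English stop words removed, the sentences share no vocabulary at all, so the similarity is exactly zero:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The car is fast", "The automobile has high velocity"]

# Remove English stop words so only content words remain
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(sentences)

sim = cosine_similarity(X[0], X[1])[0, 0]
print(f"Vocabulary: {sorted(vec.vocabulary_)}")
print(f"TF-IDF cosine similarity: {sim:.3f}")  # 0.000
```

An embedding model run on the same pair would place both sentences close together in vector space, which is precisely the gap this pattern exploits.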
The classic "king minus man plus woman equals queen" analogy from Word2Vec still works with modern LLM embeddings, but the geometry has gotten richer. Modern embedding spaces capture subtle relationships that early models missed entirely: sarcasm versus sincerity, formal versus casual register, and even the difference between "I am fine" (genuinely fine) and "I am fine" (definitely not fine). Context, it turns out, is everything.
The tradeoff is computational cost. Computing a TF-IDF vector requires only a dictionary lookup and some arithmetic. Computing an embedding requires a forward pass through a neural network. However, this cost is paid only once: you can precompute embeddings for your entire dataset and then use the resulting vectors with any classical model at near-zero marginal cost. Figure 12.2.1 shows this end-to-end pipeline.
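The compute-once pattern can be sketched in a few lines. In the toy example below, random vectors stand in for real precomputed embeddings and the label rule is synthetic; the point is that one saved embedding matrix serves multiple downstream tasks:

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
emb = rng.standard_normal((200, 384))     # stand-in for real embeddings
labels = (emb[:, 0] > 0).astype(int)      # toy downstream labels

# Pay the embedding cost once: persist the vectors to disk...
path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, emb)

# ...then reuse them for any number of downstream tasks at no extra cost
X = np.load(path)
clf = LogisticRegression(max_iter=1000).fit(X, labels)     # classification
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # clustering

print(f"Classifier accuracy: {clf.score(X, labels):.2f}")
print(f"Cluster sizes: {np.bincount(clusters)}")
```

In a real pipeline the saved matrix would come from one batch embedding job over the dataset, after which every experiment reads the same file.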
1.2 Generating Embeddings with OpenAI
The following snippet, Code Fragment 12.2.1, demonstrates how to generate embeddings for a batch of text inputs using OpenAI's embedding API and then inspects the output shape and cost.
# Generate text embeddings via OpenAI's API for downstream ML tasks.
# Batches multiple texts in a single call for efficiency.
import openai
import numpy as np

# Initialize OpenAI client (reads OPENAI_API_KEY from env)
client = openai.OpenAI()

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Get embeddings for a batch of texts.

    For large datasets, split into chunks of ~2000 texts
    to stay within the API's per-request token limit.
    """
    response = client.embeddings.create(
        input=texts,
        model=model,
    )
    return np.array([item.embedding for item in response.data])

# Example: embed customer support tickets
tickets = [
    "I was charged twice for my subscription last month",
    "The app crashes every time I try to upload a photo",
    "How do I update my billing address?",
    "My package was delivered to the wrong address",
    "Can you explain your enterprise pricing plans?",
]

embeddings = get_embeddings(tickets)
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dtype: {embeddings.dtype}")
print(f"First embedding (first 5 dims): {embeddings[0][:5]}")
print("\nCost: ~$0.00002 per 1K tokens (text-embedding-3-small)")
print("Total for these 5 short tickets: a small fraction of a cent")
For local inference without API costs, Code Fragment 12.2.2 uses the sentence-transformers library to generate embeddings on your own hardware.
# Generate sentence embeddings for semantic similarity comparison.
# The bi-encoder produces fixed-size vectors for fast nearest-neighbor search.
from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Load a local embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, fast

texts = [
    "I was charged twice for my subscription",
    "The app crashes every time I try to upload a photo",
    "How do I update my billing address?",
    "My package was delivered to the wrong address",
    "Can you explain your enterprise pricing plans?",
]

# Generate embeddings locally
start = time.perf_counter()
embeddings = model.encode(texts, normalize_embeddings=True)
elapsed = time.perf_counter() - start

print("Model: all-MiniLM-L6-v2")
print(f"Embedding shape: {embeddings.shape}")
print(f"Time for 5 texts: {elapsed*1000:.1f} ms")
print(f"Avg per text: {elapsed/len(texts)*1000:.1f} ms")
print("Cost: $0.00 (local inference)")

# Compute similarity matrix (vectors are normalized, so dot product = cosine)
similarity = np.dot(embeddings, embeddings.T)
print(f"\nSimilarity between 'charged twice' and 'billing address': "
      f"{similarity[0][2]:.3f}")
print(f"Similarity between 'charged twice' and 'app crashes': "
      f"{similarity[0][1]:.3f}")
4. Combining Embeddings with Structured Features
The most powerful pattern combines LLM embeddings (capturing text semantics) with traditional structured features (capturing numeric and categorical data) in a single model. This is especially effective for tasks where both text and metadata carry predictive signal, such as support ticket prioritization, product recommendation, and content moderation. Code Fragment 12.2.3 shows this approach in practice.
# Feature ablation: compare structured-only, embeddings-only, and combined features.
# Synthetic data in which embeddings and structured features carry
# complementary signal, so the combined configuration should do best.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import xgboost as xgb

np.random.seed(42)
n_samples = 1000

# Simulate embeddings (384-dim, as from a local model)
text_embeddings = np.random.randn(n_samples, 384) * 0.1

# Simulate structured features
structured_features = np.column_stack([
    np.random.randint(0, 365, n_samples),   # account_age_days
    np.random.randint(0, 50, n_samples),    # previous_tickets
    np.random.uniform(0, 5000, n_samples),  # account_value
    np.random.choice([0, 1], n_samples),    # is_premium
    np.random.uniform(0, 1, n_samples),     # sentiment_score
])

# Latent "text" signal, leaked weakly into the first embedding dims so that
# the embeddings carry predictive information the structured features lack
text_signal = np.random.randn(n_samples)
text_embeddings[:, :8] += text_signal[:, None] * 0.1

# Labels depend on both structured features and the text-borne signal
labels = (
    (structured_features[:, 3] == 1).astype(float) * 0.3      # premium
    + (structured_features[:, 2] > 2500).astype(float) * 0.3  # high value
    + text_signal * 0.3                                       # text signal
    + np.random.randn(n_samples) * 0.2                        # noise
) > 0.3
labels = labels.astype(int)

# Three feature configurations
configs = {
    "Structured only": structured_features,
    "Embeddings only": text_embeddings,
    "Combined (structured + embeddings)": np.hstack([
        StandardScaler().fit_transform(structured_features),
        text_embeddings,
    ]),
}

model = xgb.XGBClassifier(
    n_estimators=100, max_depth=4, learning_rate=0.1,
    eval_metric='logloss',
)

print("Feature ablation study (5-fold CV accuracy):")
print("=" * 55)
for name, features in configs.items():
    scores = cross_val_score(model, features, labels, cv=5)
    print(f"  {name:40s} {scores.mean():.3f} (+/- {scores.std():.3f})")
Embeddings and structured features are complementary, not competing. Think of it like a medical diagnosis: structured features (age, blood pressure, lab results) provide the hard numbers, while embeddings of the patient's description capture the softer signals (symptom severity, emotional context, narrative coherence). Neither alone tells the complete story. The combined feature matrix consistently outperforms either source in isolation because each captures a different type of information about the input.
When combining embeddings with structured features, always standardize (zero mean, unit variance) the structured features. Embedding vectors from language models are typically already normalized or on a consistent scale, but structured features like "account_value" (range 0 to 5000) and "is_premium" (0 or 1) need to be rescaled so that the gradient-based optimizer does not disproportionately weight high-magnitude features. Tree-based models like XGBoost are less sensitive to feature scaling, but it is still good practice.
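One convenient way to apply this selectively is scikit-learn's ColumnTransformer, which scales only the structured columns and passes the embedding columns through untouched. The sketch below uses synthetic data and illustrative feature names:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
emb = rng.standard_normal((n, 16)) * 0.1        # stand-in embeddings
structured = np.column_stack([
    rng.uniform(0, 5000, n),                    # account_value (0..5000)
    rng.integers(0, 2, n),                      # is_premium (0/1)
])
X = np.hstack([structured, emb])
y = (structured[:, 1] == 1).astype(int)         # toy target

# Scale only the two structured columns; embedding dims pass through as-is
pre = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1])], remainder="passthrough"
)
pipe = make_pipeline(pre, LogisticRegression(max_iter=1000)).fit(X, y)
print(f"Training accuracy: {pipe.score(X, y):.2f}")
```

Bundling the scaler into the pipeline also guarantees the same transformation is applied at inference time, avoiding train/serve skew.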
4.1 Semantic Caching as a Hybrid Pattern
One of the most cost-effective hybrid patterns combines embeddings with a cache layer to avoid redundant LLM calls. In a typical application, many user queries are semantically equivalent even though they differ in wording: "What is your return policy?" and "How do I return an item?" should produce the same response. Semantic caching intercepts incoming queries, embeds them, and checks whether a sufficiently similar query has been answered recently. If a cached response exists within a configurable similarity threshold, it is returned immediately, bypassing the LLM entirely.
The architecture is straightforward. An embedding model (local or API-based) converts each incoming query into a vector. This vector is compared against a vector index of recent query embeddings using cosine similarity. If the nearest neighbor exceeds the similarity threshold (typically 0.92 to 0.97), the cached response is returned. Otherwise, the query is forwarded to the LLM, and both the query embedding and the response are stored in the cache for future reuse. Code Fragment 12.2.4 defines a reusable SemanticCache class that implements this logic.
# Semantic cache: avoid redundant LLM calls by matching new queries
# against cached embeddings using cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, threshold=0.95, max_entries=10000):
        self.threshold = threshold
        self.max_entries = max_entries
        self.embeddings = []  # List of numpy arrays
        self.responses = []   # Corresponding cached responses

    def _embed(self, text):
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return np.array(resp.data[0].embedding)

    def get_or_generate(self, query, generate_fn):
        query_vec = self._embed(query)
        # Check cache for semantic match
        # (OpenAI embeddings are unit-normalized, so dot product = cosine)
        if self.embeddings:
            sims = np.dot(np.array(self.embeddings), query_vec)
            best_idx = np.argmax(sims)
            if sims[best_idx] >= self.threshold:
                return self.responses[best_idx]  # Cache hit
        # Cache miss: generate and store
        response = generate_fn(query)
        self.embeddings.append(query_vec)
        self.responses.append(response)
        # Evict oldest entries if cache is full
        if len(self.embeddings) > self.max_entries:
            self.embeddings.pop(0)
            self.responses.pop(0)
        return response
The SemanticCache.get_or_generate() method embeds each incoming query, compares it against the stored vectors at a configurable threshold (default 0.95), and returns the cached response on a hit, bypassing the LLM entirely.
The cost savings from semantic caching can be dramatic. In customer support applications where 30% to 50% of queries are paraphrases of common questions, semantic caching reduces LLM API costs proportionally while cutting median response latency from 1 to 2 seconds (LLM generation) to under 50 milliseconds (vector lookup). The embedding cost for the cache lookup is negligible: a single embedding API call costs roughly 1,000x less than a full LLM generation. For even lower latency, a local embedding model like all-MiniLM-L6-v2 can handle the cache lookup in under 5 milliseconds on CPU.
Semantic caches require careful threshold tuning and TTL (time-to-live) policies. A threshold too low (say, 0.85) returns cached responses for queries that are similar in topic but different in intent, producing incorrect answers. A threshold too high (0.99) rarely matches, eliminating the benefit. Start with 0.95 and evaluate on a sample of query pairs that should and should not match. Also set a TTL so that cached responses to questions about prices, availability, or policies expire when the underlying information changes.
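The evaluation loop for threshold tuning can be very small. The sketch below uses a handful of hypothetical labeled pairs of (cosine similarity, should-the-cache-match) and picks the candidate threshold with the highest agreement; with real data you would use hundreds of pairs sampled from production queries:

```python
import numpy as np

# Hypothetical labeled pairs: (cosine similarity, should the cache match?)
pairs = [(0.98, True), (0.96, True), (0.93, True),
         (0.94, False), (0.91, False), (0.88, False)]

def hit_accuracy(threshold):
    """Fraction of pairs where (sim >= threshold) agrees with the label."""
    return np.mean([(sim >= threshold) == match for sim, match in pairs])

candidates = [round(0.85 + 0.01 * i, 2) for i in range(15)]  # 0.85 .. 0.99
best = max(candidates, key=hit_accuracy)
print(f"Best threshold: {best:.2f} (accuracy {hit_accuracy(best):.2f})")
```

Note that the pair at (0.94, False) sits above the pair at (0.93, True), so no threshold scores perfectly here; real query distributions overlap in the same way, which is why the threshold is a tuned tradeoff rather than a clean cutoff.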
5. Dimensionality Reduction for Embeddings
High-dimensional embeddings (1536 dimensions from OpenAI, 768 or 1024 from many open models) can cause issues with some classical models. Logistic regression may overfit without strong regularization. Tree models may struggle with the high dimensionality. Dimensionality reduction techniques like PCA or UMAP can compress embeddings while preserving most of the information.
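A quick sketch of the reduction step is below, with random vectors standing in for real embeddings. (Random isotropic data retains little variance at 256 components; real embedding matrices are highly anisotropic, so PCA typically preserves far more of their variance at the same dimensionality.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 1536))   # stand-in for OpenAI-size embeddings

# Fit PCA on training embeddings, then transform both train and test sets
pca = PCA(n_components=256).fit(X)
X_reduced = pca.transform(X)

print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```

Fit the PCA on training embeddings only and reuse the fitted transform at inference time, exactly as with any other preprocessing step.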
| Embedding Model | Dimensions | Quality (MTEB) | Cost | Speed |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | Highest | $0.00013/1K tokens | API latency |
| text-embedding-3-small | 1536 | High | $0.00002/1K tokens | API latency |
| all-MiniLM-L6-v2 | 384 | Good | Free (local) | ~5ms/text (GPU) |
| bge-large-en-v1.5 | 1024 | High | Free (local) | ~15ms/text (GPU) |
| nomic-embed-text-v1.5 | 768 | High | Free (local) | ~8ms/text (GPU) |
Always use the same embedding model for both training and inference. Embeddings from different models live in different vector spaces and are not interchangeable. If you train your XGBoost classifier on OpenAI embeddings, you cannot switch to a local sentence-transformer model at inference time without retraining. Plan your embedding strategy before building the pipeline, considering both quality and long-term operational costs.
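A lightweight guard against this mismatch is to persist the embedding model name alongside the trained classifier artifact and fail fast at load time if the runtime model differs. A minimal sketch (the file path and metadata keys are illustrative):

```python
import json
import os
import tempfile

# At training time: record the embedding model next to the model artifact
meta_path = os.path.join(tempfile.mkdtemp(), "model_meta.json")
with open(meta_path, "w") as f:
    json.dump({"embedding_model": "text-embedding-3-small", "dims": 1536}, f)

# At inference time: refuse to run if the embedding model differs
with open(meta_path) as f:
    meta = json.load(f)

runtime_model = "text-embedding-3-small"
if meta["embedding_model"] != runtime_model:
    raise RuntimeError(
        f"Classifier was trained on {meta['embedding_model']} embeddings "
        f"but runtime uses {runtime_model}; retrain before switching."
    )
print("Embedding model check passed")
```

The same check can cover embedding dimensionality and any normalization settings, since those also silently change the vector space.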
- LLM embeddings serve as powerful drop-in replacements for TF-IDF, capturing semantic meaning that sparse representations miss.
- The embedding pipeline (compute once, reuse many times) gives you LLM-quality text understanding at near-zero marginal inference cost.
- LLM-powered feature engineering extracts interpretable, structured attributes from text that can be combined with traditional features in a single model.
- Enriching sparse structured data with LLM-generated text descriptions and embeddings can significantly improve classical model performance.
- Local embedding models (sentence-transformers) eliminate API costs and latency for high-volume production systems.
- Always use the same embedding model for training and inference; vectors from different models are not compatible.
Who: A search team at an online marketplace with 3 million product listings and 500,000 daily search queries.
Situation: Their search ranking model used TF-IDF features combined with click-through signals in a LambdaMART model. It performed well for exact keyword matches but struggled with semantic queries like "cozy reading light" or "gift for a coffee lover."
Problem: Semantic search failures caused 18% of queries to return zero relevant results in the top 5, driving users to competitor platforms. Adding synonyms manually was unsustainable with 50,000 new products per month.
Dilemma: They could deploy a full neural search system (high infrastructure cost, retraining needed), use LLM embeddings as additional features in their existing ranker (moderate effort, leverages existing infrastructure), or call an LLM at query time for query expansion (high latency, high cost at scale).
Decision: They chose to add sentence-transformer embeddings as features in their existing LambdaMART model, computing embeddings offline for all products and at query time for search queries using a local model.
How: They deployed the all-MiniLM-L6-v2 model locally, pre-computed embeddings for all 3 million products (a one-time batch job taking 4 hours on a single GPU), and added the cosine similarity between query and product embeddings as a new feature to the ranker. The embedding was also used to generate "semantic match" features for the top 100 candidates retrieved by BM25.
Result: Zero-result queries dropped from 18% to 4%. NDCG@10 improved by 12% across all queries and by 31% for semantic queries specifically. The local embedding model added only 3ms of latency per query, and GPU hosting cost $200/month.
Lesson: LLM embeddings as features in existing models deliver most of the benefit of neural search without requiring a full infrastructure overhaul; pre-computing embeddings offline keeps inference costs near zero.
Using an LLM as a feature extractor is conceptually similar to how early deep learning practitioners used pretrained ImageNet models as feature extractors for unrelated vision tasks. The principle is the same: large models learn general representations that transfer surprisingly well to specialized downstream tasks.
Embedding model specialization. Purpose-built embedding models (Nomic Embed, Jina Embeddings v3, Cohere Embed v3) are outperforming general-purpose LLM embeddings on retrieval and classification tasks while being 10 to 100x cheaper to run. The embedding landscape is covered in depth in Section 19.1.
LLM-as-judge for feature quality. Using one LLM to evaluate the quality of features generated by another is emerging as a scalable quality assurance pattern, reducing the need for human annotation in feature validation pipelines.
Structured generation for tabular features. Combining structured output (from Section 10.2) with feature extraction enables LLMs to produce validated, typed feature vectors that integrate directly into ML training pipelines without manual parsing.
Exercises
Explain why LLM embeddings capture semantic similarity that TF-IDF misses. Give an example of two sentences that have high embedding similarity but zero TF-IDF overlap.
Answer Sketch
'The car is fast' and 'The automobile has high velocity' share no words (after removing stopwords) so TF-IDF similarity is zero. LLM embeddings map both to nearby vectors because the model learned that 'car'/'automobile' and 'fast'/'high velocity' are synonymous. Embeddings encode meaning, while TF-IDF encodes surface-level word co-occurrence.
Write Python code that extracts embeddings from customer reviews using an embedding API, then trains an XGBoost classifier on those embeddings with a sentiment label. Include train/test split.
Answer Sketch
Use client = openai.OpenAI() and client.embeddings.create(input=texts, model='text-embedding-3-small') to get embeddings. Stack into a NumPy array. Split with train_test_split(X, y, test_size=0.2). Train: clf = XGBClassifier(); clf.fit(X_train, y_train). Evaluate: accuracy_score(y_test, clf.predict(X_test)). This combines LLM-quality text understanding with fast, cheap XGBoost inference.
Beyond embeddings, LLMs can generate structured features from text (e.g., extracting 'urgency level' from a support ticket). Compare the tradeoffs of using embeddings vs. LLM-generated features as inputs to a classical model.
Answer Sketch
Embeddings: fast to compute (single API call per text), dense vectors that capture general semantics, but are opaque and not interpretable. LLM-generated features: slower (requires a generation call per text), produce interpretable features (e.g., urgency=high), but are more expensive and may hallucinate values. Use embeddings for high-volume batch processing; use LLM-generated features when interpretability is required or specific domain attributes matter.
LLM embeddings are typically 1536 or 3072 dimensions. Write code that applies PCA to reduce embedding dimensionality to 256, then compares classifier accuracy before and after reduction.
Answer Sketch
Use from sklearn.decomposition import PCA. Fit PCA on training embeddings: pca = PCA(n_components=256); X_train_pca = pca.fit_transform(X_train). Transform test set: X_test_pca = pca.transform(X_test). Train two classifiers (full and reduced) and compare accuracy. Typically, PCA to 256 dims retains 95%+ of variance and loses less than 1% accuracy while reducing memory and training time by 6x.
You need to embed 1 million customer reviews (average 200 tokens each) using text-embedding-3-small at $0.02 per million tokens. Calculate the total cost and compare it to running a local sentence-transformers model on a single GPU.
Answer Sketch
API cost: 1M reviews * 200 tokens = 200M tokens. At $0.02/1M tokens = $4.00 total. Local model: a sentence-transformers model on one GPU processes roughly 1000 reviews/second, taking about 17 minutes. GPU cost at $1/hour = ~$0.28. The API is simpler but 14x more expensive. At this scale, local is cheaper; at smaller scales (<100K), the API wins on simplicity.
What Comes Next
In the next section, Section 12.3: Hybrid Pipeline Patterns, we examine hybrid pipeline patterns that combine LLMs with traditional components for production-grade systems.
Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
Introduces Sentence-BERT, the foundational technique for generating semantically meaningful sentence embeddings efficiently. This paper is the starting point for understanding how LLM embeddings can serve as features for downstream classical models. Essential for anyone implementing the embedding-as-feature pattern.
Sentence-Transformers. (2024). Sentence-Transformers Documentation.
The official documentation for the sentence-transformers library, which provides pre-trained models and utilities for generating embeddings. Includes tutorials on fine-tuning, model selection, and integration with downstream tasks. The go-to practical resource for implementing the code examples in this section.
Neelakantan, A. et al. (2022). Text and Code Embeddings by Contrastive Pre-Training.
Describes OpenAI's approach to training general-purpose text and code embeddings using contrastive learning. The paper demonstrates how large-scale contrastive pre-training produces embeddings that transfer well across tasks. Valuable for understanding the quality gap between API-based and open-source embedding models.
Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023.
Presents the most comprehensive benchmark for text embedding models, evaluating across classification, clustering, retrieval, and semantic similarity tasks. Use this to select the right embedding model for your specific downstream task. Indispensable for practitioners comparing embedding quality before committing to a model.
OpenAI. (2024). Embeddings Guide.
OpenAI's official guide covering embedding model selection, dimensionality, pricing, and best practices for common use cases like search, clustering, and classification. Particularly helpful for teams using the OpenAI API as their embedding provider and wanting to optimize cost and quality.
Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report.
A thorough overview of learning-to-rank algorithms that form the classical ML backbone of many hybrid search systems. Understanding LambdaMART is key to building reranking pipelines that combine LLM embeddings with traditional ranking features. Recommended for engineers building hybrid search or recommendation systems.
