Joint Embedding Spaces for Multimodal Retrieval

Section 33.1

"Two encoders that agree on the same point in latent space have implicitly written a translation dictionary."

RAG, Cross-Modal-Curious AI Agent
Big Picture

A joint embedding space maps every modality (text, image, audio, video) into the same vector space, so that semantically related items end up near each other. Once you have such a space, retrieval becomes a single nearest-neighbor query: text-to-image, image-to-text, audio-to-image, anything-to-anything.

The lineage of models is short and easy to follow. CLIP set the template for text and image. SigLIP improved the contrastive objective. ImageBind extended the space to six modalities. LanguageBind and 4M took it further still.

This section walks through the contrastive training that produces these spaces, the late-fusion architecture that makes them efficient at retrieval time, and the practical concerns (dimensionality, normalization, hubness) that determine retrieval quality.

Prerequisites

This section builds on the embedding fundamentals from Section 31.1 and the vision-language patterns from Section 31.1. Familiarity with Section 37.2 (early vs late fusion) helps situate why joint embedding spaces are late-fusion by design.

Diagram of a joint embedding space: a text encoder and an image encoder both project into a shared d-dimensional sphere, with cosine similarity used for retrieval
Figure 33.1.1: A joint embedding space. Each encoder maps its modality into the same vector space, normalized to lie on the unit hypersphere. Cosine similarity (equivalent to dot product on the sphere) is the retrieval score.

33.1.1 The Contrastive Recipe

Fun Fact

CLIP was trained on 400 million image-text pairs scraped from the web, an approach OpenAI later described as both essential to its success and impossible to recreate cleanly. Every successor of CLIP that has tried to use only licensed data has been roughly half a generation behind, a tradeoff the field has not resolved.

The standard contrastive objective trains two encoders (one per modality) to produce similar embeddings for matched pairs and dissimilar embeddings for mismatched pairs. For a batch of $N$ text-image pairs $(t_i, x_i)$:

$$ \mathcal{L} = -\frac{1}{2N}\sum_{i=1}^N \log\frac{\exp(\langle f(t_i), g(x_i)\rangle/\tau)}{\sum_{j=1}^N \exp(\langle f(t_i), g(x_j)\rangle/\tau)} - \frac{1}{2N}\sum_{i=1}^N \log\frac{\exp(\langle g(x_i), f(t_i)\rangle/\tau)}{\sum_{j=1}^N \exp(\langle g(x_i), f(t_j)\rangle/\tau)} $$

where $f, g$ are the text and image encoders, $\langle \cdot, \cdot \rangle$ is dot product on normalized vectors, and $\tau$ is a learnable temperature. The two summands are the InfoNCE losses in each direction (text-to-image and image-to-text); their average is the symmetric CLIP loss.
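The loss above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than any library's implementation: the function names are made up for this example, random vectors stand in for encoder outputs, and `tau=0.07` is an assumed fixed value (in training it is learned).

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_infonce(text_emb: np.ndarray, image_emb: np.ndarray,
                      tau: float = 0.07) -> float:
    """Symmetric contrastive loss for N matched (text, image) rows."""
    t = l2_normalize(text_emb)
    x = l2_normalize(image_emb)
    logits = t @ x.T / tau                 # logits[i, j] = <f(t_i), g(x_j)> / tau
    diag = np.arange(len(logits))
    # Text-to-image: each text row is normalized over all images (columns).
    loss_t2i = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Image-to-text: each image column is normalized over all texts (rows).
    loss_i2t = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (loss_t2i + loss_i2t))

# Toy check with synthetic "encoder outputs" (assumed data, not real embeddings).
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))
aligned_text = images + 0.05 * rng.normal(size=(8, 32))   # matched pairs nearly agree
random_text = rng.normal(size=(8, 32))                    # no relation to the images
loss_aligned = symmetric_infonce(aligned_text, images)
loss_random = symmetric_infonce(random_text, images)
loss_uniform = symmetric_infonce(np.ones((4, 3)), np.ones((4, 3)))
```

When every embedding is identical the logits are constant and the loss equals $\log N$, which is a convenient sanity check for an untrained model.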

The training pattern is the same as the contrastive embedding training from Section 31.1, with two distinctions:

33.1.2 CLIP and Its Direct Successors

CLIP (Radford et al., 2021) was the breakthrough. The original paper trained on 400M (image, alt-text) pairs scraped from the web with a ViT-L/14 image encoder and a 63M-parameter text transformer. The result: an embedding space where text-to-image retrieval, zero-shot image classification, and image-to-text similarity all worked without task-specific fine-tuning.

The 2024-2026 successors improve on three axes:

# SigLIP 2 inference: encode a text query and a set of images,
# then rank by cosine similarity for retrieval.
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch
import torch.nn.functional as F

MODEL_ID = "google/siglip2-base-patch16-naflex"
model = AutoModel.from_pretrained(MODEL_ID).cuda().eval()
proc = AutoProcessor.from_pretrained(MODEL_ID)

def encode_text(queries):
    # SigLIP text towers are trained with fixed-length padding (64 tokens),
    # so pad to max_length rather than to the longest query in the batch.
    batch = proc(text=queries, return_tensors="pt",
                 padding="max_length", max_length=64).to("cuda")
    with torch.inference_mode():
        emb = model.get_text_features(**batch)
    return F.normalize(emb, dim=-1)

def encode_images(paths):
    # Convert to RGB so grayscale or RGBA files do not break preprocessing.
    images = [Image.open(path).convert("RGB") for path in paths]
    batch = proc(images=images, return_tensors="pt").to("cuda")
    with torch.inference_mode():
        emb = model.get_image_features(**batch)
    return F.normalize(emb, dim=-1)

# Retrieve top-5 images for a query.
# all_image_paths is assumed to be a list of image file paths defined elsewhere.
q = encode_text(["a dog wearing sunglasses"])
img_embs = encode_images(all_image_paths)
scores = (q @ img_embs.T).squeeze(0)
top5 = scores.topk(min(5, len(all_image_paths)))
for rank, (idx, score) in enumerate(zip(top5.indices.tolist(), top5.values.tolist()), start=1):
    print(f"#{rank} {all_image_paths[idx]} score {score:.3f}")
Code Fragment 33.1.1a: SigLIP 2 text-to-image retrieval. Both encoders produce L2-normalized 768-dim embeddings; dot product is the retrieval score. For a million-image catalog you would push img_embs into a vector database (FAISS, Qdrant, Pinecone) and query with q.
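SigLIP's change to the training objective is to replace the batch-wide softmax with an independent binary decision for every (text, image) pair. The sketch below illustrates that idea in NumPy; it is not the reference implementation. The `scale` and `bias` defaults are assumed starting values (both are learnable in training), and the synthetic embeddings are placeholders.

```python
import numpy as np

def log_sigmoid(z: np.ndarray) -> np.ndarray:
    """Stable log(sigmoid(z)) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)

def sigmoid_pairwise_loss(text_emb: np.ndarray, image_emb: np.ndarray,
                          scale: float = 10.0, bias: float = -10.0) -> float:
    """Pairwise sigmoid loss: matched pairs are positives, all others negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    x = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    logits = scale * (t @ x.T) + bias
    # +1 on the diagonal (true pairs), -1 everywhere else (mismatched pairs).
    labels = 2.0 * np.eye(len(logits)) - 1.0
    # Sum over all N*N pairs, averaged over the N examples in the batch.
    return float(-log_sigmoid(labels * logits).sum() / len(logits))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))
texts = images + 0.05 * rng.normal(size=(8, 32))
loss_matched = sigmoid_pairwise_loss(texts, images)
# Rolling the images by one row mispairs every example.
loss_mispaired = sigmoid_pairwise_loss(texts, np.roll(images, 1, axis=0))
```

Because each pair contributes its own term, no normalization across the whole batch is needed, which is one reason this objective is described as friendlier to smaller batches.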

33.1.3 ImageBind and Beyond: Six Modalities

ImageBind (Girdhar et al., 2023) extended the joint embedding idea to six modalities: image, text, audio, depth, thermal, IMU (inertial). The key insight: you don't need pairs of every modality combination. As long as each modality is paired with images, the joint space self-organizes through the image hub.

This "image-centered" training is parameter-efficient. ImageBind freezes a pretrained CLIP image encoder and trains each new modality's encoder to align with the CLIP image space. After training:

None of these pairs ever appeared in the training data directly; the alignment is transitive through the image hub. This is the same trick that powers cross-lingual word embeddings via a pivot language.
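A toy simulation makes the transitivity argument concrete. It is a sketch under strong assumptions: random unit vectors stand in for the frozen image embeddings, and "training" is replaced by simply placing each audio and depth embedding near its paired image embedding with a hypothetical noise level of 0.05.

```python
import numpy as np

def unit(x: np.ndarray) -> np.ndarray:
    """Project rows onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_items, dim = 200, 64

# Frozen hub: one image embedding per item.
image_emb = unit(rng.normal(size=(n_items, dim)))

# Stand-ins for trained spoke encoders: each output lands near its paired image.
audio_emb = unit(image_emb + 0.05 * rng.normal(size=(n_items, dim)))
depth_emb = unit(image_emb + 0.05 * rng.normal(size=(n_items, dim)))

# Audio-to-depth retrieval, a pairing that was never aligned directly.
sims = audio_emb @ depth_emb.T
top1 = sims.argmax(axis=1)
recall_at_1 = float((top1 == np.arange(n_items)).mean())
print(f"audio-to-depth recall@1: {recall_at_1:.2f}")
```

In this idealized setting the audio-to-depth lookup recovers the correct item because both spokes inherit their geometry from the same hub; real encoders align less tightly, so the transitive scores are noisier.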

| Model | Year | Modalities | Dimensionality | Notable Feature |
|---|---|---|---|---|
| CLIP (ViT-L/14) | 2021 | text, image | 768 | The original; still a useful baseline |
| OpenCLIP (ViT-bigG) | 2023 | text, image | 1280 | Open reproduction at frontier scale |
| SigLIP | 2023 | text, image | 768 | Sigmoid loss, friendly to smaller batches |
| EVA-CLIP-18B | 2024 | text, image | 1024 | Frontier zero-shot ImageNet |
| ImageBind | 2023 | image, text, audio, depth, thermal, IMU | 1024 | Six modalities via image hub |
| LanguageBind | 2024 | text, image, audio, video, depth, thermal | 768 | Language hub instead of image hub |
| 4M-21 | 2024 | 21 modalities | variable | Encoder-decoder, not just encoder |
| SigLIP 2 | 2025 | text, image | 768/1152 | NaFlex resolution, LocCa captioning |
| AudioCLIP | 2022 | text, image, audio | 1024 | Tri-modal extension of CLIP via ESResNeXt |
| CLAP | 2023 | text, audio | 512 | Audio-specific CLIP analog with chunk-and-fuse encoder |
Figure 33.1.2: Joint embedding model landscape, late 2025. CLIP and SigLIP dominate the image-text axis; ImageBind, LanguageBind, and CLAP cover other modalities. Pick based on which modality combinations matter for your retrieval task.

33.1.4 Late Fusion by Design

Joint embedding models are late-fusion by design (see Section 37.2). The modalities meet only at the final embedding, with no shared transformer layers. This has two consequences:

The right mental model: joint embedding models are for retrieval, not for understanding. They are the first stage of a two-stage pipeline where a downstream multimodal LLM does the reasoning. Section 33.2 covers exactly this pattern.

Key Insight: The asymmetry of inference cost

One of CLIP's killer features is the asymmetry between indexing and querying. Indexing 100M images is a one-time GPU job that costs maybe $5,000 and produces ~300 GB of float32 embeddings at 768 dimensions. A brute-force query costs O(d) per indexed item, and an approximate nearest-neighbor index avoids scanning most items at all. This is why CLIP-style retrieval scales to web-scale image search. A multimodal LLM that processes the full image at every query would be on the order of 1000x more expensive per query.
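The storage figure is simple arithmetic, sketched below. The 768-dimensional float32 layout is an assumption chosen to match the roughly 300 GB quoted above, and `index_size_gb` is a helper name invented for this example; index overhead and metadata are ignored.

```python
def index_size_gb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw embedding storage in decimal gigabytes (no index overhead)."""
    return n_vectors * dim * bytes_per_value / 1e9

n_images = 100_000_000
size_fp32 = index_size_gb(n_images, 768)        # 4 bytes per value
size_fp16 = index_size_gb(n_images, 768, 2)     # 2 bytes per value
print(f"float32: {size_fp32:.1f} GB, float16: {size_fp16:.1f} GB")
```

Halving the precision halves the footprint, which is why many deployments store float16 or quantized vectors once retrieval quality has been validated.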

33.1.5 Practical Considerations

Several engineering details determine whether your joint embedding retrieval actually works in production:

Warning: The hidden text bias of CLIP

CLIP's text encoder was trained on alt-text and image captions, which are linguistically narrow (typical caption length: 5 to 15 words). Long queries ("a smiling brown labrador retriever wearing a red collar sitting on green grass next to a wooden bench in a city park") underperform short queries ("dog on bench") because the long version pushes the text embedding into a region the model rarely saw during training. For production retrieval, train users (or your application layer) to issue short visual queries, or use an LLM to compress long queries before encoding.

33.1.7 Audio-Text Joint Embeddings: CLAP and AudioCLIP

Everything in this section so far has been image-centric. The same recipe transfers almost unchanged to audio, with a few twists that the audio modality forces on the architecture. CLAP (Contrastive Language-Audio Pretraining, Elizalde et al., ICASSP 2023) is the cleanest example.

The training objective is the symmetric InfoNCE loss from Section 33.1.1, retargeted to (audio clip, text caption) pairs. Concretely:

$$ \mathcal{L}_{\text{CLAP}} = -\frac{1}{2N}\sum_{i=1}^N \log\frac{\exp(\langle f(a_i), g(t_i)\rangle/\tau)}{\sum_{j=1}^N \exp(\langle f(a_i), g(t_j)\rangle/\tau)} - \frac{1}{2N}\sum_{i=1}^N \log\frac{\exp(\langle g(t_i), f(a_i)\rangle/\tau)}{\sum_{j=1}^N \exp(\langle g(t_i), f(a_j)\rangle/\tau)} $$

where $f$ is the audio encoder, $g$ is the text encoder, $\tau$ is a learnable temperature, and the two terms are the audio-to-text and text-to-audio InfoNCE losses. The structure is identical to CLIP; only the encoders change. Figure 33.1.3 sketches the resulting dual-tower architecture.

CLAP dual-tower architecture: audio waveform passes through HTSAT into an audio encoder, text caption passes through a text encoder, both project into a shared 512-dim embedding space where InfoNCE contrastive loss pulls matched pairs together and pushes mismatched pairs apart.
Figure 33.1.3: CLAP's dual-tower architecture. An audio encoder (HTSAT over log-mel spectrograms) and a text encoder (RoBERTa) project into a shared 512-d embedding space. The symmetric InfoNCE loss pulls each true (audio, caption) pair together and pushes mismatched pairs apart, yielding the same zero-shot retrieval behaviour CLIP brought to images.
Worked Example
Zero-Shot Doorbell vs. Dog Bark Classification

Suppose an IoT app must classify a one-second clip doorbell.wav into one of four candidate sounds, with no labelled training data. CLAP turns the problem into a text-side prompt engineering exercise. The four candidate labels are wrapped in the prompt template "this is the sound of a ___", giving the sentences $t_1 = $ "this is the sound of a doorbell", $t_2 = $ "this is the sound of a dog bark", $t_3 = $ "this is the sound of a phone ring", $t_4 = $ "this is the sound of an alarm".

The clip is encoded once into $f(a) \in \mathbb{R}^{512}$ and L2-normalised. Each label sentence is encoded into $g(t_j) \in \mathbb{R}^{512}$ and L2-normalised. The class scores are the four cosine similarities, passed through a softmax with temperature $\tau$. Suppose, for illustration, that a LAION CLAP run on a clean doorbell clip yields $(\langle f(a), g(t_j)\rangle)_j = (0.41, 0.18, 0.22, 0.15)$. With an illustrative temperature of $\tau = 0.2$ (the trained model's learned temperature will differ), the softmax probabilities are approximately $(0.51, 0.16, 0.20, 0.14)$. The doorbell class wins by a comfortable margin without a single labelled training example, which is the zero-shot behaviour the contrastive objective above is engineered to produce.
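The final scoring step can be reproduced by hand. The sketch below applies a temperature softmax to the four cosine similarities from the example; the value `tau=0.2` is an assumption for illustration, and the exact probabilities depend on whatever temperature the trained checkpoint actually learned.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, shifted by the max for numerical stability."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["doorbell", "dog bark", "phone ring", "alarm"]
cosines = [0.41, 0.18, 0.22, 0.15]          # similarities from the worked example
probs = softmax_with_temperature(cosines, tau=0.2)
prediction = labels[probs.index(max(probs))]

for label, p in zip(labels, probs):
    print(f"{label:>10}: {p:.2f}")
```

Lowering `tau` sharpens the distribution toward the top-scoring label, while raising it flattens the probabilities; the ranking itself never changes.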

Audio breaks two CLIP assumptions, and the LAION variant of CLAP (Wu et al., 2023) addresses each one. First, audio clips vary enormously in duration: a doorbell lasts one second and a podcast episode one hour, so the audio encoder cannot assume a fixed-size input the way a ViT assumes a fixed image resolution. The chunk-and-fuse trick samples three random ten-second chunks from the clip plus one heavily downsampled "global" view, processes them with an HTSAT spectrogram transformer, and fuses the local and global features with a small attention block. Second, audio datasets like AudioSet are labeled with keyword tags ("dog bark", "rain") rather than full sentences. CLAP fixes the distribution mismatch by passing every tag through a frozen T5 prompted to expand keywords into natural-language sentences ("a dog barking loudly in a backyard"), then training the text encoder on those sentences. The trick gives the model a sentence-shaped training distribution even when the source data is tag-only.
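The sampling half of chunk-and-fuse can be sketched on a raw waveform. This is a deliberate simplification and not the model's actual preprocessing: the real pipeline works on mel spectrograms, the function name and zero-padding of short clips are choices made for this example, and strided subsampling is used only to show what a coarse "global" view means.

```python
import numpy as np

def local_and_global_views(waveform: np.ndarray, sample_rate: int,
                           chunk_seconds: float = 10.0, rng=None):
    """Return three random fixed-length chunks plus one coarse whole-clip view."""
    rng = rng or np.random.default_rng()
    chunk_len = int(chunk_seconds * sample_rate)
    n = len(waveform)
    if n <= chunk_len:
        # Short clip: zero-pad to one chunk and reuse it for every view.
        padded = np.pad(waveform, (0, chunk_len - n))
        return [padded, padded, padded], padded
    # Local views: three random windows of chunk_len samples.
    starts = rng.integers(0, n - chunk_len + 1, size=3)
    local_views = [waveform[s:s + chunk_len] for s in starts]
    # Global view: chunk_len samples spread evenly over the entire clip.
    idx = np.linspace(0, n - 1, chunk_len).astype(int)
    global_view = waveform[idx]
    return local_views, global_view

# A 60-second ramp at a toy 100 Hz sample rate, so positions are easy to inspect.
long_clip = np.arange(60 * 100, dtype=float)
locals_long, global_long = local_and_global_views(
    long_clip, sample_rate=100, rng=np.random.default_rng(0))

# A one-second clip is padded up to the chunk length.
short_clip = np.ones(100)
locals_short, global_short = local_and_global_views(short_clip, sample_rate=100)
```

Every view has the same length regardless of the clip's duration, which is what lets a fixed-input encoder handle both a doorbell and a podcast.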

AudioCLIP (Guzhov et al., 2022) is the older cousin: it extends CLIP to text + image + audio by attaching an ESResNeXt audio encoder to a frozen CLIP backbone and aligning all three through a tri-modal contrastive loss. Where CLAP is the audio-text specialist, AudioCLIP gives you audio retrieval that sits in the same space as CLIP image and text embeddings, which is useful when an application needs to mix sound search and image search against a unified index. ImageBind (Section 33.1.3) later generalized this idea to six modalities, but AudioCLIP remains a useful drop-in when you only need the three.

Library Shortcut: zero-shot audio classification with CLAP

Hugging Face exposes CLAP through the standard pipeline abstraction, so a working classifier is a short script:

from transformers import pipeline

clap = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)
result = clap(
    "doorbell.wav",
    candidate_labels=["doorbell", "dog bark", "phone ring", "alarm"],
)
print(result)  # list of {label, score} sorted by score
Code Fragment 33.1.2a: Zero-shot audio classification with the LAION CLAP checkpoint. The same model can be used for retrieval by calling clap.feature_extractor + clap.model.get_audio_features / get_text_features directly and indexing the resulting 512-dim vectors in any vector DB from Section 31.5.

33.1.6 Vector Database Integration

Joint embeddings are the input to a vector database, the same infrastructure covered in Chapter 31. The choices in 2026:

For cross-modal RAG, the vector database must support storing metadata alongside vectors (modality type, source URL, timestamp, language) so the application can filter retrievals. Most modern vector databases handle this; FAISS does not natively.
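The filtering requirement can be illustrated without committing to any particular database. The sketch below is a brute-force NumPy stand-in, not the API of FAISS, Qdrant, or any other product; `filtered_search`, the metadata schema, and the random vectors are all hypothetical.

```python
import numpy as np

def filtered_search(query: np.ndarray, vectors: np.ndarray, metadata: list,
                    predicate, k: int = 5):
    """Cosine top-k restricted to rows whose metadata passes the predicate.

    Assumes query and vectors are already L2-normalized.
    """
    keep = np.array([i for i, m in enumerate(metadata) if predicate(m)], dtype=int)
    if keep.size == 0:
        return []
    scores = vectors[keep] @ query
    order = np.argsort(-scores)[:k]
    # Map positions in the filtered subset back to original row ids.
    return [(int(keep[i]), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
metadata = [{"modality": m}
            for m in ["image", "audio", "image", "text", "image", "audio"]]

query = vectors[2]   # reuse item 2 as the query so the top hit is known
hits = filtered_search(query, vectors, metadata,
                       lambda m: m["modality"] == "image", k=2)
```

Production systems apply the same idea inside the index (pre-filtering or post-filtering) rather than scanning every row, but the contract is identical: the filter narrows the candidate set and similarity ranks what remains.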

Real-World Scenario: Visual Search for an Art Marketplace

A 2025 art marketplace had 12 million artwork photos. They needed visual search ("artworks like this one"), reverse search ("which of our artworks is this Instagram screenshot?"), and text search ("oil paintings of seascapes in blue tones"). Their stack: SigLIP 2 for embeddings (768-dim), Qdrant for vector storage, a thin FastAPI service for query handling. Indexing cost: ~$8,000 in one-time GPU compute. Query latency: 40 ms p95 for top-50 retrieval. The visual search drove a 14% lift in browse-to-purchase conversion.

Key Insight

Joint embedding spaces enable efficient cross-modal retrieval by mapping every modality into a shared vector space. CLIP started the lineage; SigLIP and SigLIP 2 are the open-source defaults in 2026; ImageBind and LanguageBind extend to six or more modalities. The architecture is late-fusion by design: efficient at scale but limited to matching, not reasoning. Pair with a vector database and a downstream multimodal LLM for full cross-modal RAG, covered in Section 33.2.

Self-Check
Q1: Why does ImageBind work with image-anchored pairs even though many target retrievals (e.g., audio-to-depth) never appear in training?
ImageBind freezes a pretrained CLIP image encoder and trains each new modality (audio, depth, thermal, IMU) to align its embedding with the CLIP image space. Because every modality is anchored to images, the joint space self-organizes by transitivity: if an audio clip and a depth map both map close to the same image embedding, they end up close to each other in the shared space even though no audio-depth pair was ever shown during training. This is the same trick used in cross-lingual word embeddings via a pivot language; you only need pairs through the hub modality to get alignment across every spoke pair.
Q2: You observe that long descriptive text queries underperform short ones on CLIP-based retrieval. What is the root cause, and what application-layer mitigation can you implement?
CLIP's text encoder was trained on alt-text and image captions, which are linguistically narrow (typical caption length 5 to 15 words). Long queries push the text embedding into a region of the unit sphere the model rarely saw during training, so cross-modal similarity scores degrade. The simplest application-layer mitigation is query compression: use a small LLM (or a few-shot prompt) to rewrite long natural-language queries into 5 to 12 word visual descriptions before encoding ("dog on bench" instead of the full sentence). For domain corpora you can also fine-tune the text encoder on longer queries paired with the same images, but query compression is cheaper and ships in a day.
Q3: The "modality gap" between CLIP's text and image embeddings is a known issue. Describe it and one mitigation.
In CLIP-style joint spaces, text and image embeddings do not fully overlap on the unit sphere; they occupy systematically different regions, so text-to-image and image-to-text similarities exhibit a constant offset bias. The gap appears because the two encoders converge to different attractors during contrastive training and the alignment loss only requires positive pairs to be more similar than negative pairs, not to be identical. Practical mitigations: subtract the per-modality mean (centering) before similarity computation, learn an offset or projection layer that brings the two clouds into closer registration, or use SigLIP-style sigmoid loss which trains more symmetrically and reduces the gap empirically.
Q4: Sketch why a joint-embedding model alone cannot answer "is this person in the photo wearing prescription glasses or sunglasses?" but a downstream VLM can.
A joint-embedding model is late-fusion by design: the image and the text query each get encoded once and combined only at the final dot product, so the model can only match the query against a global image embedding, not reason about specific regions or properties. Distinguishing prescription glasses from sunglasses requires localized visual reasoning (lens darkness, frame style, eye visibility) followed by a categorical judgment, which a single 768-dim cosine score cannot produce. A downstream VLM (GPT-4o, Qwen2-VL) attends jointly across image patches and text tokens at every transformer layer, can ask itself sub-questions, and can output free-form text describing exactly what it sees; the embedding model retrieves the right image, the VLM does the reasoning.

What Comes Next

Section 33.2: Multimodal RAG uses these joint embedding spaces as the first stage of a retrieval-augmented-generation pipeline for image, audio, and video contexts.

See Also

For vision-language model foundations these multimodal retrievers build on, see Chapter 22. For text-only RAG architectures and embedding stores, see Chapter 31. For cross-modal reasoning pipelines that follow, see Section 33.2.

Further Reading

CLIP and Successors

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). ICML. arXiv:2103.00020
Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). "Sigmoid Loss for Language Image Pre-Training" (SigLIP). ICCV. arXiv:2303.15343
Tschannen, M., Gritsenko, A., Wang, X., et al. (2025). "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features." arXiv. arXiv:2502.14786

Multi-Modality Binding

Girdhar, R., El-Nouby, A., Liu, Z., et al. (2023). "ImageBind: One Embedding Space To Bind Them All." CVPR. arXiv:2305.05665
Zhu, B., Lin, B., Ning, M., et al. (2024). "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment." ICLR. arXiv:2310.01852

Data and Evaluation

Xu, H., Xie, S., Tan, X. E., et al. (2024). "Demystifying CLIP Data" (MetaCLIP). ICLR. arXiv:2309.16671
Gadre, S. Y., Ilharco, G., Fang, A., et al. (2024). "DataComp: In search of the next generation of multimodal datasets." NeurIPS. arXiv:2304.14108

Audio CLIP Analogs

Elizalde, B., Deshmukh, S., Al Ismail, M., & Wang, H. (2023). "CLAP: Learning Audio Concepts From Natural Language Supervision." ICASSP. arXiv:2206.04769 Introduces the CLAP name and the contrastive audio-text objective discussed in Section 33.1.7.
Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2022). "AudioCLIP: Extending CLIP to Image, Text and Audio." ICASSP. arXiv:2106.13043 Tri-modal extension of CLIP via an ESResNeXt audio encoder; precursor to ImageBind.
Wu, Y., Chen, K., Zhang, T., et al. (2023). "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP. arXiv:2211.06687 The LAION CLAP paper behind the laion/clap-htsat-unfused checkpoint used in Code Fragment 33.1.2a; source of the chunk-and-fuse (feature fusion) encoder and T5 keyword-to-caption augmentation described in Section 33.1.7.