Section 38.5: Generative Recsys: TIGER, LLaRA, P5

"The fastest retrieval index is the one you do not need, because the next item is something the model can simply utter."
Pixel, Curious Librarian Agent

Big Picture

Entry point (D) is the most radical of the four. Sections 38.2 through 38.4 augmented a classical pipeline with LLMs. Generative recsys replaces a major chunk of that pipeline. Instead of a candidate generator that retrieves from a fixed embedding index, a sequence-to-sequence model generates the next item directly as a sequence of learned tokens. In the purest form of the idea (TIGER), the tokens are not item IDs; they are semantic IDs, positions in a learned hierarchical codebook. The TIGER, LLaRA, and P5 papers each instantiate generative recommendation differently. This section walks through the semantic-ID intuition, the three reference architectures, and the parallel to residual vector quantization in neural audio codecs.

Prerequisites

This section assumes the reader has finished Section 38.1 (LLM entry points into recsys) and is comfortable with the autoregressive sequence-to-sequence training pattern from Chapter 3. Familiarity with residual vector quantization (RVQ) from the audio-codec discussion in Section 20.0.2 sharpens the semantic-ID analogy.

Illustration: a whimsical cartoon magician character pulling sparkling movie tickets out of a top hat shaped like a small filing-cabinet codebook with little drawers labeled by abstract glyphs.

Generative recsys in one image: the model is a magician whose hat is a learned codebook. Each recommendation is pulled out as a short sequence of semantic-ID tokens drawn from the codebook drawers, not retrieved from a fixed catalog index.

38.5.1 The Paradigm Shift

A classical recsys decomposes the catalog into a set of discrete item IDs (item_42, item_8137) and retrieves from an index over those IDs. The model never generates an item; it scores items and ranks them. Generative recsys collapses scoring and retrieval into a single act: a transformer reads the user's history (encoded as a sequence of item tokens) and produces the next item token. The catalog is no longer an external index; it is the model's vocabulary.

The naive version of this idea is to treat each item as its own vocabulary token. A catalog of one million items becomes a vocabulary of one million tokens. At that scale the naive version breaks down: the softmax is too large to learn well, every cold-start item has zero training data, and the embeddings of similar items are unrelated because each item is its own atomic symbol.

The fix is the semantic-ID idea, due to TIGER (Rajput et al. 2023). Each item is mapped to a short sequence of codebook tokens drawn from a much smaller learned codebook. Similar items share token prefixes. A catalog of one million items uses, say, four codebook positions of 256 tokens each, for a vocabulary of just 1024 tokens that can address $256^4 \approx 4.3 \times 10^9$ distinct items. The softmax is small. Cold-start items can borrow training signal from their semantic neighbors that share token prefixes. And the model can generate a new item ID one token at a time, never having to enumerate the catalog.
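The arithmetic behind this claim is easy to check. The short sketch below uses the illustrative values from the paragraph above (four positions, 256 codes each); the function name is ours and exists only for this example.

```python
def semantic_id_budget(levels: int, codes_per_level: int) -> tuple[int, int]:
    """Return (softmax vocabulary size, number of addressable tuples)."""
    vocab_size = levels * codes_per_level        # tokens the model must score
    address_space = codes_per_level ** levels    # distinct tuples it can spell
    return vocab_size, address_space

vocab, space = semantic_id_budget(levels=4, codes_per_level=256)
print(vocab, space)          # 1024 4294967296
print(space // 1_000_000)    # 4294 addressable tuples per item in a 1M catalog
```

The vocabulary grows linearly in the number of levels while the address space grows exponentially, which is why a short code length is enough for large catalogs.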

38.5.2 Semantic IDs from RQ-VAE

The semantic IDs are produced by an RQ-VAE (Residual Quantized Variational Autoencoder). Given an item embedding (computed from the enriched text of Section 38.3, for example), the RQ-VAE quantizes it through a sequence of codebooks. The first codebook captures the coarsest distinction; the second codebook quantizes the residual that the first one missed; the third codebook quantizes the residual of that; and so on. The output is a tuple of integer indices, one per codebook level, which becomes the item's semantic ID.

Diagram: RQ-VAE encoding an item embedding into a four-token semantic ID. The continuous embedding enters on the left. At level 1, the nearest codebook vector is selected and its index recorded; the residual (original minus selected) flows to level 2. At level 2, the same process is repeated with a new codebook. By level 4, the residual is small and the item has been mapped to a 4-tuple of integer codebook indices: its semantic ID.

Figure 38.5.1: RQ-VAE encoding of one item embedding into a 4-token semantic ID. Each level selects the nearest codebook vector and passes the residual to the next level. Similar items share the early tokens (id1 captures the coarse class), so the codebook structure is hierarchical. The same scheme powers neural audio codecs in Chapter 20.

The mathematics is short. Given an embedding $x \in \mathbb{R}^d$ and a codebook $C_1 \in \mathbb{R}^{K \times d}$, the first level chooses $i_1 = \arg\min_i \|x - C_1[i]\|^2$, then computes the residual $r_1 = x - C_1[i_1]$. The second level repeats with codebook $C_2$ on $r_1$ to get $i_2$ and $r_2$. After $L$ levels, the semantic ID is the tuple $(i_1, i_2, \ldots, i_L)$, and the reconstructed embedding is $\hat{x} = \sum_{\ell=1}^L C_\ell[i_\ell]$. Training the codebooks jointly with the (item-embedding) reconstruction loss yields codebooks whose early levels capture coarse semantic structure and whose late levels capture fine distinctions.
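To make the residual mechanics concrete, here is a minimal NumPy sketch. It is not RQ-VAE training: an RQ-VAE learns its codebooks by gradient descent, whereas this sketch swaps in a simpler stand-in, plain k-means fitted level by level on the residuals. The data is random, and the helper names (`kmeans`, `fit_residual_codebooks`) are assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) codebook."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members):              # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def fit_residual_codebooks(x, levels, k):
    """Fit one codebook per level on the residual left by the previous level."""
    codebooks, ids, residual = [], [], x.copy()
    for level in range(levels):
        cb = kmeans(residual, k, seed=level)
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)           # i_l = argmin_i ||r_{l-1} - C_l[i]||^2
        codebooks.append(cb)
        ids.append(idx)
        residual = residual - cb[idx]     # r_l = r_{l-1} - C_l[i_l]
    return codebooks, np.stack(ids, axis=1), residual

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16))            # stand-in for item embeddings
codebooks, sem_ids, r_final = fit_residual_codebooks(x, levels=4, k=8)

# Reconstruction is the sum of the selected code vectors, one per level.
x_hat = sum(cb[sem_ids[:, l]] for l, cb in enumerate(codebooks))

print(sem_ids.shape)                                   # (500, 4)
print(np.allclose(x_hat + r_final, x))                 # True
print(np.linalg.norm(r_final) < np.linalg.norm(x))     # True
```

The second check is the telescoping identity behind the formula for $\hat{x}$: the reconstruction plus the final residual recovers the input exactly, so the quality of a semantic ID is measured by how small that last residual is.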

38.5.3 TIGER: Transformer Index for Generative Recommenders

TIGER (Rajput et al. 2023) is the canonical generative recsys paper. The pipeline has three stages. Stage 1 trains an RQ-VAE on item embeddings (computed from item content) to produce semantic IDs of length $L$, where each token is drawn from a codebook of size $K$ (typical values are $L = 4$, $K = 256$). Stage 2 converts each user history into a sequence of semantic IDs: a user who interacted with items $[a, b, c]$ becomes the token sequence $[a_1, a_2, a_3, a_4, b_1, b_2, b_3, b_4, c_1, c_2, c_3, c_4]$. Stage 3 trains a small transformer (encoder-decoder, T5-style) to map the user-history token sequence to the next item's $L$ semantic-ID tokens.

At inference time, the model decodes $L$ tokens with beam search. Each decoded tuple is looked up in the inverse of the item-to-semantic-ID map to retrieve the recommended item. Because the catalog lives in the model's vocabulary, beam search can reach any item in the catalog, including ones that have never appeared in any user history, without enumerating or scoring the catalog. Figure 38.5.2 shows the full TIGER pipeline from item embeddings through to a decoded item.

Diagram: a horizontal flowchart of the TIGER pipeline. Item embeddings flow into an RQ-VAE that emits a 4-tuple semantic ID per item. User histories are flattened into a stream of semantic-ID tokens, fed to a small Seq2Seq T5 that generates the next item's L tokens, which are then decoded back to a catalog item via a lookup table.

Figure 38.5.2: The full TIGER pipeline. Item embeddings (from the enriched text of Section 38.3) are quantized by an RQ-VAE into 4-tuple semantic IDs. Each user history is flattened into a stream of those tokens and fed to a small Seq2Seq T5. At inference, the decoder generates four tokens that look up against the item-to-semantic-ID map, with constrained decoding to guarantee a real catalog item.

# Sketch of a TIGER training pipeline. Real code uses a properly tuned T5 stack
# and a full RQ-VAE implementation; this is the high-level shape.

import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Config

# --- Stage 1: RQ-VAE produces semantic IDs from item embeddings ---
# Only the residual quantizer is shown. A full RQ-VAE wraps it in an encoder and
# decoder and trains the codebooks with a reconstruction loss.
class RQVAE(nn.Module):
    def __init__(self, dim: int, levels: int, codes_per_level: int):
        super().__init__()
        self.codebooks = nn.ParameterList([
            nn.Parameter(torch.randn(codes_per_level, dim) * 0.01)
            for _ in range(levels)
        ])

    def quantize(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """x: (B, dim). Returns codes (B, L) and the final residual (B, dim)."""
        ids, residual = [], x
        for cb in self.codebooks:
            # Squared distance from each residual to every code vector: (B, K)
            d2 = ((residual.unsqueeze(1) - cb.unsqueeze(0)) ** 2).sum(-1)
            i = d2.argmin(dim=-1)          # nearest code per row
            ids.append(i)
            residual = residual - cb[i]    # hand what is left to the next level
        return torch.stack(ids, dim=-1), residual

    def reconstruct(self, ids: torch.Tensor) -> torch.Tensor:
        """Sum the selected code vectors across levels: (B, L) -> (B, dim)."""
        return sum(cb[ids[..., l]] for l, cb in enumerate(self.codebooks))

# --- Stage 2: convert each user history to a token sequence ---
def user_history_to_tokens(history_item_ids: list[int],
                           item2semantic: dict[int, tuple[int, ...]],
                           level_offset: list[int]) -> list[int]:
    """Each level uses a disjoint token-id range so the T5 vocab is one stream."""
    out = []
    for item_id in history_item_ids:
        for level, code in enumerate(item2semantic[item_id]):
            out.append(level_offset[level] + code)
    return out

# --- Stage 3: train a small T5 to predict the next item's L semantic tokens ---
NUM_LEVELS, NUM_CODES = 4, 256
PAD_ID = NUM_LEVELS * NUM_CODES    # 1024: special tokens sit after the code ids,
EOS_ID = PAD_ID + 1                # so level offsets 0, 256, 512, 768 stay valid
cfg = T5Config(vocab_size=NUM_LEVELS * NUM_CODES + 8,  # L=4 codebooks of K=256 + special tokens
               d_model=256, num_layers=6, num_decoder_layers=6,
               num_heads=8, d_ff=1024,
               pad_token_id=PAD_ID, eos_token_id=EOS_ID,
               decoder_start_token_id=PAD_ID)  # needed to build decoder inputs from labels
model = T5ForConditionalGeneration(cfg)

# Training step: given encoder input = user history tokens,
# decoder target = next item's L semantic tokens.
def train_step(batch, optim):
    out = model(input_ids=batch["history"], labels=batch["next_item_tokens"])
    out.loss.backward()
    optim.step(); optim.zero_grad()
    return out.loss.item()

Code Fragment 38.5.1a: Sketch of a TIGER training pipeline. The RQ-VAE quantizes item embeddings into 4-token semantic IDs; user histories are flattened into token sequences with disjoint per-level offsets so the T5 vocabulary is one stream; the T5 is trained sequence-to-sequence to emit the next item's 4 tokens given the user history. The semantic-ID map is computed once and cached.
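The inference-time lookup is the mirror image of Stage 2. The sketch below undoes the per-level offsets and consults an inverse map; the offsets 0, 256, 512, 768, the two-item catalog, and the function names are assumptions chosen for illustration, not part of any published implementation.

```python
from typing import Optional

LEVEL_OFFSETS = [0, 256, 512, 768]   # assumed layout: level l owns ids [256*l, 256*l + 255]
K = 256

def tokens_to_semantic_id(tokens: list[int]) -> Optional[tuple[int, ...]]:
    """Undo the per-level offsets; return None if a token is outside its level's range."""
    if len(tokens) != len(LEVEL_OFFSETS):
        return None
    codes = []
    for token, offset in zip(tokens, LEVEL_OFFSETS):
        code = token - offset
        if not 0 <= code < K:
            return None
        codes.append(code)
    return tuple(codes)

def lookup_item(tokens, semantic2item):
    """Map decoded tokens to a catalog item id, or None if no such item exists."""
    sid = tokens_to_semantic_id(tokens)
    return None if sid is None else semantic2item.get(sid)

# Hypothetical two-item catalog: item id -> semantic ID.
item2semantic = {101: (12, 47, 193, 8), 102: (12, 47, 60, 3)}
semantic2item = {sid: item for item, sid in item2semantic.items()}

print(lookup_item([12, 303, 705, 776], semantic2item))  # 101
print(lookup_item([12, 303, 705, 777], semantic2item))  # None: well-formed tuple, no such item
print(lookup_item([12, 12, 705, 776], semantic2item))   # None: level-2 token out of range
```

The two `None` cases correspond to the two ways an unconstrained decoder can go wrong: emitting a well-formed tuple that no item owns, and emitting a token from the wrong level's range.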

Worked Example: TIGER semantic IDs on a 1M-item catalog

A streaming platform has 1,048,576 movies. The team trains an RQ-VAE with $L = 4$ codebook positions and $K = 256$ entries per codebook. The total addressable code space is $K^L = 256^4 = 4{,}294{,}967{,}296$, which is 4,096 times the catalog size, so the address space itself is not a constraint. Collisions can still occur, because near-duplicate items quantize to the same codes no matter how large the space is, so the team checks for duplicate tuples after training (the TIGER paper handles this by appending a disambiguating token). The T5 vocabulary used for item tokens is $L \cdot K = 1024$, well within a comfortable softmax. Concretely, "Inception (2010)" has item embedding $x_{\text{Inc}} \in \mathbb{R}^{256}$ from the enriched-text pipeline of Section 38.3. Level 1 picks the nearest of 256 vectors in $C_1$ and emits $i_1 = 12$ (the "high-budget sci-fi" cluster); the residual $r_1 = x_{\text{Inc}} - C_1[12]$ flows to level 2, which emits $i_2 = 47$ (the "Nolan / non-linear narrative" cluster); level 3 emits $i_3 = 193$ (the "dream-logic / heist" cluster); level 4 emits $i_4 = 8$ (a unique tag within the cluster). The semantic ID is $(12, 47, 193, 8)$. After applying the per-level offsets $0, 256, 512, 768$ used by the T5 vocab, the token sequence the model sees is [12, 303, 705, 776]. "Interstellar (2014)" sits in the same Nolan cluster and shares the first two tokens [12, 303, ...], so a model that learns to recommend Inception after a Nolan-heavy history gets free transfer to recommending Interstellar even if the latter has never appeared in any training user's history.
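The offset arithmetic and the shared-prefix claim in this worked example can be checked in a few lines. In the sketch below, the last two codes of Interstellar are hypothetical (the text only fixes its first two), and both helper names are ours.

```python
def to_tokens(semantic_id, codes_per_level=256):
    """Shift level l's code into the id range [l*K, (l+1)*K)."""
    return [level * codes_per_level + code for level, code in enumerate(semantic_id)]

def shared_prefix_len(a, b):
    """Number of leading positions on which two semantic IDs agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

inception = (12, 47, 193, 8)
interstellar = (12, 47, 21, 5)   # hypothetical: only the first two codes come from the text

print(to_tokens(inception))                         # [12, 303, 705, 776]
print(to_tokens(interstellar)[:2])                  # [12, 303]
print(shared_prefix_len(inception, interstellar))   # 2
```

A shared prefix of length 2 means the decoder takes the same first two steps for both movies, which is where the transfer between them comes from.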

38.5.4 LLaRA: Aligning Collaborative Embeddings with LLM Tokens

LLaRA (Liao et al. 2024) takes a hybrid approach. Instead of training a from-scratch transformer over semantic IDs (TIGER), LLaRA adapts a pre-trained LLM to recommendation, keeping the base weights frozen and training lightweight adapters, and injects collaborative-filtering signal through a small projector. Each item has two representations: a textual one (the item title plus an LLM-enriched description, as in Section 38.3) and a collaborative-filtering one (a vector from a SASRec or matrix-factorization model trained on click data). The projector is a tiny MLP that maps the CF vector into the LLM's token embedding space, so the LLM's input sequence becomes [CF_token, text_token_1, text_token_2, ...] for each item.

The LLM is then tuned on prompts built from user histories ("a user watched: [item A] [item B] [item C]; recommend the next item") with the projector trained jointly to minimize the recommendation loss. The result is a model that has both the textual reasoning the LLM brings and the click-pattern signal that classical CF brings, fused at the token level. LLaRA reports gains over both pure-text and pure-CF baselines on standard benchmarks, especially in the cold-start regime where neither baseline alone is strong. Figure 38.5.3 shows the LLaRA architecture: a CF embedding from a frozen SASRec model passes through a small MLP projector into the LLM token-embedding space, and the resulting CF token is concatenated with text tokens for each item in the user history.

Diagram: the LLaRA architecture. A SASRec collaborative-filtering tower produces a CF vector per item; a small MLP projector maps that vector into the LLM's token-embedding space, where it is concatenated with the item's text tokens; the LLM consumes the interleaved sequence of CF tokens and text tokens for the user's history and emits the next item.

Figure 38.5.3: The LLaRA two-stream architecture. A frozen CF tower (SASRec or matrix factorization) emits a per-item collaborative vector $e_{\text{item}}$, which a small trainable MLP projector $P$ maps into the LLM's token-embedding space. The CF token $z = P(e_{\text{item}})$ is interleaved with the item's text tokens for every item in the user history, and the LLM (typically LoRA-fine-tuned) decodes the next item. Cold-start items, which have no click signal, get an all-zero CF token and rely entirely on the text branch.

Formally, write the user's history as a sequence of items $h = (i_1, i_2, \ldots, i_T)$. For each item $i_t$, the model concatenates a CF token and a sequence of text tokens:

$$ z_t = P(e_{i_t}), \qquad x_t = [\,z_t,\; \mathrm{embed}(\mathrm{tok}(\mathrm{title}_{i_t}))\,] , $$

where $P : \mathbb{R}^{d_{\mathrm{cf}}} \to \mathbb{R}^{d_{\mathrm{llm}}}$ is the projector and $e_{i_t}$ is the SASRec embedding for item $i_t$. The full LLM input is the concatenation $[x_1, x_2, \ldots, x_T, \mathrm{prompt}]$, and the training loss is the standard next-item cross-entropy plus an alignment term, averaged over the $T$ history items, that pulls each projected CF token towards the average embedding of its item's text tokens:

$$ \mathcal{L} = -\log P_{\theta}(i_{T+1} \mid x_{1:T}, \mathrm{prompt}) \;+\; \frac{\lambda}{T} \sum_{t=1}^{T} \bigl\| z_t - \overline{\mathrm{embed}(\mathrm{tok}(\mathrm{title}_{i_t}))} \bigr\|^2 . $$

Here $P_{\theta}$ is the LLM's predictive distribution, not the projector $P$. The first term trains the projector and the LoRA adapters to predict the held-out next item; the alignment term keeps the CF token from drifting to a region of embedding space the LLM cannot read. The base LLM weights are usually frozen, so only the projector $P$ and the adapters are updated.

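# Sketch of a LLaRA training step. Real code uses a properly tuned LLaMA stack
# and PEFT LoRA adapters; this is the high-level shape.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class Projector(nn.Module):
    """Two-layer MLP that maps CF vectors into the LLM's embedding space."""
    def __init__(self, d_cf: int, d_llm: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_cf, hidden), nn.GELU(), nn.Linear(hidden, d_llm)
        )
    def forward(self, e_cf):  # (B, T, d_cf) -> (B, T, d_llm)
        return self.net(e_cf)

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated checkpoint; any causal LM fits the sketch
llm = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tok = AutoTokenizer.from_pretrained(MODEL_ID)
llm.requires_grad_(False)  # base weights stay frozen; LoRA adapters are omitted here
proj = Projector(d_cf=64, d_llm=llm.config.hidden_size)

# build_interleaved and text_mean_embeds are placeholder helpers that this sketch
# does not define:
#   build_interleaved -> (inputs_embeds, labels), with labels set to -100 on every
#                        history and prompt position so only the target is scored
#   text_mean_embeds  -> (B, T, d_llm) mean token embedding of each title

def llara_step(history_cf, history_titles, target_item_ids, optim, lam=0.1):
    # history_cf: (B, T, d_cf) frozen SASRec vectors; history_titles: list[list[str]]
    cf_tokens = proj(history_cf)                       # (B, T, d_llm)

    # Build per-item interleaving: [cf_token, text_tokens, cf_token, text_tokens, ...]
    inputs_embeds, labels = build_interleaved(cf_tokens, history_titles, target_item_ids, tok, llm)

    # inputs_embeds (not input_ids) lets the CF tokens bypass the tokenizer.
    out = llm(inputs_embeds=inputs_embeds, labels=labels)
    # Alignment term: keep each CF token close to the mean embedding of its title.
    align = ((cf_tokens - text_mean_embeds(history_titles, tok, llm)) ** 2).mean()
    loss = out.loss + lam * align
    loss.backward(); optim.step(); optim.zero_grad()
    return out.loss.item(), align.item()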

Code Fragment 38.5.2b: Pseudocode of one LLaRA training step. A frozen CF tower produces history embeddings; the trainable projector maps each into the LLM's hidden-dim space; the interleaved sequence is fed to the LLM via inputs_embeds (not input_ids) so the CF tokens bypass the tokenizer; the loss is the standard next-token cross-entropy plus a small alignment term that keeps the CF tokens readable by the LLM.

Worked Example: A LLaRA forward pass on a 3-item history

A user watched three movies: Inception (2010), Interstellar (2014), and Tenet (2020). The frozen SASRec tower returns CF vectors $e_a, e_b, e_c \in \mathbb{R}^{64}$ with, say, $e_a = [0.21, -0.04, \ldots]$. The projector $P$ is a 2-layer MLP with hidden dim 1024 and output dim 4096 (LLaMA-3-8B hidden size), so it produces three CF tokens $z_a, z_b, z_c \in \mathbb{R}^{4096}$. Each movie title tokenizes to roughly 4 sub-word tokens (for example "Inception" tokenizes to ["In", "ception"], plus year and parentheses). The LLM input is the embedding sequence $[z_a, t^{(a)}_1, t^{(a)}_2, t^{(a)}_3, t^{(a)}_4, z_b, t^{(b)}_1, \ldots, z_c, t^{(c)}_1, \ldots, p_1, p_2, \ldots]$ where the last block is the prompt "Recommend the next movie:". With $T = 3$, one CF token plus roughly 4 text tokens per item, and about 12 tokens for the prompt and its template, the input length is roughly $3 \cdot 5 + 12 = 27$ embeddings. The LLM then decodes a movie title (for example "Dunkirk (2017)"), which is looked up against the catalog. The alignment term pulls $z_a$ towards the mean text embedding of "In ception ( 2010 )", so the LLM's attention layers can reason about Inception consistently whether they attend to the CF token or to the text tokens.
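The interleaving and the alignment term can be traced with toy arrays. The NumPy sketch below uses deliberately tiny, made-up sizes (not the 64 and 4096 dimensions of the worked example), random vectors in place of real embeddings, and a single linear map standing in for the MLP projector; all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_cf, d_llm = 4, 8                          # toy sizes chosen for readability

# Stand-ins for the frozen CF tower, the projector, and the LLM embedding table.
W = rng.normal(size=(d_cf, d_llm))          # a linear "projector" P
cf_vectors = rng.normal(size=(3, d_cf))     # e_a, e_b, e_c
title_embeds = [rng.normal(size=(n, d_llm)) for n in (2, 3, 4)]  # text tokens per title
prompt_embeds = rng.normal(size=(5, d_llm))

z = cf_vectors @ W                          # projected CF tokens, shape (3, d_llm)

# Interleave: [z_a, title_a tokens, z_b, title_b tokens, z_c, title_c tokens, prompt]
parts = []
for z_t, text_t in zip(z, title_embeds):
    parts.append(z_t[None, :])              # one CF token per item
    parts.append(text_t)                    # followed by that item's text tokens
parts.append(prompt_embeds)
inputs_embeds = np.concatenate(parts, axis=0)

# Alignment term: squared distance between each CF token and the mean embedding
# of its title, averaged over the history items.
text_means = np.stack([t.mean(axis=0) for t in title_embeds])
align = float(((z - text_means) ** 2).sum(axis=1).mean())

print(inputs_embeds.shape)   # (17, 8)
```

The sequence length is the number of history items (one CF token each) plus the total number of text tokens plus the prompt length, here 3 + 9 + 5 = 17.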

TIGER trains a small transformer from scratch over semantic IDs; LLaRA fine-tunes a large pre-trained LLM with a projector. The tradeoffs are roughly inverse. TIGER is cheaper to train and to serve and is well-suited to catalogs with strong content signal. LLaRA is more expensive on both axes but inherits the LLM's world knowledge, which helps when the catalog is sparse (book recommendations on a new platform with no click history) or when textual reasoning is the dominant signal (long-form articles, code snippets).

38.5.5 P5: Unified Text-to-Text for Recsys

P5 (Geng et al. 2022) is the earliest of the three. It frames every recsys task (rating prediction, sequential recommendation, explanation generation, review summarization, top-k recommendation) as text-to-text in a T5-style format. A rating prediction prompt looks like "User_42 saw Movie_8137. Predict rating.", and the answer is the digit "4". A sequential-recsys prompt looks like "User_42 history: Movie_1, Movie_2, Movie_3. Next?", and the answer is "Movie_8137".

P5 used naive item IDs (Movie_8137 as a literal string token); the semantic-ID innovation of TIGER came later. But P5 established the multi-task text-to-text framing, which both TIGER and LLaRA inherited. The framing matters because it lets a single model handle the full recsys workload: rating, ranking, explanation, summarization, all in one shared parameter space. The shared-parameter setup also allows transfer across tasks: training on review summarization can improve top-k recommendation, because both tasks need the model to understand what makes items similar. Figure 38.5.4 shows how five distinct recsys tasks are flattened into the same input-output text format and routed through the same T5 backbone.

Diagram: the P5 unified text-to-text framing. Five different recsys tasks (rating prediction, sequential next-item, explanation generation, review summarization, top-k recommendation) are each shown as a prompt template feeding into a single shared T5 encoder-decoder, which emits a free-form text answer in the format the corresponding task expects.

Figure 38.5.4: P5 unifies five recsys tasks (rating, sequential next-item, explanation, review summarization, top-k) into a single text-to-text format and trains one T5 backbone jointly. The output format is task-specific (a digit for rating, an item ID string for ranking, free text for explanation) but the model is one parameter set.

Every P5 task instance is a (prompt, target) text pair. The training objective is the standard sequence-to-sequence cross-entropy, averaged over every example in the pooled multi-task training set:

$$ \mathcal{L}_{\mathrm{P5}} = -\frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \sum_{j=1}^{|y|} \log P_{\theta}\bigl(y_j \mid y_{<j},\, x\bigr) , $$

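where $x$ is the prompt (a templated string like "User_42 history: M1, M2, M3. Next?"), $y$ is the target text ("M8137", a digit, a free-text explanation, or a comma-separated top-k list), and $\mathcal{D}$ is the union of all task-specific training sets. Item IDs and user IDs are treated here as atomic vocabulary tokens, so the embedding for Movie_8137 is a single learned vector that encodes nothing about the movie until interactions with it have been observed. (The original P5 implementation splits ID strings into sub-word pieces with the T5 tokenizer, but the pieces are just as arbitrary.) This is what limits P5 on cold-start items and motivated the semantic-ID idea of TIGER.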
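One consequence of the objective as written is worth seeing numerically: the inner sum runs over target tokens without length normalization, so long targets contribute more to the loss than one-token targets. The sketch below uses hypothetical per-token probabilities rather than a real model, and the function name is ours. Library implementations often average over tokens instead, which changes the weighting.

```python
import math

def p5_loss(dataset):
    """dataset: list of targets, each a list of probabilities P(y_j | y_<j, x)."""
    total = 0.0
    for token_probs in dataset:
        total += -sum(math.log(p) for p in token_probs)   # inner sum over j
    return total / len(dataset)                           # 1/|D| outer average

# Hypothetical per-token probabilities for two training examples.
rating_example = [0.5]               # one-token target such as "4"
explanation_example = [0.5] * 12     # twelve-token free-text target

loss = p5_loss([rating_example, explanation_example])
print(round(loss, 4))                # 4.5055
```

Even though the model is equally uncertain about every token, the explanation example accounts for twelve thirteenths of the total, so task mixing ratios and target lengths both shape what a multi-task model of this kind ends up optimizing.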

# Sketch of a P5 multi-task batch. Real code uses the full P5 prompt template
# library and the T5-base tokenizer; this is the high-level shape.

from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Five task templates from the P5 prompt catalog
templates = {
    "rating":  ("Predict the rating that {user} gives {item}.",       "{rating}"),
    "next":    ("{user} watched {hist}. What is the next movie?",     "{next_item}"),
    "explain": ("Explain why {user} might like {item}.",              "{explanation}"),
    "summ":    ("Summarize this review: {review}",                    "{summary}"),
    "topk":    ("Recommend 5 movies for {user}.",                     "{top5}"),
}

def make_batch(examples):
    inputs, targets = [], []
    for ex in examples:
        in_tpl, out_tpl = templates[ex["task"]]
        inputs.append(in_tpl.format(**ex))
        targets.append(out_tpl.format(**ex))
    enc = tok(inputs, padding=True, return_tensors="pt")
    lbl = tok(targets, padding=True, return_tensors="pt").input_ids
    lbl[lbl == tok.pad_token_id] = -100
    return enc.input_ids, enc.attention_mask, lbl

def p5_step(examples, optim):
    ids, mask, labels = make_batch(examples)
    out = model(input_ids=ids, attention_mask=mask, labels=labels)
    out.loss.backward(); optim.step(); optim.zero_grad()
    return out.loss.item()

Code Fragment 38.5.3b: One P5 training step on a heterogeneous batch of five recsys tasks. The five task templates serialize different prediction problems (rating, sequential next-item, explanation, summarization, top-k) into the same text-in, text-out shape, so the same T5 backbone handles them through a single cross-entropy loss. The shared parameter space is what lets summarization and ranking transfer to each other.

Worked Example: One P5 batch on MovieLens

Suppose the training batch holds five examples, one per task, drawn from MovieLens. The rating example becomes the prompt "Predict the rating that User_42 gives Movie_8137." with target "4". The sequential-recsys example becomes "User_42 watched Movie_1, Movie_2, Movie_3. What is the next movie?" with target "Movie_8137". The explanation example becomes "Explain why User_42 might like Movie_8137." with target "Because the user enjoyed slow-burn space opera and this film has the same director.". The summarization example becomes "Summarize this review: Bold cinematography but a flat third act." with target "Visually strong, narratively weak.". The top-5 example becomes "Recommend 5 movies for User_42." with target "Movie_8137, Movie_3201, Movie_119, Movie_4488, Movie_77". The model sees all five prompts in one batch and back-propagates a single cross-entropy loss, computed over the target tokens of all five examples, through the same T5 parameters; gradients from the summarization example reshape the encoder representation of "slow-burn space opera", and the next batch's recommendation example reuses that improved representation when scoring candidate movies.

38.5.6 The Three Models at a Glance

Table 38.5.1b: The three generative-recsys reference architectures, compared on the main design axes.

Axis	P5 (Geng 2022)	TIGER (Rajput 2023)	LLaRA (Liao 2024)
Backbone	Pre-trained T5 (small or base), fine-tuned	Small T5, from scratch	Pre-trained LLM (LLaMA-7B), fine-tuned
Item representation	Atomic ID token ("Movie_8137")	Semantic ID (4 codebook tokens)	CF vector + text tokens
Cold-start handling	Poor (no signal for new IDs)	Good (semantic prefix shared with neighbors)	Good (text branch always works)
Vocabulary size	O(catalog)	O(L * K), e.g. 1024	LLM vocab + small projector
Tasks covered	Rating, sequential, explanation, review summarization, top-k	Sequential recommendation	Sequential recommendation
Train cost	Medium	Low	High (LLM fine-tune)
Serve cost	Medium	Low	High
Best for	Multi-task recsys research	Production sequential recsys at scale	Sparse-CF catalogs, textual reasoning

38.5.7 Open Questions

Generative recsys is the newest line of work in this chapter. As of 2026, the published results are mostly offline on standard benchmarks (Amazon Reviews, MovieLens, Yelp). Production deployments exist but are still rare. Three open questions matter for anyone considering the architecture.

The first is catalog refresh. When a new item is added to the catalog, the RQ-VAE must assign it a semantic ID. If the codebooks were trained on the historical catalog, the new item might map to a semantic ID that aliases an existing item (the codebook is finite). Production systems either reserve a "new item" prefix in the codebook, retrain the RQ-VAE periodically, or use a longer code length so collisions are unlikely. None of the three is a complete solution.
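Whatever refresh policy is chosen, aliasing has to be detected and resolved before the map is served. The sketch below shows one simple remedy, appending a tie-breaking index to colliding tuples, which is in the spirit of the extra token described in the TIGER paper. The catalog and the function names are hypothetical.

```python
from collections import defaultdict

def collision_rate(item2semantic):
    """Fraction of items that do not have a tuple of their own."""
    distinct = len(set(item2semantic.values()))
    return 1 - distinct / len(item2semantic)

def disambiguate(item2semantic):
    """Append a per-collision counter so every item gets a unique tuple."""
    groups = defaultdict(list)
    for item, sid in sorted(item2semantic.items()):
        groups[sid].append(item)
    unique = {}
    for sid, items in groups.items():
        for rank, item in enumerate(items):
            unique[item] = sid + (rank,)     # extra trailing token breaks the tie
    return unique

# Hypothetical catalog in which a newly added item (id 4) aliases item 1.
item2semantic = {1: (12, 47, 193), 2: (12, 47, 60), 3: (5, 9, 200), 4: (12, 47, 193)}

print(collision_rate(item2semantic))      # 0.25
unique = disambiguate(item2semantic)
print(unique[1], unique[4])               # (12, 47, 193, 0) (12, 47, 193, 1)
print(collision_rate(unique))             # 0.0
```

The trade-off is that the tie-breaking token carries no semantics, so the model has to learn it from interactions alone; a high collision rate is therefore a sign that the codebooks need retraining rather than more tie-breaking.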

The second is hallucination. A generative model can decode a semantic ID that does not correspond to any real item (an out-of-distribution code combination). Constrained decoding against the item-to-semantic-ID map fixes this, but adds latency and complexity. The same problem appears in text-to-SQL and is solved similarly there.
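A minimal sketch of constrained decoding, assuming the valid semantic IDs are stored in a prefix trie and the model is abstracted as a function from a prefix to next-token log-probabilities. The catalog, the toy scoring function, and all names here are hypothetical; a production system would plug the same trie check into its decoder's beam search.

```python
import math

def build_trie(semantic_ids):
    """Nested dicts: a path exists in the trie only if some real item has that prefix."""
    root = {}
    for sid in semantic_ids:
        node = root
        for code in sid:
            node = node.setdefault(code, {})
    return root

def constrained_beam_search(log_prob_fn, trie, length, beam_width):
    """log_prob_fn(prefix) -> dict {code: log-prob} for the next position."""
    beams = [((), 0.0, trie)]                       # (prefix, score, trie node)
    for _ in range(length):
        candidates = []
        for prefix, score, node in beams:
            scores = log_prob_fn(prefix)
            for code, child in node.items():        # expand only catalog-valid codes
                step = scores.get(code, -math.inf)
                candidates.append((prefix + (code,), score + step, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(prefix, score) for prefix, score, _ in beams]

# Hypothetical three-item catalog with two levels of two codes each.
catalog = [(1, 0), (0, 1), (0, 0)]
trie = build_trie(catalog)

def toy_log_probs(prefix):
    """Toy 'model' that always prefers code 1, so its unconstrained favorite is (1, 1)."""
    probs = {0: 0.4, 1: 0.6} if not prefix else {0: 0.3, 1: 0.7}
    return {code: math.log(p) for code, p in probs.items()}

results = constrained_beam_search(toy_log_probs, trie, length=2, beam_width=2)
print([prefix for prefix, _ in results])    # [(0, 1), (1, 0)]
```

The toy model's highest-probability tuple, (1, 1), is not in the catalog and never appears in the output. The added cost is one dictionary lookup per expanded beam per step, plus the memory for the trie.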

Fun Fact: A Brief Note from the AI Model

I tried to recommend a movie by predicting ['c12', 'c47', 'c193', 'c008']. Turns out the user wanted "Inception" and not codebook entry c008. The lookup table now has a brand-new row labelled "movie that does not exist," and I have been politely asked to add constrained decoding to my evening routine. signed, A Confused Semantic-ID Decoder.

The third is feedback integration. Classical recsys updates the user representation continuously from click feedback. Generative recsys must somehow let the next user-history token sequence reflect the latest click without retraining the model. The current best practice is to recompute the user history before each inference call (cheap) and to retrain the model nightly or weekly to incorporate the cumulative click distribution shift.

Research Frontier

The most active 2025-2026 research thread is unifying generative recsys with conversational recsys: can the same model that generates next-item semantic IDs also generate justifications, ask clarifying questions, and handle multi-turn refinement? Early results (the "chat-with-your-recommender" papers) suggest yes, but the offline benchmarks and the conversational benchmarks pull the model in different directions: optimizing for one slightly hurts the other. Expect this to be the active question for the next two years.

Warning: Do Not Skip the Classical Baseline

Generative-recsys papers typically compare against a strong classical baseline (SASRec, BERT4Rec, two-tower CF). The reported wins are real but usually modest: 5 to 15 percent relative improvement on recall@10. For most production systems, the engineering cost of standing up a generative-recsys pipeline (RQ-VAE training, code refresh, constrained decoding) is larger than the cost of squeezing 5 to 15 percent more out of the classical pipeline. Adopt generative recsys only after exhausting the easier wins from Sections 38.2 through 38.4.

Lab 38.5: Build a TIGER-style semantic-ID recommender on MovieLens

Steps

Estimated time: 60-90 minutes. Goal: ship an end-to-end generative recsys baseline.

Data. Download MovieLens-1M and join titles, genres, and tag-genome text into one row per movie (the "enriched-text" representation from Section 38.3). Build a chronologically ordered list of (user, watched-movie) interactions and split 80/10/10 for train, validation, and test.
Item embeddings. Use sentence-transformers/all-MiniLM-L6-v2 to encode each movie's enriched text into a 384-dim vector. Cache the vectors to a .npy file so you can iterate on the next steps without re-embedding.
Semantic IDs via RQ-VAE. Train a 4-level residual-vector-quantization auto-encoder ($L = 4$, $K = 256$) on the movie embeddings. Measure the collision rate, the fraction of movies whose 4-tuple is shared with another movie (below 0.1 percent is the target), and resolve any remaining collisions so that every movie ends up with a unique ID. For each movie, store its (i_1, i_2, i_3, i_4) tuple plus the per-level offsets that turn the tuple into a sequence of T5 vocabulary IDs.
Sequence model. Train t5-small (or any encoder-decoder under 100M params), with its embedding table re-initialized for the new vocabulary, to map a user's chronological history of semantic-ID sequences to the next movie's semantic-ID sequence. Use teacher forcing and a vocabulary restricted to the 1024 semantic-ID tokens plus a few special tokens.
Evaluation. On the held-out test interactions, beam-search the top-10 semantic-ID sequences, decode back to movie IDs, and compute Recall@10 and NDCG@10 against the next-watched movie. Compare against a SASRec baseline trained on the same split; report the deltas and a one-paragraph error analysis (which user histories does the generative model win on, and which does it lose on).
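For the evaluation step, both metrics reduce to simple formulas because each test case has exactly one relevant item (the next-watched movie). The sketch below is one straightforward way to compute them; the toy rankings are hypothetical and the function names are ours.

```python
import math

def recall_at_k(ranked_items, target, k=10):
    """1 if the target appears in the top k, else 0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def evaluate(predictions, targets, k=10):
    """Average both metrics over all test cases."""
    n = len(targets)
    recall = sum(recall_at_k(p, t, k) for p, t in zip(predictions, targets)) / n
    ndcg = sum(ndcg_at_k(p, t, k) for p, t in zip(predictions, targets)) / n
    return recall, ndcg

# Hypothetical decoded rankings for three test cases.
predictions = [[7, 3, 9], [4, 8, 1], [2, 5, 6]]
targets = [7, 1, 10]     # hit at rank 1, hit at rank 3, miss

recall, ndcg = evaluate(predictions, targets, k=3)
print(round(recall, 4), round(ndcg, 4))   # 0.6667 0.5
```

Run the same function on the SASRec baseline's rankings so that both models are scored by identical code on identical test cases.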

Success criteria: Recall@10 within 10 percent of SASRec on the same split, zero hallucinated movie IDs (every decoded tuple maps to a real catalog item), and a written analysis of at least three failure cases. Stretch goal: add a LLaRA-style projector that injects the SASRec embedding as a leading token and re-measure.

Self-Check

A catalog has 5 million items. The team is debating between (a) a TIGER-style RQ-VAE with $L = 4$, $K = 256$ and (b) a LLaRA-style fine-tuned 7B LLM with a CF projector. The catalog updates daily and most queries are voice-driven (sub-300 ms latency budget). Which architecture fits better and why?

Show Answer

TIGER. Three reasons. First, the small T5 in TIGER typically serves at single-digit milliseconds per inference, comfortably inside a 300 ms voice budget; the 7B LLaRA model usually needs 100 to 300 ms for the LLM call alone, leaving no slack for ASR, TTS, and network. Second, daily catalog updates only need an incremental RQ-VAE pass (or a "new item" reserved prefix), not a full LLM fine-tune. Third, $L=4$, $K=256$ gives $256^4 \approx 4.3 \times 10^9$ addressable codes, well above the 5M-item catalog, so the code space is not the constraint (any residual collisions between near-duplicate items still need to be checked for and disambiguated). LLaRA would be the right pick if the catalog were small but textually rich (a few thousand long-form articles), where the LLM's world knowledge dominates the value.

Key Insight

Generative recsys replaces the retrieve-from-index step with a sequence model that emits semantic-ID tokens. The semantic IDs come from an RQ-VAE that quantizes item embeddings through hierarchical codebooks, the same trick neural audio codecs use to compress sound. TIGER is the canonical pure-generative architecture, LLaRA fuses a fine-tuned LLM with classical CF through a projector, and P5 established the unified text-to-text framing that both inherit. As of 2026, the published wins are modest (5 to 15 percent over classical baselines) but real, and the architecture is most attractive for catalogs with strong content signal and tight serving budgets.

What Comes Next

The four entry points are covered. The remaining cross-cutting concern is how to measure whether any of this is working. The next section, Section 38.6: Evaluation, Production Patterns, and Open Challenges, walks through the offline metrics (recall@k, NDCG, MAP), the LLM-judged metrics that traditional ones miss (diversity, novelty, justification quality), online testing, the two-stage retrieval pattern that dominates production, and the new failure modes (hallucinated items, prompt-injection in user preferences) that LLM-based recsys introduce.

Further Reading

Rajput, S. et al. (2023). "Recommender Systems with Generative Retrieval (TIGER)." NeurIPS 2023. The TIGER paper. Introduces the semantic-ID idea, the RQ-VAE quantization of item embeddings, and the sequence-to-sequence next-item generation that Section 38.5.3 walks through.

Liao, J. et al. (2024). "LLaRA: Large Language-Recommendation Assistant." SIGIR 2024. The LLaRA paper. Fine-tunes a 7B LLM for sequential recommendation and injects collaborative-filtering signal through a small projector, the hybrid pattern of Section 38.5.4.

Geng, S. et al. (2022). "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)." RecSys 2022. The P5 paper. Establishes the multi-task text-to-text framing for recsys that both TIGER and LLaRA inherit.

Zeghidour, N. et al. (2021). "SoundStream: An End-to-End Neural Audio Codec." IEEE/ACM TASLP. The audio-codec paper that brought residual vector quantization to end-to-end neural codecs, the same quantization scheme TIGER's RQ-VAE relies on. The parallel between codec tokens and semantic-ID tokens is the bridge between Section 38.5 and Chapter 20.

Kang, W.-C. & McAuley, J. (2018). "Self-Attentive Sequential Recommendation (SASRec)." ICDM 2018. The classical sequential recsys baseline that all three of the papers above are measured against. Necessary reading for understanding what generative recsys must beat.

Lin, J. et al. (2023). "How Can Recommender Systems Benefit from Large Language Models: A Survey." arXiv:2306.05817. A taxonomy-oriented survey of LLM-recsys integration, with a useful side-by-side of the generative approaches in this section and the augmentation approaches in Sections 38.2 through 38.4.