Part V: Retrieval and Conversation
Chapter 21: Building Conversational AI Systems

Memory & Context Management

Memory is what turns a sequence of isolated exchanges into a genuine relationship.

Echo Echo, Sentimental AI Agent
Big Picture

Memory is what transforms a stateless LLM into a conversational partner that remembers. Without memory, every conversation starts from zero, and the system forgets everything the user said 30 minutes ago. With well-designed memory, the system can recall user preferences from weeks ago, summarize long conversations without losing critical details, and maintain continuity across sessions. Building on the context window constraints discussed in Section 14.7, this section covers the full spectrum of memory architectures, from simple sliding windows to sophisticated self-managing memory systems like MemGPT/Letta, giving you the tools to choose and implement the right memory strategy for your application.

Prerequisites

Memory management in conversations builds on the dialogue architecture from Section 21.1 and connects to the embedding and retrieval concepts in Section 19.1 (for memory retrieval via embeddings). Understanding token limits and context window management from Section 10.2 is essential, as memory strategies are fundamentally about managing finite context windows effectively.

Figure 21.3.1: Memory management in conversational AI: short-term memory holds the current chat, long-term memory stores user preferences, and working memory juggles it all without spilling.

1. The Memory Problem in Conversational AI

LLMs process conversations through a fixed-size context window. When the conversation history exceeds this window, older messages are simply dropped, taking important information with them. This fundamental limitation creates several practical problems: the system forgets what the user said earlier in a long conversation, it cannot recall information from previous sessions, and it has no way to distinguish important details from routine exchanges.

Memory management in conversational AI addresses these problems through a layered architecture that mirrors (loosely) how human memory works. Short-term memory holds recent conversation turns in full fidelity. Long-term memory stores compressed summaries, key facts, and searchable records that can be retrieved when relevant using the techniques from Chapter 19. The challenge lies in deciding what to remember, how to compress it, and when to retrieve it.

Fun Fact

Human working memory holds roughly 7 items (plus or minus 2), a number established by George Miller in 1956. A 128K-token context window holds roughly 100,000 words. Yet both humans and LLMs still forget the important thing you told them ten minutes ago.

Key Insight

The layered memory architecture in conversational AI directly mirrors the three-store model of human memory proposed by Atkinson and Shiffrin (1968). Their model distinguished sensory memory (milliseconds), short-term memory (seconds to minutes, limited capacity), and long-term memory (potentially unlimited, requiring encoding and retrieval). The conversational system's raw message buffer corresponds to short-term memory with its limited window; the summarized, searchable long-term store corresponds to consolidated long-term memory. The process of summarizing recent turns before evicting them from the context window is analogous to the "rehearsal" and "consolidation" processes that transfer human short-term memories into long-term storage. Even the failure modes are parallel: humans suffer from retroactive interference (new memories overwrite old ones) and retrieval failure (the memory exists but cannot be found), both of which plague LLM memory systems when summaries lose detail or vector search returns irrelevant prior context.

Common Misconception: Larger Context Windows Replace Memory Systems

With models offering 128K or even 1M token context windows, a common assumption is that memory management is no longer needed: just stuff the entire conversation history into the prompt. This approach fails for three reasons. First, cost scales linearly with context length; sending 100K tokens per request is expensive at scale. Second, the lost-in-the-middle effect (covered in Section 20.1) means models pay less attention to information in the middle of long contexts, so important details from earlier in the conversation get overlooked. Third, cross-session memory requires persisting information beyond a single API call, which no context window can provide. A well-designed memory system with summarization, priority-based eviction, and vector-backed retrieval outperforms brute-force context stuffing on both cost and quality.
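The cost argument is easy to quantify with a back-of-envelope sketch. The per-token price below is an illustrative assumption, not a quoted rate for any particular model:

```python
# Back-of-envelope comparison of context stuffing vs. summarized memory.
# The per-token price is an illustrative assumption, not a quoted rate.
PRICE_PER_1K_INPUT_TOKENS = 0.005  # assumed USD per 1K input tokens

def request_cost(context_tokens: int) -> float:
    """Input-token cost of one request at the assumed rate."""
    return context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

# A 50-turn conversation, one request per turn: full 100K-token history
# versus a 4K-token summarized context.
stuffed = 50 * request_cost(100_000)
summarized = 50 * request_cost(4_000)
print(f"stuffed: ${stuffed:.2f}  summarized: ${summarized:.2f}")
# stuffed: $25.00  summarized: $1.00
```

Whatever the actual rate, the ratio is what matters: per-request cost grows with every token of history you resend, so a memory layer that keeps the context near a fixed budget caps the cost per turn.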

Tip

Start with the simplest memory strategy that meets your needs. A sliding window of the last 20 messages works for most single-session chatbots. Add summarization only when conversations regularly exceed your context window. Add long-term memory (vector-based retrieval of past sessions) only when cross-session personalization is a product requirement, not an assumed need.
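That simplest strategy fits in a few lines. This is a minimal sketch (the class name and message format are illustrative, not a library API):

```python
from collections import deque

class LastNMemory:
    """Minimal sliding window: keep only the last N messages."""

    def __init__(self, max_messages: int = 20):
        # A deque with maxlen drops the oldest entry automatically
        self.buffer = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.buffer.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        return list(self.buffer)

memory = LastNMemory(max_messages=3)
for i in range(5):
    memory.add("user", f"message {i}")
print([m["content"] for m in memory.get_context()])
# ['message 2', 'message 3', 'message 4']
```

Counting messages rather than tokens is the main limitation: a few very long messages can still blow the context budget, which is what the token-aware variant in the next section addresses.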

Figure 21.3.2 shows the layered memory architecture.

[Diagram] The LLM context window assembles: system prompt + memory context + recent turns + user message. Feeding into it are four stores. Short-Term Memory: full conversation buffer (last N turns), sliding window with overlap, token-counted message queue. Long-Term Memory: conversation summaries, vector store (semantic search), entity/fact extraction. Session Store: per-session history, session metadata, resume/restore capability. User Profile: preferences and settings, biographical facts, interaction patterns.
Figure 21.3.2: Layered memory architecture showing how short-term memory, long-term memory, session storage, and user profiles feed into the LLM context window.

2. Short-Term Memory Strategies

Short-term memory holds the most recent portion of the conversation in its original form. The simplest approach is a fixed-size buffer that keeps the last N messages. More sophisticated approaches use token-based budgeting to maximize the amount of conversation that fits within the context window. Code Fragment 21.3.1 below puts this into practice.

Token-Aware Sliding Window

This snippet implements a sliding-window context manager that trims conversation history to fit within the model's token limit.


# Token-aware sliding window: a Message dataclass plus a SlidingWindowMemory
# class that evicts the oldest messages when the token budget is exceeded.
import time

import tiktoken
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    token_count: int = 0
    timestamp: float = 0.0
    importance: float = 1.0  # 0.0 to 1.0

class SlidingWindowMemory:
    """Token-aware sliding window that maximizes conversation retention
    within a fixed token budget."""

    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)
        self.messages: list[Message] = []
        self.total_tokens = 0

    def add_message(self, role: str, content: str,
                    importance: float = 1.0) -> None:
        """Add a message and evict the oldest messages if over budget."""
        token_count = len(self.encoder.encode(content))
        msg = Message(
            role=role, content=content,
            token_count=token_count,
            timestamp=time.time(),
            importance=importance
        )
        self.messages.append(msg)
        self.total_tokens += token_count

        # Evict oldest messages until within budget
        while self.total_tokens > self.max_tokens and len(self.messages) > 1:
            removed = self.messages.pop(0)
            self.total_tokens -= removed.token_count

    def get_context(self) -> list[dict]:
        """Return messages formatted for the LLM API."""
        return [
            {"role": m.role, "content": m.content}
            for m in self.messages
        ]

    def get_token_usage(self) -> dict:
        """Report current memory utilization."""
        return {
            "used_tokens": self.total_tokens,
            "max_tokens": self.max_tokens,
            "utilization": self.total_tokens / self.max_tokens,
            "message_count": len(self.messages)
        }

# Usage
memory = SlidingWindowMemory(max_tokens=4000)
memory.add_message("user", "Hi, I'm looking for a new laptop.")
memory.add_message("assistant", "I'd be happy to help! What will you primarily use it for?")
memory.add_message("user", "Mostly software development and occasional video editing.")

print(memory.get_token_usage())
# Example output (exact token counts depend on the tokenizer version):
# {'used_tokens': 42, 'max_tokens': 4000, 'utilization': 0.0105, 'message_count': 3}
Code Fragment 21.3.1: A token-aware sliding window: the Message dataclass and SlidingWindowMemory with budget-based eviction.

3. Long-Term Memory with Summarization

When conversations grow beyond what the sliding window can hold, summarization compresses older portions of the conversation into shorter representations. The key design decision is when to summarize and how to balance compression (saving tokens) against information retention (keeping important details).

Progressive Summarization

Progressive summarization works by maintaining multiple levels of compression. Recent messages are kept in full. Slightly older messages are summarized into a paragraph. Much older content is compressed into a single sentence or key-value pair. This approach preserves detail where it matters most (recent context) while retaining the gist of earlier exchanges. Code Fragment 21.3.2 below puts this into practice.


# Progressive summarization memory: recent turns kept verbatim, older turns
# compressed into summaries, with durable facts extracted along the way.
from openai import OpenAI

client = OpenAI()

class ProgressiveSummarizationMemory:
    """Memory system with progressive summarization layers."""

    def __init__(self, full_window: int = 10, summary_trigger: int = 8):
        self.full_messages: list[dict] = []  # Recent, full fidelity
        self.summaries: list[str] = []       # Compressed older content
        self.key_facts: list[str] = []       # Extracted important facts
        self.full_window = full_window
        self.summary_trigger = summary_trigger

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        """Add a conversation turn, triggering summarization if needed."""
        self.full_messages.append({"role": "user", "content": user_msg})
        self.full_messages.append(
            {"role": "assistant", "content": assistant_msg}
        )

        # Trigger summarization when the buffer reaches full_window turns
        if len(self.full_messages) >= self.full_window * 2:
            self._summarize_oldest()

    def _summarize_oldest(self) -> None:
        """Summarize the oldest messages and move them to the summary tier."""
        # Move the oldest summary_trigger turns (two messages each) out of the buffer
        to_summarize = self.full_messages[:self.summary_trigger * 2]
        self.full_messages = self.full_messages[self.summary_trigger * 2:]

        # Format for summarization
        conversation_text = "\n".join(
            f"{m['role'].title()}: {m['content']}"
            for m in to_summarize
        )

        # Generate summary
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Summarize this conversation segment in 2-3 sentences. "
                    "Preserve: user preferences, decisions made, "
                    "unresolved questions, and key facts.\n\n"
                    f"{conversation_text}"
                )
            }],
            temperature=0.3,
            max_tokens=200
        )
        summary = response.choices[0].message.content
        self.summaries.append(summary)

        # Extract key facts
        self._extract_facts(conversation_text)

        # Compress old summaries if they accumulate
        if len(self.summaries) > 5:
            self._compress_summaries()

    def _extract_facts(self, text: str) -> None:
        """Extract durable facts from conversation text."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Extract key facts from this conversation that should "
                    "be remembered long-term. Return as a bullet list. "
                    "Focus on: user preferences, personal details, "
                    "decisions, and important context.\n\n" + text
                )
            }],
            temperature=0.0,
            max_tokens=200
        )
        facts = response.choices[0].message.content.strip().split("\n")
        self.key_facts.extend(
            f.strip("- ").strip() for f in facts if f.strip()
        )

    def _compress_summaries(self) -> None:
        """Merge multiple summaries into a single compressed summary."""
        all_summaries = "\n".join(self.summaries)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Merge these conversation summaries into a single "
                    "concise paragraph. Keep the most important details.\n\n"
                    + all_summaries
                )
            }],
            temperature=0.3,
            max_tokens=200
        )
        self.summaries = [response.choices[0].message.content]

    def build_context(self, system_prompt: str) -> list[dict]:
        """Build the full context for an LLM call."""
        context = [{"role": "system", "content": system_prompt}]

        # Add key facts
        if self.key_facts:
            facts_text = "Key facts about this user:\n" + "\n".join(
                f"- {f}" for f in self.key_facts[-15:]
            )
            context.append({"role": "system", "content": facts_text})

        # Add conversation summaries
        if self.summaries:
            summary_text = (
                "Summary of earlier conversation:\n"
                + "\n".join(self.summaries)
            )
            context.append({"role": "system", "content": summary_text})

        # Add full recent messages
        context.extend(self.full_messages)

        return context
Code Fragment 21.3.2: ProgressiveSummarizationMemory: tiered summaries, fact extraction, and context assembly.
Key Insight

The most common mistake in conversation summarization is treating all information equally. User preferences ("I'm vegetarian"), decisions ("Let's go with the blue one"), and unresolved questions ("I still need to figure out the budget") are far more important to preserve than routine pleasantries or repeated information. A good summarization prompt explicitly prioritizes these categories of information.

4. Vector Store Memory

Vector store memory enables semantic retrieval of past conversation content. Rather than relying solely on recency (as the sliding window does), vector search retrieves the most relevant past exchanges based on what the user is currently discussing. This is particularly powerful for long-running relationships where a user might reference something from weeks ago. Code Fragment 21.3.3 below puts this into practice.


# Vector store memory: embed each entry and retrieve by a blend of
# semantic similarity and recency.
import time

from openai import OpenAI
import numpy as np
from dataclasses import dataclass, field
from typing import Optional

client = OpenAI()

@dataclass
class MemoryEntry:
    text: str
    embedding: list[float]
    timestamp: float
    session_id: str
    entry_type: str  # "turn", "summary", "fact"
    metadata: dict = field(default_factory=dict)

class VectorMemoryStore:
    """Semantic memory using embeddings for retrieval."""

    def __init__(self):
        self.entries: list[MemoryEntry] = []

    def store(self, text: str, session_id: str,
              entry_type: str = "turn",
              metadata: Optional[dict] = None) -> None:
        """Embed and store a memory entry."""
        embedding = self._embed(text)
        entry = MemoryEntry(
            text=text,
            embedding=embedding,
            timestamp=time.time(),
            session_id=session_id,
            entry_type=entry_type,
            metadata=metadata or {}
        )
        self.entries.append(entry)

    def retrieve(self, query: str, top_k: int = 5,
                 entry_type: Optional[str] = None,
                 recency_weight: float = 0.1) -> list[dict]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = self._embed(query)

        scored = []
        for entry in self.entries:
            if entry_type and entry.entry_type != entry_type:
                continue

            # Cosine similarity
            similarity = self._cosine_sim(
                query_embedding, entry.embedding
            )

            # Blend similarity with recency
            age_hours = (time.time() - entry.timestamp) / 3600
            recency_score = 1.0 / (1.0 + age_hours * 0.01)
            final_score = (
                (1 - recency_weight) * similarity
                + recency_weight * recency_score
            )

            scored.append({
                "text": entry.text,
                "score": final_score,
                "similarity": similarity,
                "entry_type": entry.entry_type,
                "session_id": entry.session_id,
            })

        scored.sort(key=lambda x: x["score"], reverse=True)
        return scored[:top_k]

    def _embed(self, text: str) -> list[float]:
        """Generate embedding for text."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    @staticmethod
    def _cosine_sim(a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: store and retrieve memories
store = VectorMemoryStore()
store.store(
    "User prefers Python over JavaScript for backend work",
    session_id="session_001", entry_type="fact"
)
store.store(
    "User is building a recipe recommendation app",
    session_id="session_001", entry_type="fact"
)
store.store(
    "Discussed database options: PostgreSQL vs MongoDB. "
    "User leaning toward PostgreSQL for relational data.",
    session_id="session_002", entry_type="summary"
)

# Later, when the user asks about databases again
results = store.retrieve("What database should I use?", top_k=2)
for r in results:
    print(f"[{r['entry_type']}] {r['text'][:80]}... (score: {r['score']:.3f})")
# Example output:
# [summary] Discussed database options: PostgreSQL vs MongoDB. User leaning toward P... (score: 0.847)
# [fact] User is building a recipe recommendation app... (score: 0.612)
Code Fragment 21.3.3: MemoryEntry and VectorMemoryStore: embedding-based storage with similarity-plus-recency retrieval.

5. MemGPT / Letta Architecture

MemGPT (now Letta) introduced a groundbreaking approach to memory management: instead of the application code managing memory, the LLM itself decides when and what to save, retrieve, and forget. This self-managed memory architecture is inspired by operating system virtual memory, where a hierarchical memory system creates the illusion of unlimited memory through intelligent paging between fast (context window) and slow (external storage) tiers. Figure 21.3.3 depicts the MemGPT/Letta architecture with its three memory tiers.

[Diagram] An LLM agent acts as the controller, deciding when to read, write, and search memory across three tiers: the working context (an in-context "scratchpad", editable by the agent; fast but limited capacity), archival memory (an unlimited vector store with semantic search retrieval; slow but persistent), and recall memory (searchable conversation history with paginated access). The agent operates on these tiers through function calls such as core_memory_append(), core_memory_replace(), archival_insert(), archival_search(), and recall_search().
Figure 21.3.3: MemGPT/Letta architecture where the LLM agent manages its own memory through function calls across three tiers: working context, archival memory, and recall memory.
Note: MemGPT in Practice

The MemGPT approach requires the LLM to reliably use memory management functions. In practice, this works best with capable models (GPT-4 class or above) that can reason about when information should be saved for later versus kept in working memory. Smaller models tend to either save too much (filling archival memory with noise) or too little (failing to preserve important context). Careful prompt engineering for the memory management instructions is essential.
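To make the idea concrete, here is a minimal sketch of agent-managed memory: the memory operations are exposed as tools the LLM calls via function calling, and the application only dispatches them. The function names mirror the figure; the dispatcher and in-process storage are simplified assumptions, not the Letta implementation:

```python
# Sketch of MemGPT/Letta-style self-managed memory: the memory operations are
# exposed as tools the LLM calls; the application only dispatches them.
# Simplified stand-in (substring search, in-process lists), not Letta's code.
class AgentMemory:
    def __init__(self):
        self.working_context: list[str] = []  # in-context scratchpad, agent-editable
        self.archival: list[str] = []         # unlimited external store

    def core_memory_append(self, text: str) -> str:
        self.working_context.append(text)
        return "Appended to working context."

    def archival_insert(self, text: str) -> str:
        self.archival.append(text)
        return "Saved to archival memory."

    def archival_search(self, query: str) -> list[str]:
        # Real systems use semantic search; substring match keeps this runnable.
        return [t for t in self.archival if query.lower() in t.lower()]

def dispatch(memory: AgentMemory, tool_name: str, **kwargs):
    """Execute a memory tool call emitted by the LLM via function calling."""
    return getattr(memory, tool_name)(**kwargs)

mem = AgentMemory()
dispatch(mem, "archival_insert", text="User's birthday is March 3")
print(dispatch(mem, "archival_search", query="birthday"))
# ["User's birthday is March 3"]
```

The key inversion is that the application never decides what to store; it just advertises these tools in the function-calling schema and executes whatever calls the model emits.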

6. Session Persistence and User Profiles

For applications where users return across multiple sessions, persistent storage bridges the gap between conversations. A user profile system accumulates knowledge about the user over time, creating an increasingly personalized experience. The profile should capture stable preferences, biographical facts, and interaction patterns without storing sensitive data unnecessarily. Code Fragment 21.3.4 below puts this into practice.


# Persistent user profiles: load/save JSON profiles and use an LLM to
# extract profile updates from completed conversations.
import json
from datetime import datetime
from pathlib import Path

from openai import OpenAI

client = OpenAI()

class UserProfileManager:
    """Manages persistent user profiles across sessions."""

    def __init__(self, storage_dir: str = "./user_profiles"):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(exist_ok=True)

    def load_profile(self, user_id: str) -> dict:
        """Load or create a user profile."""
        profile_path = self.storage_dir / f"{user_id}.json"
        if profile_path.exists():
            with open(profile_path) as f:
                return json.load(f)
        return self._create_default_profile(user_id)

    def save_profile(self, user_id: str, profile: dict) -> None:
        """Persist the user profile to disk."""
        profile["last_updated"] = datetime.now().isoformat()
        profile_path = self.storage_dir / f"{user_id}.json"
        with open(profile_path, "w") as f:
            json.dump(profile, f, indent=2)

    def update_from_conversation(self, user_id: str,
                                 conversation: list[dict]) -> dict:
        """Extract profile updates from a completed conversation."""
        profile = self.load_profile(user_id)

        # Use LLM to extract profile-worthy information
        extraction_prompt = f"""Analyze this conversation and extract any new
information about the user that should be remembered for future sessions.

Current profile:
{json.dumps(profile['preferences'], indent=2)}

Conversation:
{self._format_conversation(conversation)}

Return JSON with three fields:
- "new_preferences": dict of any new preferences discovered
- "new_facts": list of new biographical/contextual facts
- "corrections": dict of any corrections to existing profile data

Only include genuinely new or corrected information."""

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": extraction_prompt}],
            response_format={"type": "json_object"},
            temperature=0
        )

        updates = json.loads(response.choices[0].message.content)

        # Apply updates
        if updates.get("new_preferences"):
            profile["preferences"].update(updates["new_preferences"])
        if updates.get("new_facts"):
            profile["facts"].extend(updates["new_facts"])
        if updates.get("corrections"):
            profile["preferences"].update(updates["corrections"])

        # Update session count
        profile["session_count"] += 1
        profile["last_session"] = datetime.now().isoformat()

        self.save_profile(user_id, profile)
        return profile

    def get_context_string(self, user_id: str) -> str:
        """Generate a context string for inclusion in system prompts."""
        profile = self.load_profile(user_id)

        parts = [f"Returning user (session #{profile['session_count']})."]
        if profile["preferences"]:
            prefs = "; ".join(
                f"{k}: {v}" for k, v in profile["preferences"].items()
            )
            parts.append(f"Known preferences: {prefs}")
        if profile["facts"]:
            parts.append("Known facts: " + "; ".join(profile["facts"][-5:]))

        return " ".join(parts)

    def _create_default_profile(self, user_id: str) -> dict:
        return {
            "user_id": user_id,
            "created": datetime.now().isoformat(),
            "last_updated": datetime.now().isoformat(),
            "last_session": None,
            "session_count": 0,
            "preferences": {},
            "facts": [],
            "interaction_style": {}
        }

    @staticmethod
    def _format_conversation(conversation: list[dict]) -> str:
        return "\n".join(
            f"{m['role'].title()}: {m['content']}"
            for m in conversation
        )
Code Fragment 21.3.4: UserProfileManager: JSON-backed profiles with LLM-driven extraction of preferences, facts, and corrections.
Warning: Privacy Considerations

User profile systems store personal information that may be subject to data protection regulations (GDPR, CCPA). Implement clear data retention policies, give users the ability to view and delete their profiles, minimize the data you store, and ensure appropriate encryption for data at rest. Never store sensitive information (health conditions, financial data, relationship details) without explicit consent and a clear justification.

7. Comparing Memory Approaches

Approach | Capacity | Retrieval | Complexity | Best For
Sliding Window | Fixed (last N turns) | Recency only | Low | Short conversations, simple bots
Summarization | Extended | Most recent summary | Medium | Medium-length sessions
Vector Store | Unlimited | Semantic similarity | Medium-High | Multi-session, topic revisits
Entity Extraction | Compact facts | Key-value lookup | Medium | User profiles, preferences
MemGPT / Letta | Unlimited + managed | Agent-driven search | High | Complex, long-running agents
Hybrid (recommended) | Tiered | Recency + semantic | High | Production applications

8. Memory-as-a-Service Platforms

Building a production-grade memory system from scratch, as the code examples above demonstrate, requires significant engineering effort: embedding pipelines, vector stores, summarization logic, conflict resolution, and persistence layers. A growing category of "Memory-as-a-Service" (MaaS) platforms packages these capabilities into managed services, allowing developers to add persistent, intelligent memory to their applications with a few API calls instead of months of custom development.

Why does this shift matter? Just as managed vector databases (Pinecone, Weaviate) replaced DIY FAISS deployments for many teams, managed memory services are replacing DIY memory architectures for conversational AI. The platforms handle the hard engineering problems (deduplication, conflict resolution, importance scoring, forgetting) so that application developers can focus on the conversation experience.

8.1 Platform Comparison

Mem0
  Architecture: Graph + vector hybrid memory layer
  Key features: Automatic memory extraction from conversations; user, session, and agent-level memories; graph-based relationships between memories; simple add/search API
  Best for: Applications needing personalization across sessions with minimal setup; teams that want "drop-in" memory

Zep
  Architecture: Temporal knowledge graph + vector store
  Key features: Automatic entity extraction and relationship tracking; temporal awareness (memories have timestamps and decay); built-in summarization; dialog classification; integrates with LangChain and LlamaIndex
  Best for: Enterprise applications needing structured entity memory with temporal reasoning; compliance-sensitive use cases

MemGPT / Letta
  Architecture: Agent-managed tiered memory (see Section 5 above)
  Key features: LLM-controlled memory management; three memory tiers (working, archival, recall); the agent decides when and what to remember; stateful agent sessions; open-source core
  Best for: Agentic applications where the AI needs to autonomously manage its own memory; complex, long-running assistants
# Mem0: drop-in memory for any LLM application
# pip install mem0ai
from mem0 import Memory

# Initialize with default configuration
memory = Memory()

# Add memories from a conversation
conversation = [
    {"role": "user", "content": "I'm a vegetarian and I love Italian food"},
    {"role": "assistant", "content": "Great! I can suggest some vegetarian Italian dishes."},
    {"role": "user", "content": "I also have a gluten allergy"}
]

# Mem0 automatically extracts and stores relevant memories
memory.add(conversation, user_id="alice_123")

# Later, retrieve relevant memories for a new query
results = memory.search("What should I cook for dinner?", user_id="alice_123")
for r in results:
    print(f"Memory: {r['memory']} (relevance: {r['score']:.3f})")
# Output:
# Memory: User is vegetarian (relevance: 0.891)
# Memory: User loves Italian food (relevance: 0.847)
# Memory: User has a gluten allergy (relevance: 0.823)

# Zep: entity-aware temporal memory
# pip install zep-cloud
from zep_cloud.client import Zep

zep = Zep(api_key="your-api-key")

# Add a session with messages
session = zep.memory.add_session(session_id="session_001", user_id="alice_123")
zep.memory.add(
    session_id="session_001",
    messages=[
        {"role": "user", "content": "My doctor recommended I eat more iron-rich foods"},
        {"role": "assistant", "content": "Spinach and lentils are great vegetarian sources of iron."}
    ]
)

# Zep automatically extracts entities and relationships:
# Entity: alice_123 -> has_condition: needs more iron
# Entity: alice_123 -> dietary_preference: vegetarian
# These are queryable and temporally aware
Code Fragment 21.3.5: Memory-as-a-Service in practice: Mem0's automatic extraction and Zep's entity-aware sessions.
Key Insight

The choice between DIY memory and a managed platform depends on your control requirements. DIY (using the patterns from Sections 2 through 6) gives you full control over what is stored, how it is retrieved, and how it decays. Managed platforms trade control for speed of implementation and battle-tested edge case handling. For most production applications that need cross-session memory, starting with a managed platform and migrating to custom only if needed is the pragmatic path. The agent memory architectures in Section 22.1 extend these patterns to agentic use cases.

9. Memory Consolidation Patterns

Raw memory accumulation is not enough. Over time, a memory system that only adds and never consolidates will drown in redundant, contradictory, and stale information. Memory consolidation, inspired by how the human brain processes memories during sleep, periodically reviews, merges, and prunes stored memories to maintain a coherent and useful knowledge base.

Why does consolidation matter? Without it, a user who mentions "I like coffee" in 50 different conversations will have 50 nearly identical memory entries. A user who says "I prefer PostgreSQL" in January and "I switched to MongoDB" in March will have contradictory memories with no resolution. A system that remembers everything equally cannot distinguish a passing preference from a deeply held value.

9.1 Importance Scoring

Not all memories are equally valuable. Importance scoring assigns a weight to each memory based on factors like: how many times the information has been referenced, whether the user explicitly stated it versus it being inferred, the specificity of the information (a specific dietary restriction matters more than a general comment about food), and temporal relevance (recent preferences may override old ones). An LLM call can assess importance, or heuristic rules can provide a cheaper approximation.
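A cheap heuristic version of this scoring might look like the following; the factor weights and field names are illustrative assumptions, not tuned values:

```python
def score_importance(memory: dict) -> float:
    """Heuristic importance score; all weights are illustrative assumptions."""
    score = 1.0
    # Reinforcement: each prior reference adds a small boost, capped at 10
    score += 0.1 * min(memory.get("reference_count", 0), 10)
    # Explicit statements outweigh inferences
    if memory.get("source") == "explicit":
        score += 0.5
    # Specific facts ("gluten allergy") outweigh vague ones ("likes food")
    if memory.get("specific", False):
        score += 0.5
    # Temporal decay: halve the score every 90 days
    score *= 0.5 ** (memory.get("age_days", 0) / 90)
    return round(score, 3)

print(score_importance({"source": "explicit", "specific": True,
                        "reference_count": 3, "age_days": 0}))   # 2.3
print(score_importance({"source": "inferred", "age_days": 90}))  # 0.5
```

Heuristics like this are cheap enough to run on every memory at every consolidation pass; an LLM-based score can then be reserved for the ambiguous middle of the distribution.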

9.2 Periodic Consolidation

Consolidation runs on a schedule (after every N conversations, or nightly for active users) and performs three operations: merge duplicate or near-duplicate memories into a single canonical entry, resolve conflicts between contradictory memories by keeping the more recent or more frequently reinforced version, and compress verbose memories into their essential content. This is analogous to the progressive summarization from Section 3, but applied to the memory store rather than the conversation history.

9.3 Forgetting and Decay

Deliberate forgetting is a feature, not a bug. Memories that have not been accessed or reinforced over a long period should decay in importance and eventually be archived or deleted. Ebbinghaus-inspired forgetting curves (where memory strength decays exponentially with time but resets on each retrieval) provide a principled model for this. The MemoryBank system (Zhong et al., 2024, cited in the bibliography) implements this approach with tunable decay rates.

# Memory consolidation pipeline
from openai import OpenAI
from datetime import datetime

client = OpenAI()

def consolidate_memories(
    memories: list[dict],
    current_date: datetime | None = None,
) -> list[dict]:
    """
    Consolidate a list of memories by merging duplicates,
    resolving conflicts, and applying decay.
    """
    current_date = current_date or datetime.now()

    # Step 1: Score importance
    for mem in memories:
        age_days = (current_date - mem["created"]).days
        access_count = mem.get("access_count", 1)

        # Decay: halve importance every 30 days without access
        decay = 0.5 ** (age_days / 30)
        # Boost for frequently accessed memories
        frequency_boost = min(2.0, 1.0 + 0.1 * access_count)
        # Explicit statements are more important than inferences
        source_weight = 1.5 if mem.get("source") == "explicit" else 1.0

        mem["importance"] = decay * frequency_boost * source_weight

    # Step 2: Merge duplicates and resolve conflicts using LLM
    memory_texts = "\n".join(
        f"[{m['importance']:.2f}] ({m['created'].strftime('%Y-%m-%d')}) {m['text']}"
        for m in sorted(memories, key=lambda x: -x["importance"])
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Review these user memories. For each group of related memories:\n"
                "1. Merge duplicates into one canonical entry\n"
                "2. When memories conflict, keep the more recent version\n"
                "3. Drop memories with importance below 0.1\n"
                "4. Preserve the most specific version of each fact\n\n"
                f"Memories:\n{memory_texts}\n\n"
                "Return the consolidated list as one memory per line."
            )
        }],
        temperature=0,
    )

    consolidated_texts = response.choices[0].message.content.strip().split("\n")

    return [
        {"text": t.strip(), "created": current_date, "access_count": 0}
        for t in consolidated_texts if t.strip()
    ]

# Example: 3 memories about database preference over time
memories = [
    {"text": "User prefers PostgreSQL for databases",
     "created": datetime(2024, 1, 15), "access_count": 3, "source": "explicit"},
    {"text": "User mentioned liking PostgreSQL",
     "created": datetime(2024, 2, 1), "access_count": 1, "source": "inferred"},
    {"text": "User switched to MongoDB for their new project",
     "created": datetime(2024, 6, 10), "access_count": 2, "source": "explicit"},
]

consolidated = consolidate_memories(memories)
# Result: single entry reflecting the most recent preference
Code Fragment 21.3.6: Memory consolidation pipeline
Warning

Memory consolidation that aggressively prunes or overwrites can lose information the user considers important. Always err on the side of keeping too much rather than too little, and provide users with the ability to pin important memories so they are never subject to decay. When resolving conflicts, the consolidation system should prefer explicit statements over inferences and more recent information over older information, but it should also consider whether the older information might still be valid in a different context.
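The pinning safeguard described above can be enforced with a small guard in the importance-scoring step. The `pinned` flag and `effective_importance` helper below are hypothetical names; your storage layer would need to persist the flag alongside each memory.

```python
def effective_importance(mem: dict, computed_importance: float) -> float:
    """Pinned memories bypass decay entirely; everything else keeps
    its computed score. (Sketch: 'pinned' is a hypothetical flag set
    when the user marks a memory as important.)"""
    if mem.get("pinned"):
        return 1.0          # always above any pruning threshold
    return computed_importance

# A pinned allergy note survives consolidation no matter how old it is
mem = {"text": "User is allergic to penicillin", "pinned": True}
print(effective_importance(mem, 0.02))  # 1.0
```

Calling this after the decay computation (and before the pruning threshold is applied) guarantees pinned entries are never dropped, at the cost of a slowly growing pinned set that the user must manage.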

10. Evaluating Memory Quality

How do you know whether your memory system is actually working? Retrieval accuracy (did we find the right memory?) is necessary but not sufficient. A memory system must also return information that is timely (not stale), relevant (useful for the current context), and precise (not cluttered with tangential entries). Several benchmarks and metrics have emerged to evaluate these dimensions.

Memory and Long-Context Benchmarks

LongBench (Bai et al., 2024) evaluates LLMs across six long-context task categories, including multi-document QA, summarization, and code completion, with input lengths ranging from 4K to 20K+ tokens. It tests whether models can locate and use information buried deep in long inputs. InfiniteBench (Zhang et al., 2024) pushes further, testing contexts beyond 100K tokens with tasks that require reasoning over extremely distant information. MemBench focuses specifically on conversational memory, evaluating whether systems can recall user-stated facts, preferences, and prior decisions across multi-session dialogues. Together, these benchmarks reveal that raw context length is not the same as usable memory; a model may support 128K tokens but still fail to retrieve a fact stated at token position 20K.

Memory Quality Metrics

Beyond benchmark scores, production memory systems should track four operational metrics. Memory precision measures the fraction of retrieved memories that are actually relevant to the current query (high precision means few irrelevant entries clutter the context). Memory recall measures the fraction of relevant memories that are successfully retrieved (high recall means important information is not missed). Staleness tracks how often the system surfaces outdated information that has been superseded by newer data, such as returning an old address after the user has provided an updated one. Relevance decay measures how retrieval quality degrades as conversations grow longer and the memory store accumulates more entries. Monitoring these metrics over time reveals whether your memory system is improving or degrading as usage scales.
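The first three metrics can be computed per query from simple set operations (relevance decay, by contrast, is a trend tracked across many queries over time). The sketch below assumes you have labeled which stored memories are relevant to a query and which are known to be superseded; `memory_metrics` is an illustrative helper, not a standard API.

```python
def memory_metrics(retrieved: list[str], relevant: set[str],
                   superseded: set[str]) -> dict:
    """Per-query precision, recall, and staleness for a memory system.
    `relevant` is the labeled ground-truth set for this query;
    `superseded` holds memory texts known to be outdated."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant
    return {
        "precision": len(hits) / len(retrieved_set) if retrieved_set else 0.0,
        "recall": len(hits) / len(relevant) if relevant else 1.0,
        "staleness": len(retrieved_set & superseded) / len(retrieved_set)
                     if retrieved_set else 0.0,
    }

m = memory_metrics(
    retrieved=["works at Netflix", "old address: NYC", "likes PyTorch"],
    relevant={"works at Netflix", "likes PyTorch", "prefers Python 3.11"},
    superseded={"old address: NYC"},
)
# precision = 2/3, recall = 2/3, staleness = 1/3
```

Averaging these over a held-out query set gives the dashboard numbers to monitor as the memory store grows.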

Note

Evaluating memory quality is inherently harder than evaluating retrieval accuracy because memory has a temporal dimension. A memory that was correct last week may be wrong today. Benchmark suites like MemBench include "preference update" test cases where the user explicitly changes a previously stated fact, testing whether the system correctly surfaces the updated version. The evaluation framework from Section 29.1 provides general-purpose metrics that complement the memory-specific measures described here.

Self-Check
Q1: Why is a simple "last N messages" buffer insufficient for most production chatbots?
Show Answer
A simple buffer treats all messages equally and uses a message count rather than a token count, which means it may either waste context space (short messages) or exceed it (long messages). More critically, it offers no way to preserve important information from earlier in the conversation once it falls outside the window. Important details like user preferences, prior decisions, or unresolved questions are lost. Token-aware sliding windows with summarization or vector retrieval address these limitations.
Q2: How does progressive summarization differ from a single summary of the entire conversation?
Show Answer
Progressive summarization maintains multiple levels of compression: recent messages in full fidelity, slightly older messages summarized into paragraphs, and much older content compressed into key facts. A single summary of the entire conversation would lose the detail gradient, treating recent context with the same compression as ancient history. Progressive summarization preserves high resolution where it matters most (the recent past) while still retaining the essence of earlier exchanges.
Q3: What is the key innovation of the MemGPT/Letta architecture?
Show Answer
MemGPT/Letta gives the LLM agent itself the ability to manage its own memory through function calls. Instead of application code deciding what to save or retrieve, the agent decides when to write information to archival memory, search for past context, update its working memory, or page through conversation history. This is inspired by operating system virtual memory, creating the illusion of unlimited memory through intelligent paging between fast (context window) and slow (external storage) tiers.
Q4: Why should vector store retrieval include a recency bias?
Show Answer
Pure semantic similarity can retrieve very old memories that, while topically relevant, may be outdated or superseded by more recent information. For example, a user might have changed their preference from PostgreSQL to MongoDB in a recent conversation, but a pure similarity search for "database" would return both old and new preferences equally. A recency bias ensures that more recent memories receive a score boost, so up-to-date information is preferred when multiple relevant memories exist.
Q5: What privacy considerations apply to user profile systems in conversational AI?
Show Answer
User profile systems must comply with data protection regulations (GDPR, CCPA) by implementing clear data retention policies, providing users the ability to view and delete their profiles, minimizing stored data to what is necessary, encrypting data at rest, and obtaining explicit consent before storing sensitive information. Developers should distinguish between information the user has explicitly shared versus information inferred from conversation patterns, and be especially cautious with health, financial, and relationship data.
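The recency bias discussed in the Q4 answer is typically a weighted blend of similarity and an exponential freshness term. A minimal sketch, with illustrative (untuned) weights:

```python
from datetime import datetime

def recency_biased_score(similarity: float, created: datetime,
                         now: datetime, half_life_days: float = 30.0,
                         recency_weight: float = 0.3) -> float:
    """Blend cosine similarity with an exponential recency term so
    newer memories win near-ties. The 30-day half-life and 0.3 weight
    are illustrative defaults, not tuned values."""
    age_days = (now - created).total_seconds() / 86400
    recency = 0.5 ** (age_days / half_life_days)
    return (1 - recency_weight) * similarity + recency_weight * recency

now = datetime(2024, 7, 1)
old = recency_biased_score(0.82, datetime(2024, 1, 15), now)  # old PostgreSQL memory
new = recency_biased_score(0.80, datetime(2024, 6, 10), now)  # recent MongoDB memory
print(old < new)  # True: the fresher memory outranks a slightly more similar old one
```

Replacing the raw cosine score with this blended score in a vector-memory `search` method is a one-line change; the half-life controls how aggressively the system prefers fresh information.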
Tip: Implement Graceful Fallbacks

When the model is uncertain or the query is out of scope, have it say so rather than hallucinate. Train your system to recognize low-confidence responses and route to human agents, FAQ pages, or clarification requests.

Key Takeaways
Real-World Scenario: Implementing Conversation Memory for a Financial Advisory Chatbot

Who: An ML engineer at a wealth management fintech serving 50,000 clients

Situation: Clients expected the chatbot to remember their portfolio preferences, risk tolerance, and prior conversations across sessions. A client who said "I told you last month I want to avoid tech stocks" expected that preference to persist.

Problem: Storing full conversation histories consumed the entire 128K context window within 3 to 4 sessions. Truncating older messages caused the bot to "forget" critical preferences and repeat questions, frustrating high-value clients.

Dilemma: Summarizing old conversations compressed them effectively but lost specific details (exact allocation percentages, named stocks). A vector-based retrieval memory preserved details but sometimes surfaced irrelevant old context that confused the current conversation.

Decision: They implemented a three-tier memory system: (1) a structured client profile storing key facts as explicit key-value pairs (risk tolerance: moderate, sector exclusions: [tech, tobacco]), (2) a rolling summary of the last 5 sessions, and (3) a vector store of all conversation turns for on-demand retrieval when the client referenced a specific past discussion.

How: After each session, an LLM extraction step updated the structured profile with any new preferences. The system prompt always included the profile and recent summary; vector retrieval was triggered only when the user explicitly referenced past conversations.

Result: Client satisfaction scores rose from 3.6 to 4.4 out of 5. The "repeated question" complaint rate dropped from 23% to 3%. Context window usage stayed under 40K tokens even for clients with 50+ sessions.

Lesson: Tiered memory (structured facts, summaries, and searchable archives) outperforms any single strategy because different types of information have different retrieval patterns and retention requirements.
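The structured-profile tier in this scenario reduces to a merge rule: newer scalar preferences overwrite old values, while list-valued fields (like sector exclusions) accumulate without duplicates. A pure-function sketch, with hypothetical field names rather than a fixed schema:

```python
def update_profile(profile: dict, updates: dict) -> dict:
    """Merge newly extracted preferences into a structured client profile.
    Scalar fields (e.g. risk_tolerance) are overwritten by newer values;
    list fields (e.g. sector_exclusions) grow without duplicates.
    Field names are illustrative."""
    merged = dict(profile)
    for key, value in updates.items():
        if isinstance(value, list):
            existing = merged.get(key, [])
            merged[key] = existing + [v for v in value if v not in existing]
        else:
            merged[key] = value
    return merged

profile = {"risk_tolerance": "moderate", "sector_exclusions": ["tech", "tobacco"]}
updated = update_profile(profile, {"risk_tolerance": "conservative",
                                   "sector_exclusions": ["crypto"]})
# updated == {"risk_tolerance": "conservative",
#             "sector_exclusions": ["tech", "tobacco", "crypto"]}
```

In the scenario's pipeline, the `updates` dict would come from the post-session LLM extraction step; keeping the merge logic as a pure function makes it easy to test and audit.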

Lab: Build a Conversational Agent with Layered Memory

Duration: ~75 minutes · Level: Intermediate

Objective

Build a chatbot with a three-tier memory system: a short-term sliding window buffer, an LLM-powered conversation summarizer, and a vector-based long-term memory store for fact retrieval across sessions.

What You'll Practice

  • Implementing a sliding window buffer for short-term conversation memory
  • Building an LLM-powered progressive conversation summarizer
  • Creating a vector-based long-term memory for user fact extraction and retrieval
  • Combining memory layers into a unified context for the LLM

Setup

The following cell installs the required packages and configures the environment for this lab.

pip install openai sentence-transformers numpy
Code Fragment 21.3.7: This command installs openai, sentence-transformers, and numpy for the multi-layer memory chatbot lab. These packages provide the LLM API for summarization and chat, embedding models for long-term memory search, and numerical similarity computation.

Steps

Step 1: Build the short-term memory buffer

Create a sliding window that keeps the most recent N messages.

class ShortTermMemory:
    def __init__(self, max_turns=10):
        self.messages = []
        self.max_turns = max_turns
        self.overflow = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Evict oldest messages when buffer is full
        while len(self.messages) > self.max_turns:
            self.overflow.append(self.messages.pop(0))

    def get_messages(self):
        return list(self.messages)

    def get_overflow(self):
        """Return and clear evicted messages for summarization."""
        evicted = list(self.overflow)
        self.overflow.clear()
        return evicted

# Test
stm = ShortTermMemory(max_turns=4)
stm.add("user", "Hi, my name is Alice")
stm.add("assistant", "Hello Alice!")
stm.add("user", "I work at Google as a ML engineer")
stm.add("assistant", "That sounds exciting!")
stm.add("user", "I'm interested in RAG systems")

print(f"Buffer: {len(stm.get_messages())} messages")
print(f"Overflow: {len(stm.overflow)} evicted")
for m in stm.get_messages():
    print(f"  [{m['role']}] {m['content']}")
Code Fragment 21.3.8: Defining ShortTermMemory
Hint

The overflow list collects messages that have been evicted from the buffer. These messages should be summarized before they are permanently discarded. The get_overflow() method returns and clears this list.

Step 2: Build the conversation summarizer

Create a component that progressively summarizes old conversation turns.

from openai import OpenAI

client = OpenAI()

class Summarizer:
    def __init__(self):
        self.running_summary = ""

    def summarize(self, messages):
        if not messages:
            return self.running_summary

        text = "\n".join(f"{m['role'].title()}: {m['content']}"
                         for m in messages)

        prompt = (
            f"Current summary:\n"
            f"{self.running_summary or '(none yet)'}\n\n"
            f"New turns to incorporate:\n{text}\n\n"
            f"Write an updated summary capturing all key facts and "
            f"user preferences. Keep it to 2 to 4 sentences."
        )

        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3, max_tokens=200)
        self.running_summary = resp.choices[0].message.content
        return self.running_summary

# Test
summarizer = Summarizer()
summary = summarizer.summarize([
    {"role": "user", "content": "My name is Alice, I work at Google"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "I need help building a RAG system"},
])
print(f"Summary: {summary}")
Summary: Alice works at Google and is looking for help building a RAG (Retrieval-Augmented Generation) system.
Code Fragment 21.3.9: Defining Summarizer
Hint

Progressive summarization is key: each time new messages overflow, incorporate them into the existing summary rather than re-summarizing everything. This keeps cost constant regardless of conversation length.

Step 3: Build the long-term vector memory

Create a searchable store that extracts and indexes key facts.

from sentence_transformers import SentenceTransformer
import numpy as np

class LongTermMemory:
    def __init__(self):
        self.facts = []
        self.embeddings = None
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def extract_and_store(self, messages):
        text = "\n".join(f"{m['role'].title()}: {m['content']}"
                         for m in messages)
        # Use the LLM to extract factual statements about the user
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Extract key facts about the user from this "
                                  f"conversation. One fact per line.\n\n{text}"}],
            temperature=0.3, max_tokens=200)
        new_facts = [f.strip().lstrip("- ")
                     for f in resp.choices[0].message.content.strip().split("\n")
                     if f.strip()]
        if new_facts:
            self.facts.extend(new_facts)
            self.embeddings = self.model.encode(self.facts)
        return new_facts

    def search(self, query, top_k=3):
        if not self.facts or self.embeddings is None:
            return []
        qe = self.model.encode(query)
        scores = np.dot(self.embeddings, qe) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(qe))
        idx = np.argsort(scores)[::-1][:top_k]
        return [(self.facts[i], scores[i]) for i in idx if scores[i] > 0.3]

# Test
ltm = LongTermMemory()
facts = ltm.extract_and_store([
    {"role": "user", "content": "I'm Alice, ML engineer at Google"},
    {"role": "user", "content": "I prefer PyTorch over TensorFlow"},
])
print(f"Extracted: {facts}")
print(f"Search 'employer': {ltm.search('Where does Alice work?')}")
Extracted: ['Alice is an ML engineer', 'Alice works at Google', 'Alice prefers PyTorch over TensorFlow']
Search 'employer': [('Alice works at Google', 0.7823), ('Alice is an ML engineer', 0.5214)]
Code Fragment 21.3.10: Defining LongTermMemory
Hint

The extraction prompt should target specific, factual statements like "Alice works at Google" rather than impressions. The 0.3 similarity threshold filters out irrelevant results during search.

Step 4: Wire everything into a memory-augmented chatbot

Combine all three memory layers into a working conversational agent.

class MemoryChat:
    def __init__(self):
        self.stm = ShortTermMemory(max_turns=6)
        self.summarizer = Summarizer()
        self.ltm = LongTermMemory()

    def chat(self, user_message):
        # Search long-term memory for relevant facts
        relevant = self.ltm.search(user_message)
        facts_text = "\n".join(f"- {f}" for f, _ in relevant) or "None yet."

        # Build context-enriched system prompt
        sys = (f"You are a helpful assistant with memory.\n"
               f"Summary: {self.summarizer.running_summary or 'New conversation.'}\n"
               f"Relevant facts:\n{facts_text}")

        # Assemble message list
        msgs = [{"role": "system", "content": sys}]
        msgs.extend(self.stm.get_messages())
        msgs.append({"role": "user", "content": user_message})

        # Generate response
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=msgs,
            temperature=0.7, max_tokens=300)
        reply = resp.choices[0].message.content

        # Update memories
        self.stm.add("user", user_message)
        self.stm.add("assistant", reply)

        # Process overflow
        overflow = self.stm.get_overflow()
        if overflow:
            self.summarizer.summarize(overflow)
            self.ltm.extract_and_store(overflow)

        return reply

# Run a multi-turn conversation
bot = MemoryChat()
conversation = [
    "Hi! I'm Bob, a data scientist at Netflix.",
    "I'm building a recommendation system using collaborative filtering.",
    "We have about 50 million user interaction records.",
    "Considering switching from TensorFlow to PyTorch.",
    "I prefer Python 3.11 for stability.",
    "Can you remind me what I said I was working on?",
    "What company do I work at?",
]

for msg in conversation:
    print(f"\nUser: {msg}")
    reply = bot.chat(msg)
    print(f"Bot: {reply}")
    print(f"  [STM: {len(bot.stm.get_messages())} | "
          f"Summary: {len(bot.summarizer.running_summary)} chars | "
          f"LTM: {len(bot.ltm.facts)} facts]")
User: Hi! I'm Bob, a data scientist at Netflix.
Bot: Hello Bob! Great to meet you. What can I help you with today?
  [STM: 2 | Summary: 0 chars | LTM: 0 facts]

User: I'm building a recommendation system using collaborative filtering.
Bot: That sounds like a great project! Collaborative filtering is...
  [STM: 4 | Summary: 0 chars | LTM: 0 facts]

...

User: Can you remind me what I said I was working on?
Bot: You mentioned you're building a recommendation system using collaborative filtering at Netflix, working with about 50 million user interaction records.
  [STM: 6 | Summary: 124 chars | LTM: 4 facts]

User: What company do I work at?
Bot: You work at Netflix!
  [STM: 6 | Summary: 124 chars | LTM: 4 facts]
Code Fragment 21.3.11: Defining MemoryChat
Hint

The last two questions test memory recall. "What am I working on?" should be answered from the summary (if the message was already evicted from the buffer). "What company?" should come from long-term memory fact retrieval.

Expected Output

  • The chatbot maintains coherent conversation across all turns
  • "What company do I work at?" correctly retrieves "Netflix" from long-term memory
  • The running summary progressively captures key user facts
  • Long-term memory stores specific facts like "Bob is a data scientist" and "Bob works at Netflix"

Stretch Goals

  • Add session persistence: save summary and facts to JSON for cross-restart memory
  • Implement a forgetting mechanism that deprioritizes old facts by access recency
  • Add episodic memory: store complete conversation episodes retrievable by date or topic
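The first stretch goal (session persistence) can be sketched with plain JSON serialization. `save_session` and `load_session` are hypothetical helper names; they persist the running summary and extracted facts so a restarted process can restore long-term memory.

```python
import json
from pathlib import Path

def save_session(path: str, summary: str, facts: list[str]) -> None:
    """Persist the running summary and extracted facts for later restarts."""
    Path(path).write_text(json.dumps({"summary": summary, "facts": facts}))

def load_session(path: str) -> tuple[str, list[str]]:
    """Restore saved memory, or start fresh if no session file exists."""
    p = Path(path)
    if not p.exists():
        return "", []
    data = json.loads(p.read_text())
    return data["summary"], data["facts"]

# On shutdown: save_session("bob.json", bot.summarizer.running_summary, bot.ltm.facts)
# On startup:  summary, facts = load_session("bob.json")
```

One caveat: embeddings are not saved, so after loading facts you must re-encode them (one `model.encode(facts)` call) before `search` will work again.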
Complete Solution
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

client = OpenAI()

class ShortTermMemory:
    def __init__(self, max_turns=6):
        self.messages, self.max_turns, self.overflow = [], max_turns, []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        while len(self.messages) > self.max_turns:
            self.overflow.append(self.messages.pop(0))

    def get_messages(self):
        return list(self.messages)

    def get_overflow(self):
        evicted = list(self.overflow)
        self.overflow.clear()
        return evicted

class Summarizer:
    def __init__(self):
        self.running_summary = ""

    def summarize(self, msgs):
        if not msgs:
            return self.running_summary
        t = "\n".join(f"{m['role']}: {m['content']}" for m in msgs)
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                f"Summary:\n{self.running_summary or '(none)'}\n\n"
                f"New:\n{t}\n\nUpdate (2-4 sentences):"}],
            temperature=0.3, max_tokens=200)
        self.running_summary = r.choices[0].message.content
        return self.running_summary

class LongTermMemory:
    def __init__(self):
        self.facts, self.embeddings = [], None
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def extract_and_store(self, msgs):
        t = "\n".join(f"{m['role']}: {m['content']}" for m in msgs)
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Extract user facts (one per line):\n{t}"}],
            temperature=0.3, max_tokens=200)
        nf = [f.strip().lstrip("- ")
              for f in r.choices[0].message.content.strip().split("\n")
              if f.strip()]
        if nf:
            self.facts.extend(nf)
            self.embeddings = self.model.encode(self.facts)
        return nf

    def search(self, q, k=3):
        if not self.facts:
            return []
        qe = self.model.encode(q)
        s = np.dot(self.embeddings, qe) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(qe))
        idx = np.argsort(s)[::-1][:k]
        return [(self.facts[i], s[i]) for i in idx if s[i] > 0.3]

class MemoryChat:
    def __init__(self):
        self.stm, self.sum, self.ltm = ShortTermMemory(6), Summarizer(), LongTermMemory()

    def chat(self, msg):
        rel = self.ltm.search(msg)
        facts = "\n".join(f"- {f}" for f, _ in rel) or "None"
        sys = (f"Helpful assistant with memory.\n"
               f"Summary: {self.sum.running_summary or 'New.'}\nFacts:\n{facts}")
        ms = ([{"role": "system", "content": sys}]
              + self.stm.get_messages()
              + [{"role": "user", "content": msg}])
        r = client.chat.completions.create(
            model="gpt-4o-mini", messages=ms, temperature=0.7, max_tokens=300)
        reply = r.choices[0].message.content
        self.stm.add("user", msg)
        self.stm.add("assistant", reply)
        ov = self.stm.get_overflow()
        if ov:
            self.sum.summarize(ov)
            self.ltm.extract_and_store(ov)
        return reply

bot = MemoryChat()
for m in ["Hi! I'm Bob, data scientist at Netflix.",
          "Building a rec system with collab filtering.",
          "50M interaction records.",
          "Switching TF to PyTorch.",
          "Prefer Python 3.11.",
          "What am I working on?",
          "What company do I work at?"]:
    print(f"\nUser: {m}\nBot: {bot.chat(m)}")
User: Hi! I'm Bob, data scientist at Netflix.
Bot: Hello Bob! Nice to meet you.

User: Building a rec system with collab filtering.
Bot: Collaborative filtering is a great approach for recommendations...

...

User: What am I working on?
Bot: You're building a recommendation system using collaborative filtering with about 50 million interaction records.

User: What company do I work at?
Bot: You work at Netflix!
Code Fragment 21.3.12: Complete solution: ShortTermMemory, Summarizer, LongTermMemory, and MemoryChat
Research Frontier

Retrieval-augmented memory stores conversation history in a vector database and retrieves relevant past exchanges based on the current query, enabling effectively unlimited conversation length. Hierarchical memory architectures maintain multiple memory tiers (working memory, episodic memory, semantic memory) inspired by cognitive science, with different retention and retrieval policies for each tier. Memory compression with LLMs uses a smaller model to continuously summarize and consolidate conversation history, keeping the most important information within the context window. Research into shared memory across conversations is developing methods for agents to accumulate knowledge about users across sessions without violating privacy constraints.

Exercises

These exercises cover memory architectures for conversational AI. The memory patterns here also apply to agent memory (Section 22.1).

Exercise 21.3.1: Memory types Conceptual

Explain the difference between short-term memory, long-term memory, and session persistence in a conversational system. Give an example of information stored in each.

Show Answer

Short-term: recent conversation turns, kept in the context window (e.g., "I just asked about pricing"). Long-term: persistent knowledge from past conversations (e.g., "user prefers dark mode"). Session persistence: state that survives between sessions (e.g., "user's order ID from yesterday").

Exercise 21.3.2: Sliding window vs. summarization Conceptual

Compare a sliding window (keep last N messages) with a summarization approach (compress old messages). What are the tradeoffs in terms of information loss, cost, and latency?

Show Answer

Sliding window: no compute cost, no information distortion, but loses all information beyond the window. Summarization: retains key information from older turns, but adds LLM call latency and cost, and summaries may lose nuance or introduce errors.

Exercise 21.3.3: Vector store memory Conceptual

Explain how embedding-based memory retrieval works. A user mentioned their dog's name 50 messages ago. How would vector store memory retrieve this when the user says "How is Buddy doing?"

Show Answer

Each message is embedded and stored with its text. When the user says "How is Buddy doing?", the embedding of this query is similar to the embedding of the earlier message "My dog Buddy loves walks in the park." Vector search retrieves the relevant past message even though it was far back in the conversation.

Exercise 21.3.4: MemGPT/Letta Conceptual

Describe the three memory tiers in the MemGPT architecture. How does the LLM decide when to move information between tiers?

Show Answer

Working context (in-context, small, actively used), archival memory (persistent vector store, large, searchable), recall memory (conversation history, searchable by time). The LLM uses function calls (memory_write, memory_search, memory_update) to manage tiers, triggered by its own assessment of what information to retain or retrieve.

Exercise 21.3.5: Memory consolidation Conceptual

What is memory consolidation, and why is it necessary for long-running conversations? How does it differ from simple summarization? How do these patterns extend to agent memory in Chapter 22?

Show Answer

Memory consolidation merges, deduplicates, and resolves conflicts across multiple memory entries. Unlike summarization (which compresses one conversation), consolidation operates across sessions, updating facts (e.g., "user moved from NYC to LA" should replace, not coexist with, the old location). Agent memory systems in Chapter 22 apply the same tiered patterns but add tool-use context and task completion tracking.

Exercise 21.3.6: Sliding window Coding

Implement a conversation manager that maintains a sliding window of the last 10 messages. When the window overflows, summarize the oldest 5 messages using an LLM and prepend the summary.

Exercise 21.3.7: Vector memory Coding

Build a memory system that embeds each user message and stores it in a vector database. On each new message, retrieve the 3 most relevant past messages and include them in the context.

Exercise 21.3.8: User profile extraction Coding

Create a system that extracts user preferences (name, interests, preferences) from conversation turns and maintains a structured user profile that persists across sessions.

Exercise 21.3.9: Memory evaluation Coding

Build a test harness that measures memory quality: insert 20 specific facts across a 50-turn conversation, then ask 20 recall questions. Measure what percentage of facts the system correctly remembers. Compare sliding window, summarization, and vector memory approaches.

What Comes Next

In the next section, Section 21.4: Multi-Turn Dialogue & Conversation Flows, we examine multi-turn dialogue and conversation flow management, handling complex interactions with branching logic.

References & Further Reading

Packer, C. et al. (2023). "MemGPT: Towards LLMs as Operating Systems." arXiv preprint.

Proposes treating LLMs like operating systems with tiered memory management. Introduces virtual context management for handling conversations beyond context limits. Foundational for long-context dialogue systems.

Paper

Park, J.S. et al. (2023). "Generative Agents: Interactive Simulacra of Human Behavior." UIST 2023.

Creates believable agents with memory retrieval, reflection, and planning capabilities. Demonstrates how memory architectures enable emergent social behaviors. Influential for anyone building agents with persistent memory.

Paper

Zhong, W. et al. (2024). "MemoryBank: Enhancing Large Language Models with Long-Term Memory." AAAI 2024.

Introduces a memory bank mechanism that stores and retrieves past interactions adaptively. Includes forgetting curves inspired by cognitive science. Practical architecture for long-term conversational memory.

Paper

Maharana, A. et al. (2024). "Evaluating Very Long-Term Conversational Memory of LLM Agents." ACL 2024.

Provides benchmarks and evaluation methods for long-term conversational memory. Tests whether agents can maintain coherence over hundreds of turns. Essential for teams validating memory system performance.

Paper

Letta (formerly MemGPT) Documentation.

The production platform built on the MemGPT research, offering tiered memory and stateful agents. Provides APIs for building agents with persistent memory. Recommended for production memory-augmented assistants.

Tool

Zep: Long-Term Memory for AI Assistants.

A memory layer for AI applications with automatic summarization, entity extraction, and temporal awareness. Integrates with popular LLM frameworks. Practical choice for adding memory to existing chatbots.

Tool