Part X: Frontiers
Chapter 34: Emerging Architectures & Scaling Frontiers

Memory as a Computational Primitive

"A mind without memory is a mind that solves the same puzzle every morning, always surprised by the answer."

— Frontier, a Memory-Haunted AI Agent
Big Picture

The context window is the fundamental bottleneck of modern LLMs. No matter how large (128K, 1M, or 10M tokens), a finite context window means that information eventually falls out of scope. Memory systems that persist information across conversations, sessions, and tasks transform an LLM from a stateless function into a stateful agent. This section examines memory not as an engineering convenience but as a computational primitive that determines what an LLM system can and cannot compute.

Prerequisites

This section builds on the transformer architecture (especially the KV cache), RAG foundations from Chapter 15, and the AI agents material from Chapter 22. The discussion of reasoning in Section 34.5 provides useful context for understanding why memory matters for computation.

[Figure: a robot brain in cross-section showing three memory compartments: a small bright workspace for active items, a medium filing cabinet with labeled folders from past conversations, and a vast library representing trained knowledge, with a conveyor belt between them.]
Figure 34.6.1: Three tiers of LLM memory: the context window (working memory), episodic memory from past sessions, and the vast parametric knowledge encoded during training, each with different capacity and access patterns.

1. The Memory Problem in LLMs

Have you ever had a conversation with a chatbot, spent twenty minutes explaining your project, and then watched it forget everything the moment you opened a new session? That experience captures the fundamental memory problem in LLMs. A standard transformer-based LLM is, at its core, a stateless function: given a sequence of tokens, it produces a probability distribution over the next token. Between API calls, nothing persists. The context window creates an illusion of memory within a single conversation, but this "memory" has severe limitations.

First, it is bounded: once the context fills up, older information must be evicted or the conversation must be truncated. Second, it is expensive: the attention mechanism has $O(n^2)$ complexity in context length, making long contexts slow and costly. Third, it is volatile: when the session ends, everything is lost. Fourth, it is unstructured: all information in the context (instructions, conversation history, retrieved documents, tool outputs) competes for the same attention capacity, with no mechanism for prioritization.

These limitations are not merely inconvenient; they are computationally fundamental. A system with bounded memory can only recognize regular languages (in the formal language theory sense). To compute anything more complex, some form of external, unbounded memory is required. This is the theoretical motivation for memory-augmented LLM systems.

2. A Taxonomy of Memory Architectures

Memory systems for LLMs can be organized along several dimensions: persistence (ephemeral vs. long-term), structure (unstructured text vs. structured databases), access pattern (sequential vs. random), and integration point (in-context vs. external retrieval). The following taxonomy covers the major approaches as of 2026.

In-Context Memory

The simplest form of memory is stuffing information into the context window. This is what every chatbot does when it prepends conversation history to the current message. The approach is straightforward and requires no additional infrastructure, but it scales poorly. As the conversation grows, older messages must be summarized or dropped.

Improvements to in-context memory include sliding window approaches (keeping only the last $k$ messages), summarization (periodically compressing old messages into a summary), and hierarchical context management (maintaining a short-term buffer of recent messages plus a long-term summary). These techniques are widely used in production systems, as discussed in Chapter 13.
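The hybrid buffer-plus-summary approach can be sketched in a few lines. This is a minimal illustration, not a production design: the `summarize` function is a placeholder for an actual LLM summarization call, and messages are counted rather than tokenized.

```python
# Sketch of hierarchical context management: keep the last `window` messages
# verbatim and fold older ones into a running summary. The `summarize`
# function is a stand-in for an LLM summarization call.

def summarize(summary: str, evicted: list[str]) -> str:
    # Placeholder: a real system would call an LLM here.
    return (summary + " " + " ".join(evicted)).strip()

class WindowedHistory:
    def __init__(self, window: int = 4):
        self.window = window
        self.messages: list[str] = []
        self.summary: str = ""

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Fold overflow into the long-term summary.
        while len(self.messages) > self.window:
            self.summary = summarize(self.summary, [self.messages.pop(0)])

    def context(self) -> str:
        parts = []
        if self.summary:
            parts.append(f"[summary] {self.summary}")
        parts.extend(self.messages)
        return "\n".join(parts)

history = WindowedHistory(window=2)
for msg in ["hello", "I work on RAG", "using pgvector", "for legal docs"]:
    history.add(msg)
print(history.context())
```

The same skeleton generalizes: replacing the FIFO eviction with importance-based eviction, or the flat summary with a hierarchy of summaries, recovers the other variants described above.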

Retrieval-Augmented Memory

Retrieval-Augmented Generation (RAG), covered extensively in Chapters 15 through 18, can be understood as an external memory system. The vector store serves as long-term memory, the retrieval step serves as memory recall, and the context window serves as working memory where retrieved information is processed. This framing highlights RAG's strengths (unbounded storage, persistent across sessions) and weaknesses (retrieval may miss relevant information, retrieved content competes for context space with other inputs).

MemGPT and Virtual Context Management

MemGPT (Packer et al., 2023), later developed into the Letta framework, introduced an operating-system-inspired approach to LLM memory. The core insight is that the context window is analogous to RAM in a computer: fast but limited. Long-term storage (a database, file system, or vector store) is analogous to disk: slow but vast. MemGPT manages the movement of information between these tiers automatically, using the LLM itself to decide what to page in and page out of context.

The system maintains several memory tiers: a system prompt (analogous to the OS kernel, always in context), a working context (a fixed-size scratchpad the model can read and write), a conversation buffer (recent messages), and an archival store (unbounded long-term memory backed by a vector database). The LLM can issue function calls to search archival memory, insert new memories, and update the working context. This gives the model explicit control over its own memory management.

The following example demonstrates the MemGPT-style memory management pattern.

# MemGPT-style hierarchical memory: core, working, conversation, and archival tiers.
# Each tier has a token budget; the manager evicts from conversation (FIFO)
# and searches archival memory via embeddings when context overflows.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryTier:
    """A single tier in a hierarchical memory system."""
    name: str
    max_tokens: int
    content: list[str] = field(default_factory=list)

    @property
    def current_tokens(self) -> int:
        return sum(len(s.split()) for s in self.content)

    @property
    def is_full(self) -> bool:
        return self.current_tokens >= self.max_tokens

@dataclass
class HierarchicalMemory:
    """MemGPT-inspired hierarchical memory manager.

    Tiers:
    - core: always in context (system instructions, user profile)
    - working: mutable scratchpad the model controls
    - conversation: recent message buffer (FIFO eviction)
    - archival: unlimited long-term store with semantic search
    """
    core: MemoryTier = field(
        default_factory=lambda: MemoryTier("core", max_tokens=500)
    )
    working: MemoryTier = field(
        default_factory=lambda: MemoryTier("working", max_tokens=300)
    )
    conversation: MemoryTier = field(
        default_factory=lambda: MemoryTier("conversation", max_tokens=2000)
    )
    archival: list[dict] = field(default_factory=list)

    def add_message(self, role: str, content: str) -> list[str]:
        """Add a message to conversation memory, evicting if needed."""
        evicted = []
        self.conversation.content.append(f"[{role}]: {content}")

        # Evict oldest messages when buffer is full
        while self.conversation.is_full and len(self.conversation.content) > 1:
            old = self.conversation.content.pop(0)
            evicted.append(old)

        return evicted

    def update_working_memory(self, key: str, value: str):
        """Update a key-value pair in working memory (model-controlled)."""
        # Remove existing entry with same key
        self.working.content = [
            entry for entry in self.working.content
            if not entry.startswith(f"{key}:")
        ]
        self.working.content.append(f"{key}: {value}")

    def archive(self, content: str, metadata: dict | None = None):
        """Store content in archival (long-term) memory."""
        self.archival.append({
            "content": content,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {},
        })

    def search_archival(self, query: str, top_k: int = 3) -> list[str]:
        """Search archival memory (simplified; real system uses embeddings)."""
        # Production implementation would use vector similarity
        results = []
        query_terms = set(query.lower().split())
        for entry in self.archival:
            entry_terms = set(entry["content"].lower().split())
            overlap = len(query_terms & entry_terms)
            if overlap > 0:
                results.append((overlap, entry["content"]))
        results.sort(reverse=True)
        return [r[1] for r in results[:top_k]]

    def build_context(self) -> str:
        """Assemble the full context from all memory tiers."""
        sections = []
        sections.append("=== CORE MEMORY ===")
        sections.extend(self.core.content)
        sections.append("\n=== WORKING MEMORY ===")
        sections.extend(self.working.content)
        sections.append("\n=== CONVERSATION ===")
        sections.extend(self.conversation.content)
        return "\n".join(sections)

# Usage example
memory = HierarchicalMemory()
memory.core.content = [
    "You are a helpful research assistant.",
    "User preference: concise answers with citations.",
]
memory.update_working_memory("current_topic", "memory architectures in LLMs")
memory.update_working_memory("user_expertise", "graduate-level ML")

# Simulate a conversation
memory.add_message("user", "Tell me about MemGPT.")
memory.add_message("assistant", "MemGPT uses OS-inspired virtual memory...")

# Archive important information for long-term recall
memory.archive(
    "User is writing a survey paper on memory-augmented transformers.",
    metadata={"source": "conversation", "importance": "high"},
)

print(memory.build_context())
=== CORE MEMORY ===
You are a helpful research assistant.
User preference: concise answers with citations.

=== WORKING MEMORY ===
current_topic: memory architectures in LLMs
user_expertise: graduate-level ML

=== CONVERSATION ===
[user]: Tell me about MemGPT.
[assistant]: MemGPT uses OS-inspired virtual memory...
Code Fragment 34.6.1: MemGPT-style hierarchical memory: core, working, conversation, and archival tiers.

3. Working Memory vs. Long-Term Memory

The distinction between working memory and long-term memory, borrowed from cognitive psychology, maps onto concrete engineering choices in LLM systems.

Working memory is the information actively available for computation during a single inference step. In an LLM, this is the context window. Its capacity is limited (measured in tokens), its access is fast (the attention mechanism processes all tokens in parallel), and its contents are determined by the current task context.

Long-term memory is information that persists across inference steps, conversations, and sessions. In an LLM system, this includes vector stores, databases, knowledge graphs, and file systems. Its capacity is effectively unbounded, but access requires an explicit retrieval step (a tool call, a RAG query, or a memory management action), which introduces latency and the possibility of retrieval failure.

Memory Consolidation

In neuroscience, memory consolidation is the process by which short-term memories are transformed into stable long-term memories, typically during sleep. An analogous process is needed in LLM systems: important information from conversations should be distilled and stored in long-term memory, while unimportant details should be allowed to decay.

Current approaches to memory consolidation in LLM systems include periodic summarization (running the conversation through an LLM to extract key facts and decisions), importance scoring (using a model to rate the significance of each piece of information before deciding whether to archive it), and deduplication (merging new information with existing memories to avoid redundancy). These are still ad hoc engineering solutions; a principled theory of what to remember and what to forget is an open research problem.
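A consolidation pass combining importance scoring and deduplication might be sketched as follows. The keyword-based scorer and the 0.8 overlap threshold are crude stand-ins for the LLM-based judgments a real system would make:

```python
# Sketch of a consolidation pass: score each candidate fact, archive only
# those above a threshold, and skip near-duplicates of existing memories.
# The `importance` scorer is a keyword heuristic standing in for an LLM call.

ARCHIVE_THRESHOLD = 0.5
SIGNAL_WORDS = {"prefers", "decided", "deadline", "always", "never"}

def importance(fact: str) -> float:
    # Crude proxy: facts containing decision/preference language score high.
    words = set(fact.lower().split())
    return min(1.0, float(len(words & SIGNAL_WORDS)))

def consolidate(facts: list[str], archive: list[str]) -> list[str]:
    for fact in facts:
        if importance(fact) < ARCHIVE_THRESHOLD:
            continue  # let unimportant details decay
        # Deduplicate: skip facts whose words largely overlap an existing memory.
        words = set(fact.lower().split())
        if any(len(words & set(m.lower().split())) / len(words) > 0.8
               for m in archive):
            continue
        archive.append(fact)
    return archive

archive: list[str] = []
consolidate(["User prefers dark mode", "We chatted about the weather"], archive)
```

In a production system, both the scorer and the duplicate check would be replaced by model calls or embedding similarity, but the control flow (score, filter, merge) stays the same.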

Forgetting Curves and Memory Decay

Ebbinghaus's forgetting curve, one of the oldest findings in experimental psychology, shows that memory retention decays exponentially over time without reinforcement. Some memory systems for LLMs implement an analogous decay mechanism: memories that are never accessed gradually lose priority in retrieval, while frequently accessed memories are reinforced. This prevents the archival store from growing without bound and ensures that the most relevant memories are retrieved first.

The practical implementation typically uses a recency-weighted retrieval score:

$$\text{score}(m, q) = \alpha \cdot \text{sim}(m, q) + (1 - \alpha) \cdot \text{recency}(m)$$

where $\text{sim}(m, q)$ is the semantic similarity between memory $m$ and query $q$, $\text{recency}(m)$ is a time-decay factor (e.g., exponential decay with a configurable half-life), and $\alpha$ controls the tradeoff between relevance and freshness. This is similar to the retrieval scoring discussed in Chapter 17, extended with temporal dynamics.
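The scoring rule above is straightforward to implement. In this sketch, the similarity term is cosine similarity over precomputed embedding vectors, and the half-life and alpha values are illustrative defaults, not recommendations:

```python
import math

# Recency-weighted retrieval score:
#   score(m, q) = alpha * sim(m, q) + (1 - alpha) * recency(m)
# with cosine similarity and exponential time decay.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recency(age_days: float, half_life_days: float = 30.0) -> float:
    # Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life.
    return math.exp(-math.log(2) * age_days / half_life_days)

def score(mem_vec, query_vec, age_days, alpha: float = 0.7) -> float:
    return alpha * cosine(mem_vec, query_vec) + (1 - alpha) * recency(age_days)

# A fresh but less relevant memory vs. an older but perfectly relevant one:
fresh = score([1.0, 0.0], [0.6, 0.8], age_days=1)
stale = score([0.6, 0.8], [0.6, 0.8], age_days=60)
```

With alpha = 0.7, two half-lives of decay are not enough to outweigh a large relevance gap, which is usually the desired behavior: recency breaks ties, relevance dominates.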

4. Memory-Augmented Transformers

Beyond system-level memory management, several research directions explore integrating memory directly into the transformer architecture itself.

Memorizing Transformers

Wu et al. (2022) introduced Memorizing Transformers, which augment the standard attention mechanism with a non-differentiable memory of past key-value pairs stored in an external database. During inference, the model first attends to its local context (the standard context window) and then performs an approximate nearest-neighbor lookup into the external memory, attending to the most relevant past key-value pairs. This effectively extends the context length by orders of magnitude without increasing the quadratic attention cost.

The approach has a clean theoretical interpretation: it separates the computational cost of attention (which remains bounded by the local context size) from the information capacity of the model (which scales with the size of the external memory). The tradeoff is that retrieval from external memory introduces a quantization error (only the top-$k$ nearest neighbors are retrieved, so some relevant information may be missed).
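The two-stage pattern (local attention, then attention over retrieved neighbors) can be illustrated with a toy example. Everything here is a simplification: the real model retrieves thousands of key-value pairs with approximate nearest-neighbor search, and the gate mixing the two results is learned rather than fixed:

```python
import math

# Toy sketch of Memorizing-Transformer-style attention: a query attends over
# its local context and, separately, over the top-k keys retrieved from an
# external store of past (key, value) pairs; the results are mixed by a gate.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    # Standard dot-product attention over (keys, values).
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def knn_attend(query, memory, k=2):
    # Retrieve the k keys with highest dot product, then attend over only those.
    scored = sorted(memory,
                    key=lambda kv: -sum(q * x for q, x in zip(query, kv[0])))
    top = scored[:k]
    return attend(query, [kv[0] for kv in top], [kv[1] for kv in top])

query = [1.0, 0.0]
local_keys = [[0.9, 0.1], [0.2, 0.8]]
local_values = [[1.0, 0.0], [0.0, 1.0]]
memory = [([1.0, 0.0], [5.0, 0.0]),
          ([0.0, 1.0], [0.0, 5.0]),
          ([0.8, 0.2], [4.0, 1.0])]

gate = 0.5  # learned in the real model
local = attend(query, local_keys, local_values)
external = knn_attend(query, memory, k=2)
output = [gate * l + (1 - gate) * e for l, e in zip(local, external)]
```

Note that the compute cost of `knn_attend` depends on k, not on the total size of `memory`, which is exactly the capacity/cost separation described above.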

Recurrent Memory Transformers

Bulatov et al. (2023) proposed Recurrent Memory Transformers (RMT), which process long documents in segments, maintaining a set of special "memory tokens" that carry information from one segment to the next. After processing each segment, the model updates the memory tokens, which are then prepended to the next segment. This creates a recurrent information pathway that allows the model to carry information across arbitrarily long sequences, limited only by the information capacity of the memory tokens.

RMT demonstrates that a transformer can process sequences of over 1 million tokens with a fixed context window of a few thousand tokens, by compressing information into memory tokens at each step. The compression is lossy, so the model cannot recall arbitrary details from early in the sequence, but it can maintain high-level context and track important entities across the full length.
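The segment-wise control flow can be sketched as follows. The `update_memory` function is a stand-in for the transformer forward pass, which in RMT reads and rewrites the memory tokens jointly with each segment; here an exponential moving average plays that role:

```python
# Toy sketch of the Recurrent Memory Transformer control flow: a document is
# processed in fixed-size segments, and a small set of memory tokens (here a
# single running vector) is updated after each segment and carried forward.

SEGMENT = 4  # tokens per segment

def update_memory(memory: list[float], segment: list[float]) -> list[float]:
    # Placeholder for the transformer: an exponential moving average that
    # compresses the segment into the fixed-size memory state (lossy).
    seg_mean = sum(segment) / len(segment)
    return [0.5 * m + 0.5 * seg_mean for m in memory]

def process_document(tokens: list[float], memory_size: int = 2) -> list[float]:
    memory = [0.0] * memory_size
    for start in range(0, len(tokens), SEGMENT):
        segment = tokens[start:start + SEGMENT]
        # In RMT, [memory tokens] + [segment] pass through the model together;
        # here we model only the memory update.
        memory = update_memory(memory, segment)
    return memory

# The final memory state carries (lossy) information about every segment.
final_memory = process_document([1.0] * 8 + [3.0] * 8)
```

The key property is visible even in this toy: the memory size is constant regardless of document length, so compute per segment is bounded, and fidelity for early segments degrades gracefully rather than falling off a cliff.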

Infini-Attention

Google's Infini-Attention (Munkhdalai et al., 2024) combines local attention within a segment with a compressive memory that summarizes all previous segments. The compressive memory uses an associative memory scheme (similar to a linear attention mechanism operating over a compressed representation of past context). This allows the model to balance detailed attention over recent tokens with compressed access to the full history.
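A toy version of the associative store and retrieve operations conveys the idea. The real mechanism applies a nonlinearity to queries and keys and uses a delta-rule update; this sketch uses plain outer products, under which retrieval is exact only for orthogonal keys:

```python
# Toy sketch of a compressive associative memory in the spirit of
# Infini-Attention: past (key, value) pairs are folded into a fixed-size
# matrix via outer products; retrieval is a normalized matrix-vector product.
# Compression is lossy whenever keys are not orthogonal.

def store(M, z, key, value):
    # M <- M + key value^T ;  z <- z + key  (z is the normalizer vector)
    for i, k in enumerate(key):
        for j, v in enumerate(value):
            M[i][j] += k * v
        z[i] += k
    return M, z

def retrieve(M, z, query):
    denom = sum(q * zi for q, zi in zip(query, z)) or 1.0
    return [sum(q * M[i][j] for i, q in enumerate(query)) / denom
            for j in range(len(M[0]))]

dim_k, dim_v = 2, 2
M = [[0.0] * dim_v for _ in range(dim_k)]
z = [0.0] * dim_k
store(M, z, [1.0, 0.0], [7.0, 0.0])
store(M, z, [0.0, 1.0], [0.0, 9.0])

# With orthogonal keys, retrieval recovers the stored value exactly.
recalled = retrieve(M, z, [1.0, 0.0])
```

The memory matrix M never grows, no matter how many pairs are stored: this is the "bounded memory, unbounded sequence" property, paid for with interference between similar keys.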

Key Insight

All memory architectures face the same fundamental tradeoff: information capacity versus retrieval fidelity. Storing everything is possible (disk is cheap), but retrieving the right piece of information at the right time is hard. The context window provides perfect retrieval fidelity (everything in context is attended to) but limited capacity. External memory provides unlimited capacity but imperfect retrieval. The art of memory system design lies in managing this tradeoff for the specific application's needs. For a conversational assistant, approximate retrieval of past topics may suffice. For a legal document analyst, precise retrieval of specific clauses is essential.

5. Theoretical Foundations: Memory and Computation

The connection between memory and computation is one of the deepest insights in computer science, and it applies directly to the question of what LLMs can compute.

Turing Completeness and External Memory

A standard transformer with a fixed context window is not Turing-complete. It can only recognize a subset of regular languages (Merrill, 2023). This is because the model's "state" at any point is a fixed-size vector (the hidden states across the context window), which cannot encode an arbitrarily large amount of information.

Adding unbounded external memory, and giving the model the ability to read from and write to that memory, makes the system Turing-complete. This is directly analogous to the distinction between a finite automaton (bounded memory, limited computation) and a Turing machine (unbounded tape, universal computation). MemGPT-style systems, which give the model explicit read/write access to an external store, are Turing-complete in principle (though not in practice, due to imperfect memory management by the LLM).

This theoretical perspective explains why memory-augmented LLM systems can solve problems that pure LLMs cannot. A pure LLM cannot reliably sort a list of 1000 numbers (it would require maintaining and updating state far beyond its context capacity). An LLM with external memory and code execution tools can sort lists of any size by writing a sorting algorithm and executing it.

The Neural Turing Machine Legacy

The idea of augmenting neural networks with external memory dates back to the Neural Turing Machine (Graves et al., 2014) and its successor, the Differentiable Neural Computer (DNC). These architectures included a differentiable memory matrix that the network could read from and write to using soft attention. While they were ahead of their time and difficult to train at scale, the core insight, that neural networks need external memory to perform complex computation, has been validated by the success of modern memory-augmented LLM systems.

The key difference between NTMs/DNCs and modern memory-augmented LLMs is the memory management mechanism. NTMs used learned, continuous read/write operations. Modern systems use discrete tool calls (search, insert, update) managed by the LLM itself. The discrete approach is less elegant but far more practical: it integrates with existing infrastructure (databases, vector stores, file systems) and scales to production workloads.

6. Design Patterns for Memory in Production

For practitioners building memory-augmented LLM systems, several design patterns have emerged as best practices.

Pattern 1: Tiered Memory with Explicit Promotion

Maintain separate tiers (ephemeral conversation buffer, session-level working memory, persistent long-term store) with explicit rules for when information is promoted from one tier to the next. Promotion criteria might include user confirmation ("Remember that I prefer formal language"), importance scoring by the model, or frequency of access.

Pattern 2: Entity-Centric Memory

Rather than storing raw conversation turns, extract and maintain structured records for key entities (users, projects, documents, decisions). Each entity record is updated incrementally as new information arrives. This approach avoids the problem of retrieving outdated information, because each entity's record always reflects the latest known state.
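A minimal sketch of the pattern, with illustrative class and field names, shows why stale values cannot survive: updates overwrite fields in place rather than appending new records:

```python
# Sketch of entity-centric memory: keep one record per entity and overwrite
# fields as new information arrives, so retrieval always sees the latest
# known state rather than a pile of potentially contradictory turns.
from datetime import datetime, timezone

class EntityMemory:
    def __init__(self):
        self.entities: dict[str, dict] = {}

    def update(self, entity: str, field: str, value: str) -> None:
        record = self.entities.setdefault(entity, {})
        # Overwrite rather than append: stale values never survive.
        record[field] = {
            "value": value,
            "updated": datetime.now(timezone.utc).isoformat(),
        }

    def lookup(self, entity: str) -> dict:
        return {k: v["value"] for k, v in self.entities.get(entity, {}).items()}

mem = EntityMemory()
mem.update("user", "city", "New York")
mem.update("user", "city", "London")  # the move overwrites the old value
```

A production version would typically also keep a timestamped history of overwritten values for auditing, but the retrieval path would still read only the current state.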

Pattern 3: Memory as Tool

Expose memory operations (search, store, update, delete) as tools that the agent can call explicitly, following the patterns described in Section 22.2. This gives the agent full control over when and what to remember, at the cost of additional token overhead for the tool-calling protocol. This is the approach used by MemGPT/Letta and has proven effective for long-running agent tasks.
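The following sketch illustrates the pattern using OpenAI-style function schemas as one common convention; the schema contents, tool names, and dispatcher are assumptions for illustration, not a specific framework's API:

```python
# Sketch of the memory-as-tool pattern: memory operations are declared as
# tool schemas the model can call, and a dispatcher executes the call the
# model emits (here simulated as a JSON string).
import json

store: list[str] = []

MEMORY_TOOLS = [
    {
        "name": "memory_insert",
        "description": "Store a fact in long-term memory.",
        "parameters": {"type": "object",
                       "properties": {"fact": {"type": "string"}},
                       "required": ["fact"]},
    },
    {
        "name": "memory_search",
        "description": "Search long-term memory for matching facts.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
]

def dispatch(tool_call: str):
    call = json.loads(tool_call)
    if call["name"] == "memory_insert":
        store.append(call["arguments"]["fact"])
        return "stored"
    if call["name"] == "memory_search":
        q = call["arguments"]["query"].lower()
        return [f for f in store if q in f.lower()]
    raise ValueError(f"unknown tool: {call['name']}")

# Simulate the model emitting tool calls:
dispatch('{"name": "memory_insert", '
         '"arguments": {"fact": "User prefers formal language."}}')
hits = dispatch('{"name": "memory_search", "arguments": {"query": "formal"}}')
```

The token overhead mentioned above is visible here: the tool schemas themselves must sit in context on every turn, which is the price paid for giving the model explicit control.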

Tip

Start with the simplest memory architecture that meets your needs. For most chatbot applications, a sliding window of recent messages plus a summary of older messages is sufficient. Add retrieval-augmented memory only when you need to recall information from specific past conversations. Add entity-centric memory only when your application tracks entities that evolve over time. Premature memory architecture complexity is a common source of bugs and latency in production systems.

7. Open Challenges

Several challenges remain open in the design of memory systems for LLMs; the exercises below explore two of them: allocating capacity across memory tiers, and handling memories that may have gone stale.

Exercise 34.6.1: Memory Tier Design

You are building a personal assistant that needs to remember user preferences across months of conversations. The context window is 128K tokens.

  1. Design a memory tier allocation: how many tokens would you assign to core memory, working memory, conversation buffer, and archival retrieval? Justify each choice.
  2. The user says "I moved from New York to London." Which memory tier(s) must be updated, and what happens to the old "lives in New York" entry?
  3. Describe a failure mode where the memory system could produce a worse outcome than a stateless system with no memory at all.
Answer:

  1. A reasonable allocation might be: core (2K tokens for system instructions and user profile), working (8K for current task scratchpad), conversation buffer (48K for recent messages), and archival retrieval (70K reserved for fetched long-term memories). The exact split depends on typical conversation length and how frequently archival search is needed.
  2. The core memory (user profile) must be updated to replace "New York" with "London." The working memory may need a note about the transition. The old entry should be overwritten, not appended, to avoid contradiction. Optionally, the archival store can retain a timestamped record ("lived in New York until [date]") for historical context.
  3. If the memory system retrieves a stale or contradictory memory (e.g., an old preference the user has since changed), the assistant could confidently give wrong advice that it would not have given without memory. Stale memories are worse than no memory because they carry false authority.

Exercise 34.6.2: Memory and Hallucination

An agent retrieves a memory that says "User's favorite restaurant is Chez Marie." The memory was stored 8 months ago. The agent has no mechanism to verify whether this information is still current.

  1. Should the agent state this as fact, qualify it with uncertainty, or ask the user to confirm? Explain the tradeoff for each approach.
  2. Design a simple "memory freshness" scoring function that discounts old memories. What decay rate would you use, and why?
Answer:

  1. Stating it as fact risks being wrong if preferences have changed. Qualifying with uncertainty ("Last time we discussed this, your favorite was Chez Marie; is that still the case?") is safest but adds friction to every interaction. Silently trusting the memory is risky for time-sensitive facts but acceptable for stable preferences. The best approach depends on the cost of being wrong: restaurant recommendations are low-stakes (qualify occasionally), but medical information should always be confirmed.
  2. An exponential decay like score = base_importance * exp(-lambda * days_since_stored) works well. A lambda of 0.003 gives a half-life of about 230 days, meaning an 8-month-old memory retains roughly 50% of its original weight. Preferences that are confirmed by the user should have their timestamp refreshed.

Key Takeaways

Memory determines what an LLM system can compute: a fixed context window bounds the model to a limited class of computations, while read/write access to unbounded external storage makes an agentic system Turing-complete in principle. Every architecture in this section, from context stuffing to Infini-Attention, manages the same tradeoff between information capacity and retrieval fidelity. In practice, start with the simplest tiered design that meets the application's needs and add retrieval, consolidation, and entity-centric structure only when requirements demand them.

What Comes Next

In the next section, Section 34.7: Mechanistic Interpretability at Scale, we examine how researchers are opening the "black box" of frontier models, using sparse autoencoders and circuit analysis to understand what individual neurons and circuits compute.

References & Further Reading
Memory-Augmented Architectures

Packer, C., Wooders, S., Lin, K., et al. (2023). "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560.

Introduces an operating-system inspired approach to LLM memory management, using virtual context and paging to handle conversations far exceeding the context window. The primary reference for the OS analogy developed in this section.

📄 Paper

Wu, Y., Rabe, M. N., Hutchins, D., and Szegedy, C. (2022). "Memorizing Transformers." ICLR 2022.

Augments transformers with a kNN-based external memory that the model can attend to during inference. Demonstrates that explicit retrieval from stored representations improves factual accuracy on long documents.

📄 Paper

Bulatov, A., Kuratov, Y., and Burtsev, M. (2023). "Scaling Transformer to 1M Tokens and Beyond with RMT." arXiv:2304.11062.

Proposes Recurrent Memory Transformer, which uses special memory tokens that persist across segments. Shows how to extend effective context length to millions of tokens with bounded compute.

📄 Paper

Munkhdalai, T., Faruqui, M., and Gopal, S. (2024). "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention." arXiv:2404.07143.

Introduces a compressive memory mechanism that integrates into standard attention, enabling bounded-memory processing of unbounded sequences. Represents a particularly elegant approach to the infinite context problem.

📄 Paper
Foundational Theory

Graves, A., Wayne, G., and Danihelka, I. (2014). "Neural Turing Machines." arXiv:1410.5401.

The pioneering work on differentiable external memory for neural networks, establishing the paradigm of learned read/write operations. This paper provides the historical foundation for all memory-augmented architectures discussed in this section.

📄 Paper

Merrill, W. (2023). "Formal Languages and the NLP Black Box." EMNLP Tutorial.

A comprehensive tutorial on the formal language theory perspective on neural networks, connecting computational complexity to architectural choices. Provides the theoretical tools for analyzing whether memory extensions push models toward Turing completeness.

📄 Paper

Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology. (Translated by Ruger and Bussenius, 1913.)

The classic study establishing the forgetting curve and spacing effect in human memory. Provides the cognitive science analogy that motivates the distinction between working memory and long-term memory in AI systems.

📖 Book