Building Conversational AI with LLMs and Agents
Appendix L: LangChain: Chains, Agents, and Retrieval

Memory and Conversation Management

Big Picture

Large language models are stateless: each API call starts from scratch with no recollection of previous turns. To build conversational applications, you must explicitly manage chat history and decide how much context to include in every request. LangChain: Chains, Agents, and Retrieval provides several memory strategies, from simple buffer storage to intelligent summarization, along with a modern LCEL-native approach using RunnableWithMessageHistory.

1. The Memory Problem

When you send a message to an LLM API, the model has no built-in awareness of what was said before. If a user asks "What is Python?" and then follows up with "What are its main libraries?", the model has no idea what "its" refers to unless you include the previous exchange in the new request. Memory management is the art of deciding which previous messages to include, how to compress them, and where to store them between requests.

LangChain: Chains, Agents, and Retrieval offers two generations of memory tools. The legacy memory classes (such as ConversationBufferMemory) were designed for the old LLMChain API. The modern approach uses RunnableWithMessageHistory with LCEL chains. Both are covered here because legacy memory classes appear frequently in existing codebases and tutorials.

2. Legacy Memory Classes

ConversationBufferMemory

The simplest memory strategy stores every message in a growing list. ConversationBufferMemory keeps the complete conversation history and injects it into the prompt on every call. This works well for short conversations but will eventually exceed the model's context window.

This example demonstrates buffer memory with a legacy chain.

from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)

conversation = ConversationChain(llm=model, memory=memory)

# First turn
response1 = conversation.invoke({"input": "My name is Alice."})
print(response1["response"])

# Second turn: the model remembers the name
response2 = conversation.invoke({"input": "What is my name?"})
print(response2["response"])  # "Your name is Alice."

# Inspect stored messages
print(memory.chat_memory.messages)
Hello, Alice! It's nice to meet you. How can I help you today? Your name is Alice. [HumanMessage(content='My name is Alice.'), AIMessage(content="Hello, Alice! It's nice to meet you. How can I help you today?"), HumanMessage(content='What is my name?'), AIMessage(content='Your name is Alice.')]

ConversationSummaryMemory

For long conversations, storing every message becomes impractical. ConversationSummaryMemory uses the LLM itself to maintain a running summary of the conversation. Each time new messages are added, the summary is updated to incorporate the new information while keeping the token count bounded.

The following example shows how summary memory compresses a multi-turn conversation.

from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationSummaryMemory(llm=model, return_messages=True)

# Simulate adding conversation turns
memory.save_context(
    {"input": "I'm building a chatbot for customer support."},
    {"output": "That sounds like a great project! What domain will it cover?"}
)
memory.save_context(
    {"input": "It will handle billing questions for a SaaS product."},
    {"output": "Got it. You'll want to integrate with your billing API."}
)
memory.save_context(
    {"input": "We use Stripe for payments."},
    {"output": "Stripe has excellent APIs. I can help you set up the integration."}
)

# The memory stores a summary, not individual messages
print(memory.load_memory_variables({}))
# Output: a condensed summary mentioning the chatbot, billing, SaaS, and Stripe
{'history': [SystemMessage(content='The human is building a customer support chatbot for a SaaS product that handles billing questions. They use Stripe for payment processing. The AI offered to help set up the Stripe API integration.')]}
Tip

Summary memory incurs an extra LLM call each time the summary is updated. For production systems, consider updating the summary only every N turns or when the buffer exceeds a token threshold rather than on every single turn.

ConversationTokenBufferMemory

A practical middle ground between full buffer and summary: ConversationTokenBufferMemory keeps recent messages up to a specified token limit. When the limit is exceeded, the oldest messages are dropped. This gives the model a sliding window of recent context without any summarization cost.

from langchain.memory import ConversationTokenBufferMemory
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")
memory = ConversationTokenBufferMemory(
    llm=model,
    max_token_limit=200,  # Keep only the most recent ~200 tokens
    return_messages=True
)

# Add several turns
for i in range(10):
    memory.save_context(
        {"input": f"Question {i}: Tell me about topic {i}."},
        {"output": f"Here is information about topic {i}. " * 5}
    )

# Only the most recent messages (within 200 tokens) are retained
messages = memory.load_memory_variables({})
print(f"Messages in memory: {len(messages['history'])}")
Messages in memory: 4
Legacy API

The memory classes shown above (ConversationBufferMemory, ConversationSummaryMemory, ConversationTokenBufferMemory) are part of LangChain: Chains, Agents, and Retrieval's legacy API. They work with ConversationChain and LLMChain. For new projects using LCEL, use RunnableWithMessageHistory instead (covered next).

3. Modern Memory with LCEL

The recommended approach for managing conversation history in LCEL chains is RunnableWithMessageHistory. This wrapper adds message history to any runnable by looking up (or creating) a history object keyed by a session ID. It is more flexible than legacy memory because it works with any LCEL chain and supports pluggable storage backends.

This example builds a conversational chain with LCEL and in-memory history storage.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

# In-memory store: maps session_id to ChatMessageHistory
store = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# Build the chain with a MessagesPlaceholder for history
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | model | StrOutputParser()

# Wrap with message history
conversational_chain = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history"
)

# Use session IDs to maintain separate conversations
config = {"configurable": {"session_id": "user-123"}}

response1 = conversational_chain.invoke(
    {"input": "My favorite language is Rust."},
    config=config
)
print(response1)

response2 = conversational_chain.invoke(
    {"input": "What is my favorite language?"},
    config=config
)
print(response2)  # "Your favorite language is Rust."
That's great! Rust is a fantastic language, known for its memory safety guarantees and performance. What would you like to know about it? Your favorite language is Rust.
Diagram
Figure L.2.1: The RunnableWithMessageHistory wrapper intercepts each invocation, loads prior messages from the history store, injects them into the prompt via MessagesPlaceholder, and saves the new exchange after the chain completes.

4. Persistent Chat History Stores

The in-memory ChatMessageHistory is fine for development and testing, but production applications need durable storage. LangChain: Chains, Agents, and Retrieval's community package provides integrations for Redis, PostgreSQL, MongoDB, DynamoDB, and many other backends. Each integration implements the same BaseChatMessageHistory interface, so swapping backends requires changing only the get_session_history function.

This example shows how to use Redis as a persistent history store.

from langchain_community.chat_message_histories import RedisChatMessageHistory

def get_session_history(session_id: str):
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://localhost:6379/0",
        ttl=3600  # Expire conversations after 1 hour
    )

# The rest of the chain setup is identical to the in-memory example.
# Just pass this function to RunnableWithMessageHistory.

5. Strategies for Long Conversations

As conversations grow, you must choose a strategy for keeping the prompt within the model's context window. The table below summarizes the main approaches and their trade-offs.

Conversation Memory Strategies
Strategy Mechanism Pros Cons
Full buffer Keep all messages Complete context, no information loss Hits context limit on long conversations
Token buffer Drop oldest messages beyond token limit Predictable token usage, no extra LLM calls Loses early context abruptly
Summary LLM summarizes older messages Retains key facts from entire conversation Extra LLM calls, potential summary drift
Summary + buffer Summary of old messages, full recent messages Best of both: context and recency Most complex to implement

For LCEL-based chains, you can implement any of these strategies by customizing the function that loads and trims history before injection. The following example shows a simple token-trimming approach.

from langchain_core.messages import trim_messages

# Trim history to fit within a token budget before passing to the model
trimmer = trim_messages(
    max_tokens=1000,
    strategy="last",          # Keep the most recent messages
    token_counter=model,       # Use the model's tokenizer for accurate counts
    include_system=True,       # Always keep the system message
    allow_partial=False        # Don't cut messages in half
)

# Insert the trimmer into the chain
chain_with_trimming = (
    {"history": trimmer, "input": lambda x: x["input"]}
    | prompt
    | model
    | StrOutputParser()
)
Key Insight

For new LCEL-based projects, use RunnableWithMessageHistory with a pluggable history store and trim_messages for context window management. Legacy memory classes are still useful for understanding older codebases but should not be the foundation of new applications.