Section M.4: Persistence, Checkpointing, and Recovery | Building Conversational AI with LLMs and Agents

Big Picture

Production agent systems must survive process restarts, handle long-running workflows, and support debugging of past executions. LangGraph's checkpointing system persists the graph state after every node, enabling recovery from failures, time-travel debugging, and the interrupt mechanism covered in Section M.3. This section covers the built-in checkpointer backends, thread management, replay from arbitrary checkpoints, and strategies for production deployment.

1. What Is Checkpointing?

A checkpoint is a serialized snapshot of the graph's entire state at a specific point in execution. Every time a node completes, LangGraph saves a checkpoint through the configured backend. These checkpoints serve three purposes: they enable the interrupt/resume workflow, they allow replaying a graph from any past state, and they provide a detailed audit trail for debugging.

Each checkpoint is identified by a thread ID (representing a conversation or workflow session) and a sequential checkpoint ID within that thread. Together, these form a complete execution history.
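To make the (thread ID, checkpoint ID) addressing concrete, here is a minimal standard-library sketch. ToyCheckpointStore and its put/get methods are hypothetical names invented for illustration; they are not LangGraph's API and do not reflect how its backends are implemented. The sketch only shows why serializing each snapshot and numbering it within its thread yields an independent, replayable history per thread.

```python
import json
from collections import defaultdict


class ToyCheckpointStore:
    """Illustrative only: maps thread_id -> ordered list of serialized snapshots."""

    def __init__(self):
        self._threads = defaultdict(list)

    def put(self, thread_id, state):
        # Serialize so later mutation of `state` cannot alter the stored snapshot.
        history = self._threads[thread_id]
        checkpoint_id = len(history)  # sequential within this thread
        history.append(json.dumps(state))
        return checkpoint_id

    def get(self, thread_id, checkpoint_id=None):
        history = self._threads[thread_id]
        if checkpoint_id is None:
            checkpoint_id = len(history) - 1  # default to the latest snapshot
        return json.loads(history[checkpoint_id])


store = ToyCheckpointStore()
state = {"messages": ["Hello"]}
first = store.put("user-alice", state)
state["messages"].append("Echo: Hello")
second = store.put("user-alice", state)
store.put("user-bob", {"messages": ["Hi there"]})

print(first, second)                    # checkpoint IDs within Alice's thread
print(store.get("user-alice", first))   # the earlier snapshot is unchanged
print(store.get("user-alice"))          # the latest snapshot
```

Because each snapshot is stored in serialized form, appending to the live state after the first put does not change what the first checkpoint returns.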

2. MemorySaver: In-Memory Checkpointing

The simplest checkpointer stores state in a Python dictionary. This is ideal for development, testing, and short-lived scripts where persistence across restarts is not needed.

from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

class ChatState(TypedDict):
    messages: Annotated[list, add_messages]

def echo(state: ChatState) -> dict:
    last = state["messages"][-1].content
    return {"messages": [("assistant", f"Echo: {last}")]}

builder = StateGraph(ChatState)
builder.add_node("echo", echo)
builder.add_edge(START, "echo")
builder.add_edge("echo", END)

# Compile with in-memory checkpointing
memory = MemorySaver()
graph = builder.compile(checkpointer=memory)

# Each thread_id represents an independent conversation
config_a = {"configurable": {"thread_id": "user-alice"}}
config_b = {"configurable": {"thread_id": "user-bob"}}

graph.invoke({"messages": [HumanMessage(content="Hello")]}, config=config_a)
graph.invoke({"messages": [HumanMessage(content="Hi there")]}, config=config_b)

# States are isolated per thread
state_a = graph.get_state(config_a)
state_b = graph.get_state(config_b)
print("Alice:", state_a.values["messages"][-1].content)
print("Bob:", state_b.values["messages"][-1].content)

Alice: Echo: Hello
Bob: Echo: Hi there

Code Fragment M.4.1: Using MemorySaver for in-memory checkpointing. Each thread_id maintains an independent state history. This backend is fast but loses all data when the process exits.

Warning: MemorySaver Is Not for Production

MemorySaver stores everything in process memory. A server restart, crash, or deployment wipes all state. For any workflow that must survive restarts or that involves human-in-the-loop delays, use a persistent backend such as SqliteSaver or PostgresSaver.

3. SqliteSaver: File-Based Persistence

For local development and single-server deployments, SqliteSaver writes checkpoints to a SQLite database file. State survives process restarts with zero infrastructure overhead.

from langgraph.checkpoint.sqlite import SqliteSaver

# Store checkpoints in a local file
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": "persistent-thread-1"}}
    graph.invoke(
        {"messages": [HumanMessage(content="Remember this message.")]},
        config=config
    )

# Later, even after restarting the process:
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": "persistent-thread-1"}}
    state = graph.get_state(config)
    print("Recovered:", state.values["messages"][-1].content)

Recovered: Echo: Remember this message.

Code Fragment M.4.2: Using SqliteSaver for file-based persistence. The checkpoint file (checkpoints.db) survives process restarts, making this suitable for local development and prototyping.

3.1 Async SQLite

For async applications, use AsyncSqliteSaver, which provides the same file-based persistence with non-blocking I/O. The API is identical except that you use async with and await.

import asyncio

from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver

async def run():
    async with AsyncSqliteSaver.from_conn_string("checkpoints.db") as saver:
        graph = builder.compile(checkpointer=saver)
        result = await graph.ainvoke(
            {"messages": [HumanMessage(content="Async hello")]},
            config={"configurable": {"thread_id": "async-1"}}
        )
        print(result["messages"][-1].content)

# Drive the coroutine from synchronous code
asyncio.run(run())

Echo: Async hello

Code Fragment M.4.3: The async variant of SqliteSaver. Use this when your application runs on an async event loop (FastAPI, for instance) to avoid blocking I/O.

4. PostgresSaver: Production-Grade Persistence

For multi-server deployments, PostgresSaver stores checkpoints in a PostgreSQL database. Multiple application instances can share the same checkpoint store, enabling load balancing and horizontal scaling.

from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:password@localhost:5432/langgraph_checkpoints"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    # Create the checkpoint tables on first run
    checkpointer.setup()

    graph = builder.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": "prod-session-42"}}
    graph.invoke(
        {"messages": [HumanMessage(content="Production message")]},
        config=config
    )

Code Fragment M.4.4: Using PostgresSaver for production-grade persistence. Call setup() once to create the required database tables. After that, checkpoints are stored automatically.

Note: Connection Pooling

For high-throughput applications, pass a connection pool instead of a connection string. Both synchronous and asynchronous pools from the psycopg_pool package (ConnectionPool and AsyncConnectionPool) are supported. Connection pooling reduces database overhead and improves response times under load.

5. Thread Management

Threads are the organizational unit for checkpoints. Each thread represents an independent execution context, typically mapped to a user session or conversation. You manage threads through the thread_id field in the config.

# Continue an existing conversation
# (assumes `graph` is the MemorySaver-compiled graph from Code Fragment M.4.1)
config = {"configurable": {"thread_id": "user-alice"}}

# Add a second message to Alice's thread
graph.invoke(
    {"messages": [HumanMessage(content="What did I say before?")]},
    config=config
)

# The graph sees the full message history for this thread
state = graph.get_state(config)
for msg in state.values["messages"]:
    print(f"{msg.type}: {msg.content}")

human: Hello
ai: Echo: Hello
human: What did I say before?
ai: Echo: What did I say before?

Code Fragment M.4.5: Continuing a conversation in an existing thread. LangGraph automatically loads the latest checkpoint for the given thread_id and appends the new input, giving the graph access to the full conversation history.

6. Time-Travel Debugging

Every checkpoint in a thread is preserved, creating a full execution timeline. You can list all checkpoints for a thread, inspect the state at any point, and even replay the graph from a past checkpoint. This is invaluable for debugging agent behavior.

# List all checkpoints for a thread
config = {"configurable": {"thread_id": "user-alice"}}
history = list(graph.get_state_history(config))

print(f"Found {len(history)} checkpoints")
for i, snapshot in enumerate(history):
    print(f"  [{i}] checkpoint_id={snapshot.config['configurable']['checkpoint_id']}")
    print(f"       next={snapshot.next}")
    msg_count = len(snapshot.values.get("messages", []))
    print(f"       messages={msg_count}")

# Replay from an earlier checkpoint
earlier = history[-1]  # the first checkpoint
replay_config = {
    "configurable": {
        "thread_id": "user-alice",
        "checkpoint_id": earlier.config["configurable"]["checkpoint_id"]
    }
}
result = graph.invoke(
    {"messages": [HumanMessage(content="Let me try a different approach.")]},
    config=replay_config
)
print("Replayed:", result["messages"][-1].content)

Found 3 checkpoints
  [0] checkpoint_id=ckpt_003
       next=()
       messages=4
  [1] checkpoint_id=ckpt_002
       next=()
       messages=2
  [2] checkpoint_id=ckpt_001
       next=('echo',)
       messages=1
Replayed: Echo: Let me try a different approach.

Code Fragment M.4.6: Time-travel debugging. get_state_history returns all checkpoints in reverse chronological order. You can resume from any checkpoint by including its checkpoint_id in the config, effectively "rewinding" the conversation.

Thread: user-alice
═══════════════════════════════════════════════════

  checkpoint_0     checkpoint_1     checkpoint_2
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ messages: │───▶│ messages: │───▶│ messages: │
  │   [msg_0] │    │ [msg_0,  │    │ [msg_0,  │
  │           │    │  msg_1]  │    │  msg_1,  │
  └──────────┘    └──────────┘    │  msg_2]  │
       ▲                          └──────────┘
       │
  Replay from here creates a new branch

Figure M.4.1: Time-travel debugging. Replaying from checkpoint_0 creates a new branch of execution without modifying the original checkpoint history.

Key Insight: Replay Does Not Overwrite History

Replaying from a past checkpoint does not delete or modify subsequent checkpoints. Instead, it creates a new branch in the execution history. This means you can safely experiment with alternative paths without losing the original execution trace.
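The branching behavior can be pictured as a tree in which every checkpoint records its parent. The sketch below is a toy model under that assumption; Checkpoint, ToyThread, and append are hypothetical names, not LangGraph types, and the real storage layout may differ. It shows that forking from an early checkpoint adds a sibling path and leaves the original chain intact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: int
    parent_id: Optional[int]
    messages: tuple  # immutable, so a stored checkpoint can never be edited


class ToyThread:
    """Illustrative only: an append-only tree of checkpoints."""

    def __init__(self):
        self.checkpoints = {}
        self._next_id = 0

    def append(self, parent_id, new_message):
        # Start from the parent's messages (or empty for the root) and add one.
        base = self.checkpoints[parent_id].messages if parent_id is not None else ()
        cp = Checkpoint(self._next_id, parent_id, base + (new_message,))
        self.checkpoints[cp.checkpoint_id] = cp
        self._next_id += 1
        return cp.checkpoint_id


thread = ToyThread()
c0 = thread.append(None, "msg_0")
c1 = thread.append(c0, "msg_1")
c2 = thread.append(c1, "msg_2")

# "Replay" from c0: a new child of c0 is created; c1 and c2 are untouched.
branch = thread.append(c0, "alt_msg_1")

print(thread.checkpoints[c2].messages)      # original path
print(thread.checkpoints[branch].messages)  # new branch
```

Since nothing is ever deleted or rewritten, both the original path and the experimental branch remain available for inspection.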

7. Choosing a Checkpointer

The following table summarizes when to use each backend:

LangGraph Checkpointer Backends

Backend	Persistence	Multi-Process	Best For
`MemorySaver`	None (in-process only)	No	Unit tests, quick prototyping
`SqliteSaver`	File on disk	No	Local development, single-server apps
`PostgresSaver`	Database	Yes	Production, multi-server deployment
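The table reduces to two questions: must state outlive the process, and must several processes share it? As a rough sketch, the decision can be written as a small function. choose_checkpointer is a hypothetical helper for illustration, not part of LangGraph, and it returns the backend name as a string rather than constructing anything.

```python
def choose_checkpointer(needs_persistence: bool, multi_process: bool) -> str:
    """Return the backend name suggested by the table above (illustrative only)."""
    if multi_process:
        # Several servers or workers must share one checkpoint store.
        return "PostgresSaver"
    if needs_persistence:
        # One process, but state must survive restarts.
        return "SqliteSaver"
    # Tests and throwaway prototypes.
    return "MemorySaver"


print(choose_checkpointer(needs_persistence=False, multi_process=False))
print(choose_checkpointer(needs_persistence=True, multi_process=False))
print(choose_checkpointer(needs_persistence=True, multi_process=True))
```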

8. Recovery from Failures

Because a checkpoint is saved after every node, a failure mid-execution does not lose all progress. When the graph is re-invoked with the same thread_id (typically with None as the input, so that the interrupted run resumes instead of a new turn starting), it loads the last successful checkpoint. If the failure was transient (for instance, an API error inside the failed node), execution picks up at that node rather than starting over.

For workflows that span hours or days (such as multi-step approval processes), this resilience is essential. Combining persistent checkpointing with the interrupt mechanism from Section M.3 gives you a durable workflow engine that can survive server restarts, deployments, and infrastructure failures.
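The resume-from-last-checkpoint idea can be sketched with nothing but the standard library. This is a toy model, not LangGraph's recovery code: run_workflow, the STEPS list, the checkpoints table layout, and the fail_at and log parameters are all assumptions made for illustration. An in-memory SQLite database stands in for a persistent store; a real restart scenario would use a file path or a database server.

```python
import json
import sqlite3

STEPS = ["fetch", "summarize", "notify"]  # hypothetical workflow nodes


def run_workflow(conn, thread_id, log, fail_at=None):
    """Run STEPS in order, saving progress after each one; skip steps already done."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)"
    )
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    state = json.loads(row[0]) if row else {"done": []}

    for step in STEPS:
        if step in state["done"]:
            continue  # completed in an earlier attempt
        if step == fail_at:
            raise RuntimeError(f"transient failure in {step}")
        log.append(step)  # record that this step actually executed
        state["done"].append(step)
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        conn.commit()  # progress is durable before the next step starts
    return state


conn = sqlite3.connect(":memory:")
first_log, second_log = [], []

try:
    run_workflow(conn, "approval-1", first_log, fail_at="summarize")
except RuntimeError as exc:
    print("First attempt failed:", exc)

final_state = run_workflow(conn, "approval-1", second_log)
print("Executed on retry:", second_log)
print("Completed steps:", final_state["done"])
```

The retry executes only the steps that had not been checkpointed, which is the property that makes long-running workflows tolerable to restart.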

Key Insight: Checkpoints as an Audit Trail

Beyond recovery, checkpoints serve as a complete audit trail of every decision an agent made. For compliance-sensitive applications (financial services, healthcare, legal), this trail provides evidence of what the agent did, when it did it, and what state it had at each step. Pair this with the human approval logs from Section M.3 for a comprehensive audit system.
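One way to picture an audit export is to flatten each checkpoint into a JSON line recording what, when, and in which state. The sketch below assumes a hypothetical AuditRecord shape and to_audit_lines helper; the field names, node names, and timestamps are invented for illustration and do not correspond to LangGraph's checkpoint schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    thread_id: str
    step: int
    node: str              # which node produced this state
    recorded_at: datetime  # when the snapshot was taken
    state: dict            # the state at that step


def to_audit_lines(records):
    """Render records oldest-first as JSON lines for an append-only log."""
    ordered = sorted(records, key=lambda r: r.step)
    return [
        json.dumps(
            {
                "thread_id": r.thread_id,
                "step": r.step,
                "node": r.node,
                "recorded_at": r.recorded_at.isoformat(),
                "state": r.state,
            },
            sort_keys=True,
        )
        for r in ordered
    ]


records = [
    AuditRecord("loan-7", 2, "approve",
                datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
                {"decision": "approved"}),
    AuditRecord("loan-7", 1, "classify",
                datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
                {"risk": "low"}),
]

lines = to_audit_lines(records)
for line in lines:
    print(line)
```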

9. Summary

This section covered LangGraph's checkpointing system, from lightweight in-memory storage to production-grade PostgreSQL persistence. You learned how to configure checkpointers, manage conversation threads, perform time-travel debugging, and leverage checkpoints for failure recovery. In Section M.5, you will see how to compose multiple graphs into multi-agent systems using subgraphs and supervisor patterns.