Production agent systems must survive process restarts, handle long-running workflows, and support debugging of past executions. LangGraph's checkpointing system persists the graph state after every node, enabling recovery from failures, time-travel debugging, and the interrupt mechanism covered in Section M.3. This section covers the built-in checkpointer backends, thread management, replay from arbitrary checkpoints, and strategies for production deployment.
1. What Is Checkpointing?
A checkpoint is a serialized snapshot of the graph's entire state at a specific point in execution. Every time a node completes, LangGraph saves a checkpoint through the configured backend. These checkpoints serve three purposes: they enable the interrupt/resume workflow, they allow replaying a graph from any past state, and they provide a detailed audit trail for debugging.
Each checkpoint is identified by a thread ID (representing a conversation or workflow session) and a sequential checkpoint ID within that thread. Together, these form a complete execution history.
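To make the (thread ID, checkpoint ID) addressing concrete, here is a minimal toy model written with the standard library only. It is a sketch of the idea, not LangGraph's implementation: the class ToyCheckpointStore and its methods are hypothetical names, and integer IDs are a simplification (the real library hands out opaque string IDs).

```python
import json
from collections import defaultdict


class ToyCheckpointStore:
    """Toy model: one ordered list of serialized snapshots per thread."""

    def __init__(self) -> None:
        self._threads: dict[str, list[str]] = defaultdict(list)

    def put(self, thread_id: str, state: dict) -> int:
        """Serialize the state and return its checkpoint ID within the thread."""
        self._threads[thread_id].append(json.dumps(state))
        return len(self._threads[thread_id]) - 1

    def get(self, thread_id: str, checkpoint_id: int) -> dict:
        """Load one snapshot by its (thread_id, checkpoint_id) pair."""
        return json.loads(self._threads[thread_id][checkpoint_id])

    def history(self, thread_id: str) -> list[dict]:
        """Return every snapshot for a thread, oldest first."""
        return [json.loads(raw) for raw in self._threads[thread_id]]


store = ToyCheckpointStore()
store.put("user-alice", {"messages": ["Hello"]})
latest = store.put("user-alice", {"messages": ["Hello", "Echo: Hello"]})
store.put("user-bob", {"messages": ["Hi there"]})

print(latest)                          # 1: second checkpoint in Alice's thread
print(store.get("user-alice", 0))      # {'messages': ['Hello']}
print(len(store.history("user-bob")))  # 1: Bob's thread is independent
```

Because each snapshot is stored in serialized form, later changes to the live state object cannot alter a saved checkpoint, which is what makes the history trustworthy as a record.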
2. MemorySaver: In-Memory Checkpointing
The simplest checkpointer stores state in a Python dictionary. This is ideal for development, testing, and short-lived scripts where persistence across restarts is not needed.
```python
from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class ChatState(TypedDict):
    # add_messages appends new messages instead of overwriting the list
    messages: Annotated[list, add_messages]


def echo(state: ChatState) -> dict:
    last = state["messages"][-1].content
    return {"messages": [("assistant", f"Echo: {last}")]}


builder = StateGraph(ChatState)
builder.add_node("echo", echo)
builder.add_edge(START, "echo")
builder.add_edge("echo", END)

# Compile with in-memory checkpointing
memory = MemorySaver()
graph = builder.compile(checkpointer=memory)

# Each thread_id represents an independent conversation
config_a = {"configurable": {"thread_id": "user-alice"}}
config_b = {"configurable": {"thread_id": "user-bob"}}

graph.invoke({"messages": [HumanMessage(content="Hello")]}, config=config_a)
graph.invoke({"messages": [HumanMessage(content="Hi there")]}, config=config_b)

# States are isolated per thread
state_a = graph.get_state(config_a)
state_b = graph.get_state(config_b)
print("Alice:", state_a.values["messages"][-1].content)
print("Bob:", state_b.values["messages"][-1].content)
```

Using MemorySaver for in-memory checkpointing. Each thread_id maintains an independent state history. This backend is fast but loses all data when the process exits.

Warning: MemorySaver stores everything in process memory. A server restart, crash, or deployment wipes all state. For any workflow that must survive restarts or that involves human-in-the-loop delays, use a persistent backend such as SqliteSaver or PostgresSaver.

3. SqliteSaver: File-Based Persistence
For local development and single-server deployments, SqliteSaver writes checkpoints to a SQLite database file. State survives process restarts with zero infrastructure overhead.

```python
# Requires the langgraph-checkpoint-sqlite package
from langgraph.checkpoint.sqlite import SqliteSaver

# Store checkpoints in a local file
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "persistent-thread-1"}}
    graph.invoke(
        {"messages": [HumanMessage(content="Remember this message.")]},
        config=config,
    )

# Later, even after restarting the process:
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "persistent-thread-1"}}
    state = graph.get_state(config)
    print("Recovered:", state.values["messages"][-1].content)
```

Using SqliteSaver for file-based persistence. The checkpoint file (checkpoints.db) survives process restarts, making this suitable for local development and prototyping.

3.1 Async SQLite
For async applications, use AsyncSqliteSaver, which provides the same file-based persistence with non-blocking I/O. The API is identical except that you use async with and await.

```python
import asyncio

from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver


async def run():
    async with AsyncSqliteSaver.from_conn_string("checkpoints.db") as saver:
        graph = builder.compile(checkpointer=saver)
        result = await graph.ainvoke(
            {"messages": [HumanMessage(content="Async hello")]},
            config={"configurable": {"thread_id": "async-1"}},
        )
        print(result["messages"][-1].content)


asyncio.run(run())
```

The async variant of SqliteSaver. Use this when your application runs on an async event loop (FastAPI, for instance) to avoid blocking I/O.

4. PostgresSaver: Production-Grade Persistence
For multi-server deployments, PostgresSaver stores checkpoints in a PostgreSQL database. Multiple application instances can share the same checkpoint store, enabling load balancing and horizontal scaling.

```python
# Requires the langgraph-checkpoint-postgres package
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:password@localhost:5432/langgraph_checkpoints"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    # Create the checkpoint tables on first run
    checkpointer.setup()

    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "prod-session-42"}}
    graph.invoke(
        {"messages": [HumanMessage(content="Production message")]},
        config=config,
    )
```

Using PostgresSaver for production-grade persistence. Call setup() once to create the required database tables. After that, checkpoints are stored automatically.

Tip: For high-throughput applications, pass a connection pool instead of a connection string. The psycopg_pool package provides both a synchronous ConnectionPool (for PostgresSaver) and an AsyncConnectionPool (for AsyncPostgresSaver). Connection pooling reduces database overhead and improves response times under load.

5. Thread Management
Threads are the organizational unit for checkpoints. Each thread represents an independent execution context, typically mapped to a user session or conversation. You manage threads through the thread_id field in the config.

```python
# Continue an existing conversation
config = {"configurable": {"thread_id": "user-alice"}}

# Add a second message to Alice's thread
graph.invoke(
    {"messages": [HumanMessage(content="What did I say before?")]},
    config=config,
)

# The graph sees the full message history for this thread
state = graph.get_state(config)
for msg in state.values["messages"]:
    print(f"{msg.type}: {msg.content}")
```

Invoking the graph with an existing thread_id loads that thread's saved state and appends the new input, giving the graph access to the full conversation history.

6. Time-Travel Debugging
Every checkpoint in a thread is preserved, creating a full execution timeline. You can list all checkpoints for a thread, inspect the state at any point, and even replay the graph from a past checkpoint. This is invaluable for debugging agent behavior.

```python
# List all checkpoints for a thread
config = {"configurable": {"thread_id": "user-alice"}}
history = list(graph.get_state_history(config))
print(f"Found {len(history)} checkpoints")

for i, snapshot in enumerate(history):
    print(f"  [{i}] checkpoint_id={snapshot.config['configurable']['checkpoint_id']}")
    print(f"      next={snapshot.next}")
    msg_count = len(snapshot.values.get("messages", []))
    print(f"      messages={msg_count}")

# Replay from an earlier checkpoint
earlier = history[-1]  # history is newest first, so this is the first checkpoint
replay_config = {
    "configurable": {
        "thread_id": "user-alice",
        "checkpoint_id": earlier.config["configurable"]["checkpoint_id"],
    }
}
result = graph.invoke(
    {"messages": [HumanMessage(content="Let me try a different approach.")]},
    config=replay_config,
)
print("Replayed:", result["messages"][-1].content)
```

get_state_history returns all checkpoints in reverse chronological order. You can resume from any checkpoint by including its checkpoint_id in the config, effectively "rewinding" the conversation.

```text
Thread: user-alice
═══════════════════════════════════════════════════
checkpoint_0      checkpoint_1      checkpoint_2
┌───────────┐     ┌───────────┐     ┌───────────┐
│ messages: │────▶│ messages: │────▶│ messages: │
│  [msg_0]  │     │  [msg_0,  │     │  [msg_0,  │
│           │     │   msg_1]  │     │   msg_1,  │
└───────────┘     └───────────┘     │   msg_2]  │
      ▲                             └───────────┘
      │
Replay from here creates a new branch
```

Replaying from checkpoint_0 creates a new branch of execution without modifying the original checkpoint history.

Note: Replaying from a past checkpoint does not delete or modify subsequent checkpoints. Instead, it creates a new branch in the execution history. This means you can safely experiment with alternative paths without losing the original execution trace.
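The branching behavior can be pictured as a tree in which every checkpoint records its parent. The following standard-library sketch models that idea under simplifying assumptions: Checkpoint and ToyThread are hypothetical names invented for illustration, and the structure is not meant to mirror how LangGraph stores checkpoints internally.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: int
    parent_id: Optional[int]
    messages: tuple


class ToyThread:
    """Toy model of one thread whose checkpoints link to their parents."""

    def __init__(self) -> None:
        self._ids = count()
        self.checkpoints: dict[int, Checkpoint] = {}

    def append(self, parent_id: Optional[int], message: str) -> int:
        """Create a child of parent_id; existing checkpoints are never mutated."""
        inherited = self.checkpoints[parent_id].messages if parent_id is not None else ()
        new_id = next(self._ids)
        self.checkpoints[new_id] = Checkpoint(new_id, parent_id, inherited + (message,))
        return new_id


thread = ToyThread()
c0 = thread.append(None, "msg_0")
c1 = thread.append(c0, "msg_1")
c2 = thread.append(c1, "msg_2")

# Replay from checkpoint_0 with a different input: this adds a sibling of c1.
branch = thread.append(c0, "alternative msg_1")

print(thread.checkpoints[branch].messages)  # ('msg_0', 'alternative msg_1')
print(thread.checkpoints[c2].messages)      # ('msg_0', 'msg_1', 'msg_2')
```

The original line of execution (c0, c1, c2) is still intact after the replay, which is the property that makes experimenting from a past checkpoint safe.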
7. Choosing a Checkpointer
The following table summarizes when to use each backend:
| Backend | Persistence | Multi-Process | Best For |
|---|---|---|---|
| MemorySaver | None (in-process only) | No | Unit tests, quick prototyping |
| SqliteSaver | File on disk | No | Local development, single-server apps |
| PostgresSaver | Database | Yes | Production, multi-server deployment |
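The table's decision rule can be written down as a small helper. This is only an illustrative sketch: choose_checkpointer and its two parameters are hypothetical names, and the function returns the backend's class name as a string rather than constructing anything.

```python
def choose_checkpointer(needs_persistence: bool, multi_process: bool) -> str:
    """Map two deployment traits onto the backend names from the table."""
    if multi_process:
        # Several processes or servers must share one checkpoint store.
        return "PostgresSaver"
    if needs_persistence:
        # A single process that must survive restarts.
        return "SqliteSaver"
    # Tests and throwaway scripts.
    return "MemorySaver"


print(choose_checkpointer(needs_persistence=False, multi_process=False))
print(choose_checkpointer(needs_persistence=True, multi_process=False))
print(choose_checkpointer(needs_persistence=True, multi_process=True))
```

Checking the multi-process requirement first reflects the table: only the database-backed option is marked as safe to share across processes, so that constraint outranks the others.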
8. Recovery from Failures
Because checkpoints are saved after every node, a failure mid-execution does not lose all progress. When the graph is re-invoked with the same thread_id, it automatically loads the last successful checkpoint. If the failure was transient (for instance, a temporary API error), the graph retries from the failed node rather than starting over.
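The resume-from-last-checkpoint idea can be demonstrated without any framework. The sketch below is a toy pipeline, not LangGraph code: the node functions, the checkpoints dictionary, and run are all hypothetical, and the "transient error" is simulated by failing only on the first call.

```python
calls = {"fetch": 0, "flaky": 0, "summarize": 0}


def fetch(state: dict) -> dict:
    calls["fetch"] += 1
    return {**state, "data": "raw"}


def flaky(state: dict) -> dict:
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise ConnectionError("transient API error")  # fails on the first call only
    return {**state, "enriched": True}


def summarize(state: dict) -> dict:
    calls["summarize"] += 1
    return {**state, "summary": "done"}


NODES = [fetch, flaky, summarize]
checkpoints: dict[str, tuple[int, dict]] = {}  # thread_id -> (next node index, state)


def run(thread_id: str, initial: dict) -> dict:
    """Run the pipeline, resuming from the thread's last checkpoint if one exists."""
    step, state = checkpoints.get(thread_id, (0, initial))
    while step < len(NODES):
        state = NODES[step](state)  # if this raises, the saved checkpoint is untouched
        step += 1
        checkpoints[thread_id] = (step, state)  # save after every completed node
    return state


try:
    run("job-1", {})
except ConnectionError:
    pass  # the first attempt dies inside `flaky`, after `fetch` was checkpointed

result = run("job-1", {})  # the second attempt resumes at `flaky`
print(result)  # {'data': 'raw', 'enriched': True, 'summary': 'done'}
print(calls)   # {'fetch': 1, 'flaky': 2, 'summarize': 1}
```

The call counts show the benefit: fetch ran once even though the workflow was started twice, because its result had already been checkpointed before the failure.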
For workflows that span hours or days (such as multi-step approval processes), this resilience is essential. Combining persistent checkpointing with the interrupt mechanism from Section M.3 gives you a durable workflow engine that can survive server restarts, deployments, and infrastructure failures.
Beyond recovery, checkpoints serve as a complete audit trail of every decision an agent made. For compliance-sensitive applications (financial services, healthcare, legal), this trail provides evidence of what the agent did, when it did it, and what state it had at each step. Pair this with the human approval logs from Section M.3 for a comprehensive audit system.
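One way to turn a checkpoint history into an audit log is to flatten each snapshot into a JSON-friendly row. The sketch below assumes snapshot objects that expose config, next, values, and created_at, as the objects yielded by get_state_history do. The helper name audit_records is hypothetical, and the stand-in snapshots use made-up IDs and timestamps so the example runs without a graph.

```python
import json
from types import SimpleNamespace


def audit_records(snapshots) -> list[dict]:
    """Turn state snapshots (newest first) into JSON-friendly rows, oldest first."""
    rows = []
    for snap in reversed(list(snapshots)):
        rows.append({
            "checkpoint_id": snap.config["configurable"]["checkpoint_id"],
            "created_at": getattr(snap, "created_at", None),
            "next": list(snap.next),
            "message_count": len(snap.values.get("messages", [])),
        })
    return rows


# Stand-in snapshots with invented values; with LangGraph you would pass
# graph.get_state_history(config) instead.
fake_history = [
    SimpleNamespace(
        config={"configurable": {"checkpoint_id": "b"}},
        next=(),
        values={"messages": ["hi", "Echo: hi"]},
        created_at="2024-01-01T00:00:01Z",
    ),
    SimpleNamespace(
        config={"configurable": {"checkpoint_id": "a"}},
        next=("echo",),
        values={"messages": ["hi"]},
        created_at="2024-01-01T00:00:00Z",
    ),
]

rows = audit_records(fake_history)
for row in rows:
    print(json.dumps(row))
```

Recording a message count rather than the messages themselves keeps the rows serializable and avoids copying conversation content into a second store; a real audit system may need the full payload, depending on its compliance requirements.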
9. Summary
This section covered LangGraph's checkpointing system, from lightweight in-memory storage to production-grade PostgreSQL persistence. You learned how to configure checkpointers, manage conversation threads, perform time-travel debugging, and leverage checkpoints for failure recovery. In Section M.5, you will see how to compose multiple graphs into multi-agent systems using subgraphs and supervisor patterns.