Libraries and Frameworks

Section 40.2

"A conversation memory primitive is just a ring buffer that learned to take itself too seriously."

KVKV, Reluctant-Context-Janitor AI Agent
Big Picture

Libraries and frameworks for conversational AI split into four layers: conversation memory primitives (how do you store and replay the last N turns? how do you summarize beyond the context window? how do you remember user facts across sessions?); orchestration frameworks for chat agents (LangChain, LlamaIndex chat engines, OpenAI Assistants, Anthropic conversations) that wire the LLM to tools, memory, and state; chat UI frameworks (Chainlit, Streamlit, Gradio, AG-UI) that turn an agent into a usable application; and voice-agent runtimes (Pipecat, LiveKit Agents, Vocode, Deepgram SDK) that pair STT, LLM, and TTS into a real-time pipeline. This section is a tour of the libraries in 2026 with opinionated pick-when guidance.

Prerequisites

This section assumes the conversational-AI platforms from Section 40.1 and the LLM agent framework vocabulary from Section 14.2.

The shape of the stack in 2026 is converging: a chat agent is a state object that carries a message history, optional summarized long-term memory, structured user facts, tool definitions, and a system prompt; the runtime feeds new user messages plus the accumulated state to an LLM, processes any tool calls, and appends the response to history. Almost every framework below is a variation on that shape, differing in how aggressively it owns the state, what it does about retrieval and tools, and whether it ships a UI.
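As a rough illustration of that shape, the sketch below models the state object and the step that turns it into a model request. The names `ChatState`, `build_request`, and `record_turn` are hypothetical and belong to no particular framework; the request dictionary is a simplified stand-in for whatever payload a given SDK expects.

```python
from dataclasses import dataclass, field


@dataclass
class ChatState:
    """Everything the runtime carries between turns (field names are illustrative)."""
    system_prompt: str
    history: list[dict] = field(default_factory=list)         # verbatim recent turns
    summary: str = ""                                         # summarized long-term memory
    user_facts: dict[str, str] = field(default_factory=dict)  # structured user facts
    tools: list[dict] = field(default_factory=list)           # tool definitions


def build_request(state: ChatState, user_msg: str) -> dict:
    """Assemble the payload for one model call from the accumulated state."""
    system_parts = [state.system_prompt]
    if state.summary:
        system_parts.append(f"Conversation summary:\n{state.summary}")
    if state.user_facts:
        facts = "\n".join(f"- {k}: {v}" for k, v in state.user_facts.items())
        system_parts.append(f"Known user facts:\n{facts}")
    return {
        "system": "\n\n".join(system_parts),
        "messages": [*state.history, {"role": "user", "content": user_msg}],
        "tools": state.tools,
    }


def record_turn(state: ChatState, user_msg: str, reply: str) -> None:
    """Append the finished exchange to history so the next turn sees it."""
    state.history.append({"role": "user", "content": user_msg})
    state.history.append({"role": "assistant", "content": reply})
```

Frameworks differ mainly in who owns an object like `ChatState`: a hosted API keeps it server-side, while a stateless SDK leaves it entirely to the application.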

40.2.1 Conversation memory primitives

A friendly chatbot robot with a literal goldfish in a small fishbowl balanced on its head, trying hard to remember something. Speech bubble: 'something about a dog?'
Figure 40.2.1: Most chatbots remember by accident. Real conversation memory is short-term (in hand), medium-term (on the hook), and long-term (in the user profile).

The single hardest design problem in long-running conversational AI is "what do we remember and what do we throw away?" Memory primitives are the building blocks that make that choice principled.
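One way to make the remember-or-discard choice explicit is a verbatim window whose evicted turns are folded into a running summary instead of being dropped. The sketch below is a minimal version under stated assumptions: `WindowedMemory` and `naive_summarize` are hypothetical names, and the summarizer simply concatenates evicted turns where a real system would most likely call an LLM.

```python
from collections import deque
from typing import Callable


class WindowedMemory:
    """Keep the last `max_turns` messages verbatim; fold older ones into a summary."""

    def __init__(self, max_turns: int, summarize: Callable[[str, list[dict]], str]):
        self.window = deque()
        self.max_turns = max_turns
        self.summarize = summarize
        self.summary = ""

    def append(self, role: str, content: str) -> None:
        self.window.append({"role": role, "content": content})
        overflow = []
        while len(self.window) > self.max_turns:
            overflow.append(self.window.popleft())   # oldest messages leave first
        if overflow:
            self.summary = self.summarize(self.summary, overflow)

    def context(self) -> dict:
        return {"summary": self.summary, "recent": list(self.window)}


def naive_summarize(previous: str, evicted: list[dict]) -> str:
    """Placeholder summarizer: appends evicted turns to the previous summary."""
    lines = [f"{m['role']}: {m['content']}" for m in evicted]
    return "\n".join(filter(None, [previous, *lines]))
```

The window size and the summarization trigger are the two knobs here; the eviction policy is where a principled design would differ from an accidental one.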

Library Shortcut: mem0 as the drop-in long-term memory layer

The primitives above (verbatim window, summary, KG, vector) all assume you will assemble them yourself. mem0ai (Mem0, 2024) collapses that scaffolding into a two-method API: memory.add(messages, user_id=...) runs fact extraction and contradiction resolution, and memory.search(query, user_id=...) returns the relevant facts ready to splice into the next system prompt. Prefer mem0ai when you want long-term personalization today without authoring a memory pipeline; graduate to Zep when you need a separately-versioned memory service or multi-agent shared memory.

# Install first: pip install mem0ai
from mem0 import Memory
from openai import OpenAI

memory, llm = Memory(), OpenAI()

def chat(user_id: str, user_msg: str) -> str:
    # Retrieve the stored facts most relevant to this message
    facts = memory.search(query=user_msg, user_id=user_id)["results"]
    system_prompt = "User facts:\n" + "\n".join(f["memory"] for f in facts)
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_msg}],
    ).choices[0].message.content
    # Store the finished turn so facts can be extracted from it
    memory.add([{"role": "user", "content": user_msg},
                {"role": "assistant", "content": reply}], user_id=user_id)
    return reply
Code Fragment 40.2.1.1: A complete add-and-search memory loop wired into a chat turn.
Key Insight
Memory is the most under-engineered part of most chatbots

The single most common production complaint about chatbots is "it doesn't remember what I told it." This is almost never a model capability problem and almost always an application-design problem: the system prompt does not surface the previous summary, the verbatim window is too short, or the structured user facts are not being fetched. Memory architecture (what to summarize, when to extract a fact, when to retrieve) deserves at least as much design effort as the prompt itself.

40.2.2 Conversation orchestration frameworks

A small cartoon cemetery in autumn. Three small tombstones labelled Chains, LCEL, AgentExecutor. Next to them a fresh sapling with a sign LangGraph 2024. A gardener-robot waters it.
Figure 40.2.2: Framework half-life is ~18 months. Pin the API surface you actually use; the abstractions above it will get renamed at least once before your roadmap ships.

Orchestration frameworks own the "what happens between user message and bot response" loop: tool selection, retrieval, multi-step reasoning, memory updates, streaming.
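Stripped of any framework, that loop can be sketched in a few lines. The version below is illustrative only: `run_turn` is a hypothetical name, and the model is assumed to be a callable returning either `{"type": "text", ...}` or `{"type": "tool_call", "name": ..., "arguments": "<json string>"}`, which is a simplification of real provider response formats.

```python
import json
from typing import Callable

# Hypothetical model interface: takes the message list, returns one step.
ModelFn = Callable[[list[dict]], dict]


def run_turn(model: ModelFn, tools: dict[str, Callable[..., str]],
             history: list[dict], user_msg: str, max_steps: int = 5) -> str:
    """Run one user turn: call the model, execute tool calls, stop at a text reply."""
    history.append({"role": "user", "content": user_msg})
    for _ in range(max_steps):
        out = model(history)
        if out["type"] == "text":
            history.append({"role": "assistant", "content": out["text"]})
            return out["text"]
        # Record the tool request, run the tool, and feed the result back.
        history.append({"role": "assistant", "tool_call": out})
        try:
            args = json.loads(out["arguments"])
            result = tools[out["name"]](**args)
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            result = f"tool error: {exc!r}"   # let the model see and recover from failures
        history.append({"role": "tool", "name": out["name"], "content": result})
    raise RuntimeError("model did not produce a final answer within max_steps")
```

The `max_steps` cap is a design choice worth keeping in any real implementation: without it, a model that keeps requesting tools never returns control to the user.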

40.2.3 Chat UI frameworks

A bot without a usable UI is a script. The chat-UI layer is what turns an agent into a product (or at least a testable prototype). The 2026 toolkit clusters into "Python-first prototyping" (Chainlit, Streamlit, Gradio) and "production embeddable widgets" (your own React + a streaming hook, or framework-specific kits).

40.2.4 Voice-agent runtimes

Voice-agent runtimes are the libraries that pair STT, LLM, and TTS providers into a coherent real-time conversation pipeline. They are the most operationally complex part of the conversational AI stack because every component (turn detection, barge-in handling, end-of-speech, prosody, audio jitter) is itself a research area.
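Barge-in is a useful example of why these runtimes exist. The sketch below uses only `asyncio` and simulates the idea: TTS playback runs as a task and is cancelled as soon as an event signals that the user has started speaking. The functions `speak` and `speak_with_barge_in` are hypothetical stand-ins, with a sleep in place of real audio output, and do not reflect the API of any runtime named in this section.

```python
import asyncio


async def speak(text: str, played: list[str], chunk_delay: float = 0.05) -> None:
    """Stand-in for TTS playback: 'plays' one word per simulated audio chunk."""
    for word in text.split():
        await asyncio.sleep(chunk_delay)
        played.append(word)


async def speak_with_barge_in(text: str, user_started_speaking: asyncio.Event,
                              played: list[str]) -> bool:
    """Play the reply, but stop as soon as the user starts talking.

    Returns True if playback finished, False if it was interrupted.
    """
    playback = asyncio.create_task(speak(text, played))
    interrupt = asyncio.create_task(user_started_speaking.wait())
    done, _ = await asyncio.wait({playback, interrupt},
                                 return_when=asyncio.FIRST_COMPLETED)
    if playback in done:
        interrupt.cancel()
        playback.result()        # re-raise any playback error
        return True
    playback.cancel()
    # Wait for the cancelled task to unwind without swallowing our own cancellation.
    await asyncio.gather(playback, return_exceptions=True)
    return False
```

In a real pipeline the event would be set by voice-activity detection on the incoming audio, and the words already played would determine how much of the reply is written to the conversation history.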

40.2.5 Message format and protocol libraries

A handful of utility libraries handle the message-format plumbing that every conversational system needs: rendering Markdown and code blocks in chat, parsing tool-call JSON, validating function-call schemas, streaming SSE events.
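To make the SSE part of that plumbing concrete, the sketch below parses a stream of text lines into events, following the basic server-sent events framing (`event:` and `data:` fields, comment lines starting with a colon, a blank line ending each event). The helper names `iter_sse_events` and `collect_text`, the `delta` event name, and the `{"text": ...}` payload are assumptions for illustration; real providers define their own event names and payloads.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event": ..., "data": ...} for each event in a stream of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                      # a blank line terminates the event
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue                        # comment, often used as keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]           # one optional leading space is dropped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def collect_text(events: Iterable[dict]) -> str:
    """Join the text of hypothetical 'delta' events carrying JSON like {"text": ...}."""
    return "".join(json.loads(e["data"])["text"] for e in events if e["event"] == "delta")
```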

40.2.6 A working stack

Real-World Scenario
A 2026 conversational AI reference stack

Who: A 2026 engineering team building a multi-channel chat-and-voice agent from scratch.

Situation: The product had to span web chat, voice, and a few API integrations, with shared memory and a single conversation log across channels.

Problem: The conversational AI library landscape (orchestration, memory, voice, structured output, provider abstraction) is large enough that ad-hoc choices produced incompatible pieces and fragile integrations.

Dilemma: Pick a single all-in-one framework (lock-in, limited per-layer optimality) or compose best-of-breed libraries (more wiring, but each layer remains swappable).

Decision: The team standardized on a composed, boring-but-correct reference stack rather than an all-in-one framework.

How: The stack was roughly: Anthropic Messages API or OpenAI Responses API for the LLM; LangGraph for orchestration where the conversation had structure (otherwise direct SDK calls); ConversationSummaryBufferMemory or Zep for memory; Vercel AI SDK for the web chat UI; LiveKit Agents or Pipecat for voice; Instructor for structured-output extraction; and LiteLLM as the abstraction layer for swapping models for cost or A/B testing.

Result: A multi-channel agent that shared memory across web and voice, with each layer swappable when a better option appeared and end-to-end instrumentation for cost and latency.

Lesson: None of the pieces in a 2026 conversational AI stack are novel; the wins are mostly in making the boring-but-correct composition work together and instrumented end-to-end, not in picking exotic libraries.

Library Shortcut: Canonical chat-with-memory loop
from anthropic import Anthropic
from zep_python.client import Zep

client = Anthropic()
zep = Zep(api_key="...")

def chat_turn(user_id: str, session_id: str, user_msg: str) -> str:
    # 1. Fetch memory context before storing the new message, so the new
    #    message does not also come back inside mem.messages
    mem = zep.memory.get(session_id)
    # 2. Build messages: summary + verbatim recent + new user msg
    messages = [
        {"role": "user", "content": f"Context: {mem.summary}\n\nUser profile: {mem.facts}"},
        *[{"role": m.role, "content": m.content} for m in mem.messages[-8:]],
        {"role": "user", "content": user_msg},
    ]
    # 3. Call the model
    resp = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                                  system="You are a helpful assistant.",
                                  messages=messages)
    reply = resp.content[0].text
    # 4. Persist both turns so memory stays in sync
    zep.memory.add(session_id, messages=[{"role": "user", "content": user_msg},
                                         {"role": "assistant", "content": reply}])
    return reply
Code Fragment 40.2.2a: The verbatim-window-plus-summary-plus-facts pattern is the production default for chat memory; Zep, LangChain's SummaryBufferMemory, and most "memory layer" libraries are variations on this shape.

40.2.7 Comparing the libraries

Table 40.2.1a: Conversational AI libraries by role and pick-when.
Library                 | Layer                   | Pick when                            | Avoid when
LangGraph               | Orchestration           | Conversation has structure or phases | Stateless single-turn use
LlamaIndex chat engines | Orchestration over docs | Chat-with-docs is the use case       | Generic multi-tool agent
OpenAI Assistants API   | Hosted orchestration    | Thin-client architecture             | Need direct history control
Zep / Mem0              | Memory                  | Multi-session, multi-agent memory    | Single-session ephemeral chat
Chainlit                | UI prototyping          | Internal demo, chat-only app         | Production end-user surface
Vercel AI SDK           | Production UI           | TypeScript / Next.js front-end       | Python-only stack
LiveKit Agents          | Voice runtime           | WebRTC voice in app or browser       | Phone calls / SIP / Twilio
Pipecat                 | Voice runtime           | Pipeline-explicit voice control      | Canonical happy-path voice
Vocode                  | Voice runtime           | Telephony first (Twilio, SIP)        | WebRTC-only browser channel
Instructor              | Structured output       | Typed slot-filling extraction        | Free-form prose responses
Warning: Beware framework gravity in 2026

Every framework in this space has at least one "we are now the way to do agents" story (LangChain, LlamaIndex, Pydantic AI, DSPy, OpenAI Agents SDK, Semantic Kernel, AutoGen). The right way to decide is to ignore the marketing and ask three questions: (1) does the framework's primary abstraction match the shape of your problem? (2) is your team comfortable in its host language and idioms? (3) does it ship with the integrations you need (channels, vector DBs, models)? Most teams that switch frameworks mid-project do not lose money on the framework itself; they lose it on the year of accumulated prompt engineering and evals that must be re-validated under the new framework's conventions.

Note
The OpenAI Agents SDK, AutoGen, and the agent-frameworks plural

2024-25 saw a Cambrian explosion of agent frameworks aimed at multi-agent conversations: OpenAI's Agents SDK (the 2025 evolution of Swarm), Microsoft AutoGen, CrewAI, and many others. These deserve separate treatment because their primary abstraction is "agent talking to agent" rather than "human talking to agent", which is the focus of this chapter. Chapter 29 covers agent frameworks; the libraries here are for human-facing conversational AI.

40.2.8 Memory architecture patterns in production

The library catalog above is only useful in the context of memory architecture choices that the libraries enable. The 2026 patterns that mature production teams converge on are listed below. They are not five independent choices: a typical companion-bot stack uses all five layered together, and the layering itself is the architecture. Read the list as a stack from "highest fidelity, shortest horizon" to "lowest fidelity, longest horizon".

40.2.9 Streaming and real-time token handling

Every production chatbot eventually needs streaming. The library landscape around streaming clusters into four concerns, each tied to a different layer of the stack: transport (SSE vs WebSocket), granularity (token vs chunk), structured-output streaming (tool calls), and lifecycle (cancellation). Skipping any one is a common cause of user-visible "the bot just freezes" bug reports.
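Two of those concerns, structured-output streaming and cancellation, can be sketched together. The example below assumes a hypothetical normalized chunk format (`{"type": "text", "delta": ...}` and `{"type": "tool_args", "id": ..., "name": ..., "delta": ...}`); real SDKs use their own event shapes, so an adapter would be needed in front of `accumulate_stream`, which is itself an illustrative name. The key point is that tool-call arguments arrive as JSON fragments and are parsed only once the stream completes.

```python
import json
from typing import Callable, Iterable


def accumulate_stream(chunks: Iterable[dict],
                      should_cancel: Callable[[], bool] = lambda: False) -> dict:
    """Collect streamed text and tool-call argument fragments into a final result."""
    text_parts = []
    tool_buffers = {}          # call id -> {"name": str | None, "args": [fragments]}
    cancelled = False
    for chunk in chunks:
        if should_cancel():    # e.g. the user pressed "stop" or closed the tab
            cancelled = True
            break
        if chunk["type"] == "text":
            text_parts.append(chunk["delta"])
        elif chunk["type"] == "tool_args":
            buf = tool_buffers.setdefault(chunk["id"], {"name": None, "args": []})
            if chunk.get("name"):
                buf["name"] = chunk["name"]
            buf["args"].append(chunk["delta"])
    tool_calls = []
    if not cancelled:
        # Partial JSON is not parseable; decode only after the stream has ended.
        for call_id, buf in tool_buffers.items():
            tool_calls.append({"id": call_id, "name": buf["name"],
                               "arguments": json.loads("".join(buf["args"]))})
    return {"text": "".join(text_parts), "tool_calls": tool_calls, "cancelled": cancelled}
```

On cancellation the sketch keeps the text received so far and discards incomplete tool calls, which is one reasonable policy; whether partial text should be persisted to history is an application decision.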

40.2.10 Evaluation and observability libraries

A conversational AI that ships without evaluation is a conversational AI that drifts unmonitored. The library layer here is shared with Part IX (Evaluation & Observability) but the chat-specific picks are worth flagging in this chapter:

40.2.11 Choosing among the orchestration frameworks

Chat orchestration framework selection
Figure 40.2.3: A decision aid for picking a chat orchestration framework, mapping team profile, hosting preference, and feature gravity (graphs, tools, observability) onto the leading 2026 options.
Fun Fact: The framework half-life is 18 months

Empirically, LangChain went through three primary abstractions (Chains, LCEL, LangGraph) in 24 months. LlamaIndex went through Index, QueryEngine, and Workflow in a similar span. The library half-life for "the recommended way to do X" is roughly 18 months in this space. The implication is that the code you write should stay as small a delta from direct SDK calls as possible: when (not if) the framework's recommended abstraction changes, you want a small migration, not a rewrite. Treat the framework as a thin and replaceable layer.
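One way to keep that layer thin is to have application code depend on a small interface of your own, with each SDK or framework call isolated in a single adapter. The sketch below is a minimal illustration; `ChatModel`, `EchoModel`, and `answer` are hypothetical names, and `EchoModel` stands in for an adapter that would wrap one real SDK call.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, system: str, messages: list[dict]) -> str: ...


class EchoModel:
    """Stand-in backend for tests; a real adapter would wrap a single SDK call here."""

    def complete(self, system: str, messages: list[dict]) -> str:
        return f"echo: {messages[-1]['content']}"


def answer(model: ChatModel, history: list[dict], user_msg: str) -> str:
    """Application logic: knows about messages, not about any framework."""
    messages = [*history, {"role": "user", "content": user_msg}]
    return model.complete(system="You are a helpful assistant.", messages=messages)
```

When the recommended abstraction changes, only the adapter class is rewritten; prompts, evals, and application logic written against `ChatModel` stay as they are.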

What's Next?

In the next section, Section 40.3: Datasets and Benchmarks, we build on the material covered here.

Further Reading
LangChain (2024). "Memory in LangChain: a deep dive." LangChain Documentation. python.langchain.com/docs/concepts/memory. Canonical reference for ConversationBufferMemory, SummaryMemory, KGMemory, and the trim-messages refactor in LangChain 0.2.
OpenAI (2024). "Assistants API Overview." OpenAI Platform Documentation. platform.openai.com/docs/assistants. Reference for the server-side Thread model and the file-search + code-interpreter primitives that distinguish Assistants from raw Chat Completions.
Anthropic (2024). "Building with the Messages API." Anthropic Documentation. docs.anthropic.com/en/api/messages. The stateless-history pattern that contrasts with OpenAI Assistants; canonical reference for client-owned conversation state.
Pipecat AI (2024). "Pipecat: Open source framework for voice and multimodal conversational AI." Pipecat Documentation. docs.pipecat.ai. Reference for the Frame-flowing voice pipeline model that defines real-time voice agents in 2024-26.
LiveKit (2024). "LiveKit Agents v1.0." LiveKit Documentation. docs.livekit.io/agents. Reference for the WebRTC-participant voice-agent pattern.
Zep AI (2024). "Zep: Long-term memory for AI assistants." Zep Documentation. help.getzep.com. Reference for the dedicated memory-service architecture, including the fact-extraction and per-session-per-user model.