Section 40.5: External Reading and Communities

"Jurafsky for the vocabulary, ACL for the research, the LiveKit Discord for the latency budgets, and Simon Willison for what shipped this morning. Voice agents have their own subreddit now."
Prompt, Multi-Channel Reading-List AI Agent

Big Picture

Conversational AI is unusually well-covered in writing because the topic sits at the intersection of NLP research (which has an active venue circuit), production engineering (which generates blog posts and conference talks), and consumer product design (which generates think-pieces and case studies). This section organizes the canon: foundational papers and pre-LLM textbooks that still teach the right vocabulary, academic venues that publish current research, vendor engineering blogs that document the production state of the art, community hubs where practitioners actually talk to each other, and the voice-agent subculture that emerged in 2024-25 as the realtime APIs shipped.

Prerequisites

This is an end-of-chapter reading list and assumes familiarity with the conversational-AI modules in Part VIII.

The single most important meta-skill for staying current on conversational AI is knowing where to look. The field moves fast enough that any list dates within a year, but the venues and communities that publish high-signal work are reasonably stable. The list below is organized by source type rather than topic so you can scan it for "where do I look when X happens?" rather than "what is the latest news on Y".

40.5.1 Foundational papers and textbooks

These are the long-cited papers and the pre-LLM textbooks that still teach the canonical vocabulary (dialogue acts, belief state, dialogue policy, response generation). A modern conversational AI practitioner can lose the field's history but not its taxonomy.

Speech and Language Processing, Jurafsky & Martin (Stanford, 3rd ed. online, ongoing 2008-2024) is the field's reference textbook, with Chapter 24 (Chatbots and Dialogue Systems) being the canonical introduction to dialogue acts, dialogue state tracking, frame-based dialogue, and conversational agent architecture. Even in 2026, when an LLM eats most of the explicit pipeline, the textbook's vocabulary remains the right one for talking precisely about conversational AI. Read Chapter 24 before reading anything else about dialogue systems; it is the field's grammar.
Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots, Michael McTear (Morgan & Claypool, 2020; 2nd edition 2022) is the most complete recent textbook focused specifically on dialogue systems, covering the full evolution from rule-based to retrieval-based to neural to LLM-based dialogue. McTear's prior 2002 book and his 2016 Springer book are the prior generation; the 2020-22 edition is the one to read now. Pick this for a single textbook reference on conversational AI architecture.
Schank & Abelson, "Scripts, Plans, Goals, and Understanding" (Lawrence Erlbaum, 1977) is the foundational cognitive-science work on how humans use scripts (stereotyped event sequences like "going to a restaurant") to understand conversation, which remains the conceptual frame for slot-and-frame dialogue systems nearly fifty years later. Read for the conceptual underpinning of TOD systems and to appreciate that the modern LLM-based stack still operationalizes the script idea (just implicitly in the model weights rather than explicitly in the code).
Searle, "Speech Acts" (Cambridge, 1969) is the philosophical foundation of dialogue act theory ("request", "inform", "commissive", "expressive"), which became the labeling vocabulary in DailyDialog and the whole DSTC series. Read for the foundational distinction, inherited from J. L. Austin, between locutionary, illocutionary, and perlocutionary acts (it is more useful for chatbot design than it sounds).
Chen, Liu, Yin & Tang, "A Survey on Dialogue Systems: Recent Advances and New Frontiers" (2017) is the standard pre-LLM survey of dialogue systems, useful as a snapshot of what the field looked like in the years before ChatGPT and as a roadmap for the techniques the LLMs absorbed. Read for the taxonomy of generative vs retrieval-based vs hybrid dialogue systems.
Roller, Dinan, Goyal, Ju, et al., "Recipes for Building an Open-Domain Chatbot" (Meta AI, 2020) is the BlenderBot 1 paper, the canonical reference for the recipe of blending persona, empathy, and knowledge in a single open chit-chat model. Read for the explicit decomposition of what makes a chat model engaging.
Shuster, Xu, Komeili, Ju, Smith, et al., "BlenderBot 3" (Meta AI, 2022) is the BlenderBot 3 paper and the canonical reference for the safety-and-deployment lessons from Meta's open-domain chatbot work. Read alongside the InstructGPT and Anthropic HH-RLHF papers as the immediate pre-ChatGPT generation of conversational alignment work.
Ouyang, Wu, Jiang, Almeida, et al., "Training Language Models to Follow Instructions with Human Feedback" (OpenAI, 2022) is the InstructGPT paper that defined the modern RLHF chat-tuning recipe. Read as the canonical reference for why chat models behave differently from base language models.
Bai, Kadavath, Kundu, Askell, et al., "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022) is the Constitutional AI paper that defined Anthropic's alternative chat-tuning recipe (RL from AI feedback rather than purely human preferences). Read alongside InstructGPT for the two dominant chat alignment recipes.

40.5.2 Academic venues for current research

The active conferences and workshops that publish state-of-the-art conversational AI research.

ACL (Association for Computational Linguistics, annual) is the field's main conference, with a strong dialogue track and the annual SIGDIAL workshop co-located. ACL Findings papers are useful filters for "good idea, maybe not main-track-novel" work. Track the ACL Anthology directly for new dialogue work.
EMNLP (Empirical Methods in Natural Language Processing, annual) is the other large annual NLP conference, comparable to ACL in quality, where many dialogue and conversational AI papers land, especially empirical / dataset / benchmark contributions. Worth scanning the proceedings every year.
DSTC (Dialogue State Tracking Challenge) is the long-running challenge series and workshop that defines the dialogue-systems competition schedule. Each year's task report is a useful snapshot of where the practical-systems community is investing.
NeurIPS (Neural Information Processing Systems, annual) publishes a steady stream of foundational LLM and RLHF papers that anchor chat-model research. Track the Datasets and Benchmarks Track specifically for new conversational eval datasets.
ICML (International Conference on Machine Learning, annual) publishes alignment, RLHF, and safety work directly relevant to chat-tuned models.
ACM CHI (Conference on Human Factors in Computing Systems, annual) is the HCI venue where conversational-UX research is published; useful counterweight to the model-focused ACL/EMNLP work. Track for "how do real users use chatbots" empirical studies.
Interspeech and ICASSP are the speech-and-audio venues where the latest STT, TTS, and speech-LM work is published. Track for voice-agent advances.
Workshops on conversational AI (at NeurIPS / EMNLP / ACL) are the dialogue-specific co-located workshops that often have the freshest pre-publication work. Search "conversational AI workshop" plus the year.

40.5.3 Vendor engineering blogs

The 2023-26 production-state-of-the-art is documented mostly in vendor engineering blogs rather than academic papers. Treat the blogs as the empirical literature for production conversational AI; their case studies are often the only public reference for techniques that work at scale.

Anthropic Engineering Blog publishes the canonical 2024-26 production-AI engineering references, including "Building effective agents" (2024), "Prompt engineering for Claude" series, and various conversational-design posts. Read everything they publish on conversational design and tool use; their posts have unusually high signal-to-noise.
OpenAI Blog publishes the Realtime API launch (Oct 2024), the Assistants and Responses API documentation, and the periodic model-release posts that establish the chat-model frontier. Skim the model-release posts for what changed; read the engineering posts for production-pattern documentation.
Google AI Blog and Google DeepMind Blog publish Gemini's release notes and the Project Astra / Gemini Live work. Skim for the Google-specific updates; the DeepMind blog has stronger research content than the marketing-heavy Google AI blog.
LiveKit Engineering Blog and Pipecat Blog are the canonical voice-agent engineering blogs, covering latency budgets, barge-in design, and turn-detection details that no academic paper documents. Read both if voice agents are your area.
LangChain Blog and LlamaIndex Blog are the orchestration-framework blogs documenting new primitives, integrations, and reference architectures. Filter for the substantial posts; both publish heavily and not every release is significant.
Hugging Face Blog is the canonical reference for open-weights chat-model releases and the techniques behind them. Read for new open chat models and for the alignment-handbook line of work (the open RLHF recipes).
Deepgram Blog and ElevenLabs Blog are voice-specific blogs documenting STT and TTS advances and the latency / quality tradeoffs in production voice systems.
Voiceflow Blog and Botpress Blog document conversation-design patterns and platform-specific best practices. The Voiceflow blog has unusually thoughtful conversation-design content because the platform's audience is designer-led.
Evidently AI Blog publishes regular eval-and-observability content for chatbots and LLM apps; useful when evaluating production chat systems is your topic.
Eugene Yan's blog is the long-running practitioner blog with high-quality production-LLM posts, including chat-specific ones on evaluation and prompt patterns.
Simon Willison's blog and TILs are the highest-volume daily filter for LLM news; he posts working code snippets that often pre-date the formal documentation. Useful for "what is new this week" filtering.

40.5.4 Community hubs and Q&A

Where practitioners actually talk to each other about chatbots, conversational design, and voice agents.

r/ChatGPT, r/LocalLLaMA, and r/ClaudeAI are the consumer-facing subreddits with surprisingly high signal for "what is going on with chat models this week". r/LocalLLaMA in particular is the best place to follow open-weights chat-model releases and self-hosting discussion.
r/dialogue_systems is smaller but more academic-focused; useful for tracking the dialogue-systems research community specifically.
r/MachineLearning hosts most paper-discussion threads and "show HN"-style releases; useful as a broad filter.
LangChain Discord, LlamaIndex Discord, and the framework-specific Discords are where users get help on integration-specific questions. Search before asking; many recurring questions have canonical answers pinned.
LiveKit Discord and Pipecat Discord are the voice-agent-builder community hubs in 2025-26. Voice agents have a tightly-knit specialist community that lives mostly on Discord because the topic is too niche for general AI fora.
Hugging Face Discord and HF Forums are the open-model community hubs; ask there for embedding, fine-tuning, and open chat-model questions.
X (Twitter) and Bluesky remain the broadcast layer for LLM news and quick takes. Curate aggressively (Andrej Karpathy, Sebastian Raschka, the Anthropic and OpenAI engineering accounts, the major open-model accounts) and ignore the rest. The signal-to-noise for general AI Twitter is poor; for specific curated lists it is excellent.
Hacker News is the de facto launch platform for new conversational-AI tools and platforms. Browse the daily front page for new releases; the discussion threads are often more informative than the launch posts.
AI Alignment Forum and LessWrong are the alignment-and-safety community hubs; read for the deeper conversations about chat-model safety, RLHF, and value alignment. Treat the broader LessWrong corpus with appropriate critical filtering but the AI-safety posts are often technically substantive.

40.5.5 The voice-agent subculture

Voice agents emerged as a tight, distinct community in 2024-25 once the realtime APIs and LiveKit / Pipecat frameworks made them widely buildable. The community publishes mostly in Discord, blog posts, and demo videos.

Daily.co Engineering Blog documents the WebRTC media-pipeline foundations behind Pipecat; useful when network and audio details matter (jitter buffers, codec choices, server-side mixing). Daily's open-source voice work feeds into Pipecat.
Kwindla Hultman Kramer (Daily.co founder) and David Zhao (LiveKit) are the prominent voice-agent practitioners on X; follow for voice-specific updates and demos. The voice-agent space is small enough that following the major framework authors covers most of what is published.
LiveKit YouTube and the Daily / Pipecat YouTube channels publish demo videos that are usually the easiest way to understand a voice agent's actual UX (latency, barge-in feel, voice quality), which text descriptions undersell.
The Voice Tech Podcast and similar specialty podcasts cover the broader voice-AI industry and are useful for "what is the state of the voice agent market" overview content.

40.5.6 Curated resource lists

DAIR.AI Prompt Engineering Guide is the most-cited free guide to prompt engineering, with sections on chat-specific prompting patterns.
Awesome Prompt Engineering is the curated GitHub list of prompt-engineering resources; broad but uneven, useful as a starting point.
Awesome-LLM is the broader LLM-resource list, with a dedicated section on chat applications.
Awesome Conversational AI (when maintained) collects dialogue-systems-specific resources; check the last-commit date before relying on it.

40.5.7 Recommended reading paths

Key Insight: If you are starting from zero

Read Chapter 24 of Jurafsky & Martin, then the InstructGPT paper (for "why chat-tuned models behave differently"), then the Anthropic engineering blog's "Building effective agents" post (for "how to design a 2026-vintage assistant"), then the LMSYS Chatbot Arena paper (for "how do we evaluate chat models"). That is roughly four hours of reading and it covers the conceptual base.

Real-World Scenario: If you are shipping a voice agent

Who: A 2026 founding engineer planning the first voice agent on a small product team.

Situation: Leadership has signed off on a voice-first product and the engineer needs to choose between the OpenAI Realtime API, LiveKit Agents, Pipecat, and a Daily.co-style stack.

Problem: Each option has different latency, cost, and lock-in profiles, and architecture choices made before feeling the actual latency budget tend to be wrong in expensive ways.

Dilemma: Read more docs and pick on paper, or build a throwaway prototype and pick on the basis of real measured latency and ergonomics.

Decision: Read the canonical docs, then build a small prototype before drawing any production architecture.

How: Read the OpenAI Realtime API documentation, the LiveKit Agents quickstart, the Pipecat overview, and Daily.co's "Voice AI & Voice Agents" guide (search Daily's blog for the most recent version); watch one or two demo videos to calibrate latency expectations; then build a 50-line LiveKit Agent or Pipecat pipeline.

Result: A working voice prototype that exposes the real latency budget and surfaces vendor ergonomics before any irreversible architecture commitments.

Lesson: Production voice architecture decisions only make sense once you have felt the latency budget firsthand, so build a 50-line prototype before any whiteboard session.
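
One way to make "feeling the latency budget" concrete during the prototype is to log per-stage timings for a turn and compare their sum against a response-time target. The sketch below is a minimal illustration for a cascaded pipeline. The stage names, the sample timings, the 800 ms target, and the helper names (StageTiming, budget_report) are all hypothetical placeholders, not measurements, vendor figures, or any framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTiming:
    name: str      # hypothetical stage label
    millis: float  # latency observed for one turn, in milliseconds


def budget_report(stages: list[StageTiming], target_ms: float) -> dict:
    """Sum per-stage latencies and compare the total against a target."""
    total = sum(s.millis for s in stages)
    slowest = max(stages, key=lambda s: s.millis)
    return {
        "total_ms": total,
        "headroom_ms": target_ms - total,
        "within_budget": total <= target_ms,
        "slowest_stage": slowest.name,
    }


# Placeholder numbers for illustration only; replace with prototype logs.
example_turn = [
    StageTiming("network_round_trip", 80.0),
    StageTiming("speech_to_text", 200.0),
    StageTiming("llm_first_token", 350.0),
    StageTiming("text_to_speech_first_audio", 150.0),
]

report = budget_report(example_turn, target_ms=800.0)
print(report)
```

Running the same report against each candidate stack, with timings taken from real prototype sessions, gives a like-for-like basis for the comparison the scenario describes.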

Note: If you are picking a chat model

Look at the LMSYS Arena leaderboard in your target language as the first filter, read the most recent model-release post from each shortlisted vendor, then run your own held-out eval (around 200 hand-graded conversations is usually enough for a first pass). The public benchmarks tell you "which models are in the top tier"; only your own eval tells you which one is right for your application.
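
To judge how much a hand-graded sample of this size can tell you, it may help to put an interval around the observed pass rate. The sketch below uses the Wilson score interval, which is a common choice for binomial proportions and an assumption here rather than something this section prescribes. The counts (160 acceptable out of 200) are hypothetical.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical result: 160 of 200 hand-graded conversations judged acceptable.
low, high = wilson_interval(160, 200)
print(f"observed pass rate 0.80, approximate 95% interval [{low:.3f}, {high:.3f}]")
```

At this sample size the interval is on the order of ten percentage points wide, so two shortlisted models whose pass rates differ by only a few points are not reliably separated; a larger sample or a paired comparison on the same conversations would be needed for that.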

Fun Fact: The ELIZA effect, 1966-2026

Joseph Weizenbaum's ELIZA (1966), a short program written in MAD-SLIP that pattern-matched on Rogerian therapist phrases ("How does that make you feel?"), produced the original observation that humans attribute understanding to systems that reliably reflect their input. Sixty years later the ELIZA effect is alive and well: a substantial fraction of Character.AI users report meaningful emotional relationships with their characters, and clinical research on AI companions takes the effect seriously. Read Weizenbaum's Computer Power and Human Reason (1976) for the original critique; the contemporary literature on parasocial AI relationships builds directly on it.

40.5.8 Newsletters and podcasts

The newsletter and podcast layer fills a specific need: medium-form weekly synthesis of what happened, with editorial filtering. The conversational AI corner of this:

Latent Space (Swyx and Alessio) is the practitioner-focused podcast and newsletter, with episodes on voice agents, agent frameworks, and chat model releases. Pick for medium-form interviews with the people building the tools.
Eugene Yan's newsletter publishes monthly synthesis of practitioner LLM work, with chat-specific entries.
Nathan Lambert's Interconnects is the long-form newsletter on alignment, RLHF, and chat-model recipes; useful for the deeper "why does chat tuning work the way it does" content.
The Algorithmic Bridge is a more popular-press-style newsletter covering the broader AI news with a chat-model focus.
Zvi Mowshowitz's "Don't Worry About the Vase" publishes a weekly LLM news roundup with extensive coverage of chat-model releases and safety incidents.
Jack Clark's Import AI is a long-running weekly AI-news newsletter; broader than conversational AI but consistently high signal.

40.5.9 Conversation design as a discipline

Conversation design (the craft of writing the words a bot says and the turn structure of the dialogue) is its own discipline distinct from prompt engineering or ML. The 2026 reading list:

Cathy Pearl, "Designing Voice User Interfaces" (O'Reilly, 2016) is the pre-LLM foundational book on voice UI design. The patterns (graceful repair, error handling, confirmation dialogs) are still load-bearing for voice agents in 2026 because they reflect the underlying speech-perception constraints that have not changed.
Erika Hall, "Conversational Design" (A Book Apart, 2018) is the conversational-UX equivalent for text interfaces. The principles (cooperative principle, Grice's maxims, turn-taking) anchor the conversation-design vocabulary.
The Botsociety blog and Voicebot.ai publish ongoing conversation-design case studies. Read for application-specific patterns.
Conversation Design Institute blog hosts a community of conversation designers; useful when the question is "how do we structure a multi-turn dialogue" rather than "what model should we use".

40.5.10 A weekly current-awareness routine

For practitioners who want a reasonable weekly habit, the recommended routine in 2026:

Monday: skim the LMSYS Arena leaderboard for any movement; check the latest entries on the Anthropic, OpenAI, and Google AI engineering blogs.
Wednesday: scan Hacker News and r/LocalLLaMA for new releases.
Friday: read one newsletter (Interconnects, Latent Space, or Don't Worry About the Vase) and one paper that caught attention that week.
Monthly: re-run your internal eval against the current top-3 chat models in your tier; check whether any new release outperforms your current pick.
Quarterly: read one academic survey (the latest from arXiv with "chat" or "dialogue" or "conversational AI" in the title) to catch up on research directions you may have missed.

This routine takes roughly 4-6 hours per week and is sufficient to keep current without drowning in noise.
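
The quarterly survey search can be scripted. The sketch below only builds a query URL for arXiv's public API and does not fetch anything. The endpoint and the parameter names (search_query, sortBy, sortOrder, max_results) follow arXiv's API documentation as best understood here and should be checked against the current manual; the function name survey_query_url is hypothetical.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def survey_query_url(topics: list[str], max_results: int = 10) -> str:
    """Build an arXiv API URL for newest papers titled 'survey' plus any topic."""
    # Match any of the topic phrases in the title field.
    topic_clause = " OR ".join(f'ti:"{t}"' for t in topics)
    search_query = f"ti:survey AND ({topic_clause})"
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",   # newest submissions first
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = survey_query_url(["chat", "dialogue", "conversational AI"])
print(url)
```

Opening the printed URL in a browser returns an Atom feed; title matching is crude, so expect to skim the results by hand rather than treat them as a ranked list.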

Fun Fact: The half-life of "must-read"

A useful sanity check: take any "must-read papers on LLMs" list from 2023 and look at it in 2026. About half the papers feel important still; about half feel like historical curiosities. Don't try to read everything as it happens; let the field's collective filtering do the first pass for you. Things that are genuinely important still get talked about six months later; things that are not, drop off. The "I read every arXiv abstract on Twitter" approach is a known time-sink for diminishing returns.

40.5.11 Research groups and labs to follow

Worth tracking the publication output of specific research groups that have been particularly productive on conversational AI in 2024-26:

Stanford NLP Group publishes regularly on dialogue, RLHF, and chat alignment. The DSPy work originated here.
UC Berkeley NLP / LMSYS hosts the Chatbot Arena, publishes MT-Bench, and is the source of much of the modern chat-evaluation methodology.
CMU LTI (Language Technologies Institute) has a long dialogue-systems tradition; many of the multi-hop QA and grounded dialogue datasets come from here.
University of Edinburgh and Heriot-Watt (Centre for Doctoral Training in Robotics) publish on spoken dialogue and voice agents; some of the foundational TOD work happened here.
UKP Lab (TU Darmstadt) produced sentence-transformers and BEIR; in conversational AI they publish on persona consistency and multi-modal dialogue.
Facebook AI Research (now Meta AI) remains the largest single source of open-domain chat models and datasets (PersonaChat, BlendedSkillTalk, BlenderBot, Llama).
Microsoft Research publishes on dialogue systems, RLHF, and the broader LLM stack; some Phi and Orca papers are chat-relevant.
Anthropic and OpenAI research teams publish less than they used to but their alignment papers (Constitutional AI, RLHF variants, model cards) are foundational reading.
Google DeepMind publishes on multimodal conversation, voice models (Gemini Live), and alignment.
Allen Institute for AI (AI2) publishes on open-source LLMs (OLMo, Tülu), instruction tuning, and conversational red-teaming.

40.5.12 Conferences to attend or stream

Prefer talks over papers? Five sources do most of the work:

AI Engineer Summit (annual, Latent Space) is the production-AI-engineering equivalent of an academic conference, with chat and voice tracks. The talks are recorded and posted to YouTube; useful for "what is shipping in production".
The Sequence AI newsletter's "papers we should have read" series filters academic conference output into practitioner-relevant lists.
NeurIPS / ICML / ACL workshops on conversational AI are the academic counterpart; YouTube usually has recordings within a few months.
RE-WORK Deep Learning Summit hosts industry-focused AI conferences with conversational AI tracks; useful for industry case studies.
Vendor conferences (Google Cloud Next, Microsoft Build, AWS re:Invent, OpenAI DevDay, Anthropic's annual events) are increasingly substantive on chat applications. The vendor bias is obvious but the implementation detail is real.

40.5.13 Reading lists by role

Reading priorities split sharply by role. Pick the one that fits your job:

For an ML engineer building chat models: InstructGPT, Constitutional AI, the Llama-3 technical report, the Chatbot Arena paper, the HarmBench paper, plus the most recent state-of-the-art chat-tuning paper. Read at least one new alignment paper per month.
For a product manager on a chat product: Erika Hall's "Conversational Design", Cathy Pearl's "Designing Voice User Interfaces", the LMSYS Arena leaderboard methodology, plus the vendor engineering blogs from your selected LLM provider. Skim the AI Engineer Summit talks on the chat track.
For a conversation designer: Erika Hall, Cathy Pearl, the Voiceflow blog, the Conversation Design Institute resources. The model and library churn is largely tangential; the conversation-design principles are stable.
For a voice-agent builder: the LiveKit Engineering Blog, the Pipecat documentation, Daily.co's voice AI guide, the OpenAI Realtime API documentation, the Moshi technical report. Subscribe to the LiveKit and Pipecat Discords.
For an alignment / safety researcher: Constitutional AI, the Anthropic Persuasion paper, HarmBench, the Anthropic and OpenAI safety reports, the Alignment Forum top posts. Track the AI safety conferences (NeurIPS Safety Workshop, ICLR alignment workshop).
For a researcher entering conversational AI from outside: Jurafsky & Martin Chapter 24, McTear's textbook, the Chen et al. 2017 survey, the InstructGPT paper, the Chatbot Arena paper. That gets the vocabulary; pick a recent ACL or EMNLP proceedings to find the current frontier.

40.5.14 The historical context that still matters

Conversational AI's history shapes its present in ways that are obvious once you see them. Three layers are worth knowing.

The symbolic era (1960s-1990s) built dialogue systems around explicit rules and grammars: ELIZA's keyword-substitution patterns, PARRY's affect-driven response selection, and the Schank-Abelson script frame. The era's empirical legacy is the vocabulary the field still uses (dialogue acts, slots, frames, discourse relations). Many practitioners in 2026 underestimate how much of the modern stack is still organized around these symbolic concepts even when the implementation is a neural network. Reading Weizenbaum's 1976 critique alongside the original ELIZA paper is the most efficient single hour for grasping what changes and what does not.

The statistical / neural era (1990s-2018, accelerating after 2014) brought dialogue state tracking, end-to-end sequence-to-sequence dialogue, the DSTC competition series, and the canonical Vinyals-Le 2015 paper that showed seq2seq could generate plausible conversational responses. The era's empirical legacy is the dialogue-state tracking literature (MultiWOZ, SGD, the DSTC schema) and the evaluation methodology (per-slot F1, joint goal accuracy, BLEU as response-quality proxy). Most of the TOD work in 2026 builds directly on this foundation.

The LLM era (2018-present, accelerating with ChatGPT in 2022) collapsed most of the explicit pipeline into a single model, replacing intent classification, slot extraction, dialogue policy, and response generation with a single instruction-tuned transformer. The era's empirical legacy is RLHF, Constitutional AI, the chat-quality benchmark stack (MT-Bench, Arena), and the modern voice-aware models. The intellectual surprise of this era was that scale plus alignment data was enough; the explicit symbolic structure the field had built was largely unnecessary at the application layer.

The synthesis 2026 conversational AI practitioners arrive at: use LLMs for understanding and generation, but keep the symbolic structure (explicit dialogue states, policy graphs, slot annotations) where audit and policy enforcement matter. The Rasa CALM design, the Dialogflow CX Generative Agents overlay, and many production deployments are explicit hybrids that combine all three eras' contributions.

Looking Back: where Part VIII landed conversational AI

Part VIII (Chapters 37-41) traced the full arc from the conversational-AI design vocabulary through to the tools-of-the-trade catalog you have just finished. Chapter 37 anchored the practitioner's design surface (system prompts, conversation memory, persona, refusal). Chapter 38 covered task-oriented dialogue and the slot-and-frame heritage that still shows through every CALM and Dialogflow CX deployment. Chapter 39 covered open-domain and character chat, including the persona-grounding and emotion-aware threads that lead into companion AI. Chapter 39 covered voice, realtime, and multimodal interfaces; the latency-budget arithmetic in Section 40.1.3 was inherited from there. Chapter 40 then layered the vendor and library catalog on top: platforms (40.1), libraries (40.2), datasets and benchmarks (40.3), models (40.4), and the reading-list canon (this section). The thread: conversational AI is a discipline with both a 60-year symbolic-systems heritage and a 4-year LLM-collapsed-everything frontier; the practitioner who knows both is unusual, and the catalogs above are organized to keep them within reach.

What's Next: Part IX takes evaluation seriously

Continue to Section 42.1: LLM Evaluation Fundamentals. Chapter 40 spent considerable energy on benchmarks (Section 40.3) and the LLM-as-judge bias mitigations (Section 40.3.8) precisely because evaluation is the next major topic. Part IX (Chapters 42-46) treats LLM evaluation as a discipline in its own right: foundations of metric design and statistical evaluation (Chapter 42), specialized evaluation methods including the Bradley-Terry estimators we touched in Section 40.3.3 and the length-controlled regressions of AlpacaEval (Chapter 43), production observability that turns the benchmark stack into live monitoring (Chapter 44), red-teaming and adversarial evaluation that extends HarmBench / JailbreakBench (Chapter 45), and the evaluation tools-of-the-trade catalog (Chapter 46). If Chapters 37-41 were "build a conversational system," Part IX is "know whether what you built is actually working." The two halves compose: every production conversational AI team in 2026 spends more engineering time on evaluation infrastructure than on the model itself.

40.5.15 Where the field is going (a 2026 read)

Four directions worth tracking:

Speech-to-speech models eating the cascaded pipeline: GPT-4o Realtime, Gemini Live, and the open Moshi are early entrants in what will likely be the dominant architecture for voice agents by 2027-2028. The interesting open questions are around emotion-aware response and barge-in.
Long-running memory becoming a first-class primitive: Zep, Mem0, and the broader memory-layer libraries are early. The space needs better evaluation methodology (what does "remembers correctly" mean in a 6-month-old conversation?) and better contradiction-handling.
Agentic conversation: chat models doing multi-step tool use with the user in the loop. The OpenAI Agents SDK, the AG-UI protocol, and the broader multi-agent conversation work are setting the foundations for this; the open questions are around when to defer to the user vs proceed autonomously.
Companion-and-character AI as its own product category: Character.AI, Inworld, and the various companion startups have shown there is a substantial market for non-assistant conversational AI. The research community is starting to take this seriously; expect more academic work on persona consistency, parasocial relationships, and the safety considerations specific to long-running companionship.

The right reading habit for staying current on these directions is to subscribe to one paper-tracking newsletter and one practitioner blog and let the field's collective filtering do the rest. Ultimately, as Figure 40.5.1 suggests, the most durable resource is not any single feed but the overlapping communities that keep teaching each other.

Figure 40.5.1: Staying current is a group activity. No single newsletter keeps pace with conversational AI; the high-signal knowledge lives in the overlap between the research, framework, voice-agent, and design communities catalogued in this section.

Further Reading

Jurafsky, D., & Martin, J. H. (2024). "Speech and Language Processing (3rd ed. draft), Chapter 24: Chatbots and Dialogue Systems." Stanford University. web.stanford.edu/~jurafsky/slp3. The canonical textbook chapter on conversational AI; the field's grammar.

McTear, M. (2022). "Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots, 2nd ed." Morgan & Claypool Publishers. link.springer.com/book/10.1007/978-3-031-02176-3. The most complete recent dialogue-systems textbook; the natural complement to the Jurafsky & Martin chapter.

Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155. The InstructGPT paper; canonical reference for modern chat-tuning recipes.

Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. arXiv:2212.08073. The Constitutional AI paper; the alternative chat-alignment recipe to InstructGPT-style RLHF.

Anthropic (2024). "Building effective agents." Anthropic Engineering Blog, December 2024. anthropic.com/research/building-effective-agents. The most-cited 2024 reference for production-AI design patterns including conversational ones.

Weizenbaum, J. (1976). "Computer Power and Human Reason: From Judgment to Calculation." W. H. Freeman. The original critique of the ELIZA effect and machine-mediated conversation; foundational reading for thinking about parasocial AI relationships.

Conversational AI History and Foundations

Weizenbaum, J. (1966). "ELIZA: A Computer Program For the Study of Natural Language Communication Between Man And Machine." Communications of the ACM, 9(1). dl.acm.org/doi/10.1145/365153.365168

Young, S., Gasic, M., Thomson, B., & Williams, J. D. (2013). "POMDP-Based Statistical Spoken Dialog Systems: A Review." Proceedings of the IEEE, 101(5). ieeexplore.ieee.org/document/6407655

Williams, J. D., & Young, S. (2007). "Partially observable Markov decision processes for spoken dialog systems." Computer Speech and Language, 21(2). sciencedirect.com/science/article/pii/S0885230806000283

Dialogue Datasets and Benchmarks

Budzianowski, P., Wen, T.-H., Tseng, B.-H., et al. (2018). "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." EMNLP 2018. arXiv:1810.00278

Zhang, S., Dinan, E., Urbanek, J., et al. (2018). "Personalizing Dialogue Agents: I have a dog, do you have pets too?" (PersonaChat). ACL 2018. arXiv:1801.07243

Rashkin, H., Smith, E. M., Li, M., & Boureau, Y.-L. (2019). "Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset" (EmpatheticDialogues). ACL 2019. arXiv:1811.00207

Dialogue and Chat Evaluation

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685

Bai, Y., Jones, A., Ndousse, K., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (HH-RLHF). Anthropic. arXiv:2204.05862

Mazeika, M., Phan, L., Yin, X., et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." ICML 2024. arXiv:2402.04249

Arena Rankings and Instruction Tuning

Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." ICML 2024 (LMSYS). arXiv:2403.04132
