External Reading and Communities

Section 40.5

"Jurafsky for the vocabulary, ACL for the research, the LiveKit Discord for the latency budgets, and Simon Willison for what shipped this morning. Voice agents have their own subreddit now."

PromptPrompt, Multi-Channel Reading-List AI Agent
Big Picture

Conversational AI is unusually well-covered in writing because the topic sits at the intersection of NLP research (which has an active venue circuit), production engineering (which generates blog posts and conference talks), and consumer product design (which generates think-pieces and case studies). This section organizes the canon: foundational papers and pre-LLM textbooks that still teach the right vocabulary, academic venues that publish current research, vendor engineering blogs that document the production state of the art, community hubs where practitioners actually talk to each other, and the voice-agent subculture that emerged in 2024-25 as the realtime APIs shipped.

Prerequisites

This is an end-of-chapter reading list and assumes familiarity with the conversational-AI modules in Part VIII.

The single most important meta-skill for staying current on conversational AI is knowing where to look. The field moves fast enough that any list dates within a year, but the venues and communities that publish high-signal work are reasonably stable. The list below is organized by source type rather than topic so you can scan it for "where do I look when X happens?" rather than "what is the latest news on Y".

40.5.1 Foundational papers and textbooks

These are the long-cited papers and the pre-LLM textbooks that still teach the canonical vocabulary (dialogue acts, belief state, dialogue policy, response generation). A modern conversational AI practitioner can afford to forget the field's history but not its taxonomy.

40.5.2 Academic venues for current research

The active conferences and workshops that publish state-of-the-art conversational AI research.

40.5.3 Vendor engineering blogs

The production state of the art for 2023-26 is documented mostly in vendor engineering blogs rather than academic papers. Treat the blogs as the empirical literature for production conversational AI; their case studies are often the only public reference for techniques that work at scale.

40.5.4 Community hubs and Q&A

Where practitioners actually talk to each other about chatbots, conversational design, and voice agents.

40.5.5 The voice-agent subculture

Voice agents emerged as a tight, distinct community in 2024-25 once the realtime APIs and LiveKit / Pipecat frameworks made them widely buildable. The community publishes mostly in Discord, blog posts, and demo videos.

40.5.6 Curated resource lists

Key Insight: If you are starting from zero

Read Chapter 24 of Jurafsky & Martin, then the InstructGPT paper (for "why chat-tuned models behave differently"), then the Anthropic engineering blog's "Building effective agents" post (for "how to design a 2026-vintage assistant"), then the LMSYS Chatbot Arena paper (for "how do we evaluate chat models"). That is roughly four hours of reading and it covers the conceptual base.

Real-World Scenario: If you are shipping a voice agent

Who: A 2026 founding engineer planning the first voice agent on a small product team.

Situation: Leadership has signed off on a voice-first product and the engineer needs to choose between the OpenAI Realtime API, LiveKit Agents, Pipecat, and a Daily.co-style stack.

Problem: Each option has different latency, cost, and lock-in profiles, and architecture choices made before feeling the actual latency budget tend to be wrong in expensive ways.

Dilemma: Read more docs and pick on paper, or build a throwaway prototype and pick on the basis of real measured latency and ergonomics.

Decision: Read the canonical docs, then build a small prototype before drawing any production architecture.

How: Read the OpenAI Realtime API documentation, the LiveKit Agents quickstart, the Pipecat overview, and Daily.co's "Voice AI & Voice Agents" guide (search Daily's blog for the most recent version); watch one or two demo videos to calibrate latency expectations; then build a 50-line LiveKit Agent or Pipecat pipeline.

Result: A working voice prototype that exposes the real latency budget and surfaces vendor ergonomics before any irreversible architecture commitments.

Lesson: Production voice architecture decisions only make sense once you have felt the latency budget firsthand, so build a 50-line prototype before any whiteboard session.
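To make "feeling the latency budget" concrete, a prototype can record how long each stage of a voice turn takes and compare the total against a target. The sketch below is vendor-neutral and purely illustrative: the stage names, the `time.sleep` stand-ins, and the 800 ms budget are all assumptions, not figures from LiveKit, Pipecat, or the OpenAI Realtime API. In a real prototype each stand-in would be replaced by a call into the stack under test.

```python
import statistics
import time
from typing import Callable, Dict, List

Stage = Callable[[], None]


def time_stage_ms(stage: Stage) -> float:
    """Wall-clock duration of one stage call, in milliseconds."""
    start = time.perf_counter()
    stage()
    return (time.perf_counter() - start) * 1000.0


def measure_turn(stages: Dict[str, Stage]) -> Dict[str, float]:
    """Run the stages in insertion order and record how long each took."""
    return {name: time_stage_ms(fn) for name, fn in stages.items()}


def summarize(turns: List[Dict[str, float]], budget_ms: float) -> Dict[str, object]:
    """Median latency per stage, the sum of those medians, and a budget check."""
    per_stage = {
        name: statistics.median(turn[name] for turn in turns)
        for name in turns[0]
    }
    total = sum(per_stage.values())
    return {
        "per_stage_ms": per_stage,
        "total_ms": total,
        "within_budget": total <= budget_ms,
    }


if __name__ == "__main__":
    # Hypothetical stand-ins: swap each lambda for a real call into your stack.
    stages = {
        "speech_to_text": lambda: time.sleep(0.02),
        "llm_first_token": lambda: time.sleep(0.05),
        "text_to_speech": lambda: time.sleep(0.03),
    }
    turns = [measure_turn(stages) for _ in range(5)]
    report = summarize(turns, budget_ms=800.0)  # 800 ms is an assumed target
    for name, ms in report["per_stage_ms"].items():
        print(f"{name:>16}: {ms:7.1f} ms")
    print(f"{'total':>16}: {report['total_ms']:7.1f} ms "
          f"(within budget: {report['within_budget']})")
```

Reporting medians per stage, rather than one end-to-end number, is a deliberate choice here: it shows which stage is eating the budget, which is usually the question that decides between stacks.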

Note: If you are picking a chat model

Use the LMSYS Arena leaderboard for your target language as the first filter, read the most recent model-release post from each shortlisted vendor, then run your own held-out eval (about 200 hand-graded conversations is usually enough). The public benchmarks tell you which models are in the top tier; only your own eval tells you which one is right for your application.
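One way to see what a 200-conversation eval can and cannot resolve is to put a confidence interval around the pass rate. The sketch below assumes each conversation is hand-graded as pass or fail (an assumption; the text does not prescribe a grading scheme) and uses the standard Wilson score interval. The function name `wilson_interval` is illustrative, not part of any library.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical result: 160 of 200 hand-graded conversations judged acceptable.
    low, high = wilson_interval(passes=160, n=200)
    print(f"pass rate 80.0%, 95% interval {low:.1%} to {high:.1%}")
```

With 160 passes out of 200, the interval runs from roughly 74% to 85%. That is tight enough to separate a clearly better model from a clearly worse one, but two models whose pass rates differ by only a few points will overlap, which is a signal to grade more conversations or to compare them on the same items.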

Fun Fact: The ELIZA effect, 1966-2026

Joseph Weizenbaum's ELIZA (1966), a short MAD-SLIP program that pattern-matched on Rogerian therapist phrases ("How does that make you feel?"), produced the original observation that humans attribute understanding to systems that reliably reflect their input. Sixty years later the ELIZA effect is alive and well: a substantial fraction of Character.AI users report meaningful emotional relationships with their characters, and clinical research on AI companions takes the effect seriously. Read Weizenbaum's Computer Power and Human Reason (1976) for the original critique; the contemporary literature on parasocial AI relationships builds directly on it.

40.5.8 Newsletters and podcasts

The newsletter and podcast layer fills a specific need: medium-form weekly synthesis of what happened, with editorial filtering. The conversational AI corner of this:

40.5.9 Conversation design as a discipline

Conversation design (the craft of writing the words a bot says and the turn structure of the dialogue) is its own discipline distinct from prompt engineering or ML. The 2026 reading list:

40.5.10 A weekly current-awareness routine

For practitioners who want a reasonable weekly habit, the recommended routine in 2026:

  1. Monday: skim the LMSYS Arena leaderboard for any movement; check the latest entries on the Anthropic, OpenAI, and Google AI engineering blogs.
  2. Wednesday: scan Hacker News and r/LocalLLaMA for new releases.
  3. Friday: read one newsletter (Interconnects, Latent Space, or Don't Worry About the Vase) and one paper that caught attention that week.
  4. Monthly: re-run your internal eval against the current top-3 chat models in your tier; check whether any new release outperforms your current pick.
  5. Quarterly: read one academic survey (the latest from arXiv with "chat" or "dialogue" or "conversational AI" in the title) to catch up on research directions you may have missed.

This routine takes roughly 4-6 hours per week and is sufficient to keep current without drowning in noise.

Fun Fact: The half-life of "must-read"

A useful sanity check: take any "must-read papers on LLMs" list from 2023 and look at it in 2026. About half the papers still feel important; about half feel like historical curiosities. Don't try to read everything as it happens; let the field's collective filtering do the first pass for you. Things that are genuinely important still get talked about six months later; things that are not drop off. The "I read every arXiv abstract on Twitter" approach is a known time sink with diminishing returns.

40.5.11 Research groups and labs to follow

It is worth tracking the publication output of specific research groups that have been particularly productive on conversational AI in 2024-26:

40.5.12 Conferences to attend or stream

Prefer talks over papers? Five venues do most of the work:

40.5.13 Reading lists by role

Reading priorities split sharply by role. Pick the one that fits your job:

40.5.14 The historical context that still matters

Conversational AI's history shapes its present in ways that are obvious once you see them. Three layers are worth knowing.

The symbolic era (1960s-1990s) built dialogue systems around explicit rules and grammars: ELIZA's keyword-substitution patterns, PARRY's affect-driven response selection, and the Schank-Abelson script frame. The era's empirical legacy is the vocabulary the field still uses (dialogue acts, slots, frames, discourse relations). Many practitioners in 2026 underestimate how much of the modern stack is still organized around these symbolic concepts even when the implementation is a neural network. Reading Weizenbaum's 1976 critique alongside the original ELIZA paper is the most efficient single hour for grasping what changes and what does not.

The statistical / neural era (1990s-2018, accelerating after 2014) brought dialogue state tracking, end-to-end sequence-to-sequence dialogue, the DSTC competition series, and the canonical Vinyals-Le 2015 paper that showed seq2seq could generate plausible conversational responses. The era's empirical legacy is the dialogue-state tracking literature (MultiWOZ, SGD, the DSTC schema) and the evaluation methodology (per-slot F1, joint goal accuracy, BLEU as a response-quality proxy). Most of the task-oriented dialogue (TOD) work in 2026 builds directly on this foundation.
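Two of the metrics named above are easy to confuse, so a small worked example may help. The sketch below treats a belief state as a dictionary of slot-value pairs per turn (an assumed representation; datasets such as MultiWOZ ship their own formats and scoring scripts). Joint goal accuracy counts a turn as correct only if every slot matches; the slot F1 here is a micro-average over slot-value pairs, which is one common reading of "per-slot F1" rather than the only one.

```python
from typing import Dict, List

BeliefState = Dict[str, str]


def joint_goal_accuracy(predicted: List[BeliefState], gold: List[BeliefState]) -> float:
    """Fraction of turns whose predicted belief state equals the gold state exactly."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


def slot_f1(predicted: List[BeliefState], gold: List[BeliefState]) -> float:
    """Micro-averaged F1 over (slot, value) pairs across all turns."""
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        p_pairs, g_pairs = set(p.items()), set(g.items())
        tp += len(p_pairs & g_pairs)   # pairs predicted and correct
        fp += len(p_pairs - g_pairs)   # pairs predicted but wrong or spurious
        fn += len(g_pairs - p_pairs)   # gold pairs the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical three-turn hotel dialogue; turn 2 gets one slot value wrong.
    gold = [
        {"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "4"},
        {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
    ]
    pred = [
        {"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "3"},
        {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
    ]
    print(f"joint goal accuracy: {joint_goal_accuracy(pred, gold):.3f}")
    print(f"slot F1:             {slot_f1(pred, gold):.3f}")
```

In this toy dialogue a single wrong value costs a full turn under joint goal accuracy (2 of 3 turns correct) but only one pair out of six under slot F1 (5/6), which is why the two numbers are usually reported together.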

The LLM era (2018-present, accelerating with ChatGPT in 2022) collapsed most of the explicit pipeline into a single model, replacing intent classification, slot extraction, dialogue policy, and response generation with a single instruction-tuned transformer. The era's empirical legacy is RLHF, Constitutional AI, the chat-quality benchmark stack (MT-Bench, Arena), and the modern voice-aware models. The intellectual surprise of this era was that scale plus alignment data was enough; the explicit symbolic structure the field had built was largely unnecessary at the application layer.

The synthesis that conversational AI practitioners arrive at in 2026: use LLMs for understanding and generation, but keep the symbolic structure (explicit dialogue states, policy graphs, slot annotations) where audit and policy enforcement matter. The Rasa CALM design, the Dialogflow CX Generative Agents overlay, and many production deployments are explicit hybrids that combine all three eras' contributions.
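The hybrid pattern can be sketched in a few lines. In the example below a stand-in function plays the role of the LLM (it only extracts slot values from the utterance), while the decision about what to do next is made by a small, explicit policy over an explicit dialogue state, with every decision written to an audit log. Everything here is hypothetical: the refund flow, the slot names, the 100.0 limit, and `stub_llm_extract` are illustrative, and this is not how Rasa CALM or Dialogflow CX are implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

REQUIRED_SLOTS = ("account_id", "amount")  # hypothetical refund flow
REFUND_LIMIT = 100.0                       # hypothetical policy threshold


@dataclass
class DialogueState:
    slots: Dict[str, str] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)


def stub_llm_extract(utterance: str) -> Dict[str, str]:
    """Stand-in for an LLM call that returns slot values found in the utterance."""
    found: Dict[str, str] = {}
    for token in utterance.split():
        if token.startswith("acct-"):
            found["account_id"] = token
        elif token.startswith("$"):
            found["amount"] = token[1:]
    return found


def next_action(state: DialogueState) -> str:
    """Symbolic policy: a fixed, inspectable decision over the explicit state."""
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return f"ask:{slot}"
    try:
        amount = float(state.slots["amount"])
    except ValueError:
        return "ask:amount"  # unparseable value, so ask again
    return "escalate_to_human" if amount > REFUND_LIMIT else "issue_refund"


def step(state: DialogueState, utterance: str,
         extract: Callable[[str], Dict[str, str]] = stub_llm_extract) -> str:
    """One turn: the model fills slots, the policy decides, the log records both."""
    state.slots.update(extract(utterance))
    action = next_action(state)
    state.audit_log.append(f"slots={sorted(state.slots.items())} action={action}")
    return action


if __name__ == "__main__":
    state = DialogueState()
    for text in ["I want a refund", "it is acct-991", "about $40"]:
        print(f"{text!r} -> {step(state, text)}")
    print("\n".join(state.audit_log))
```

The design point is the separation of concerns: the model can be swapped or can misread an utterance, but the refund limit lives in ordinary code that can be reviewed, tested, and logged independently of the model.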

Looking Back: where Part VIII landed conversational AI

Part VIII (Chapters 37-41) traced the full arc from the conversational-AI design vocabulary through to the tools-of-the-trade catalog you have just finished. Chapter 37 anchored the practitioner's design surface (system prompts, conversation memory, persona, refusal). Chapter 38 covered task-oriented dialogue and the slot-and-frame heritage that still shows through every CALM and Dialogflow CX deployment. Chapter 39 covered open-domain and character chat, including the persona-grounding and emotion-aware threads that lead into companion AI. Chapter 39 covered voice, realtime, and multimodal interfaces; the latency-budget arithmetic in Section 40.1.3 was inherited from there. Chapter 40 then layered the vendor and library catalog on top: platforms (40.1), libraries (40.2), datasets and benchmarks (40.3), models (40.4), and the reading-list canon (this section). The thread: conversational AI is a discipline with both a 60-year symbolic-systems heritage and a 4-year LLM-collapsed-everything frontier; the practitioner who knows both is unusual, and the catalogs above are organized to keep them within reach.

What's Next: Part IX takes evaluation seriously

Continue to Section 42.1: LLM Evaluation Fundamentals. Chapter 40 spent considerable energy on benchmarks (Section 40.3) and the LLM-as-judge bias mitigations (Section 40.3.8) precisely because evaluation is the next major topic. Part IX (Chapters 42-46) treats LLM evaluation as a discipline in its own right: foundations of metric design and statistical evaluation (Chapter 42), specialized evaluation methods including the Bradley-Terry estimators we touched in Section 40.3.3 and the length-controlled regressions of AlpacaEval (Chapter 43), production observability that turns the benchmark stack into live monitoring (Chapter 44), red-teaming and adversarial evaluation that extends HarmBench / JailbreakBench (Chapter 45), and the evaluation tools-of-the-trade catalog (Chapter 46). If Chapters 37-41 were "build a conversational system," Part IX is "know whether what you built is actually working." The two halves compose: every production conversational AI team in 2026 spends more engineering time on evaluation infrastructure than on the model itself.

40.5.15 Where the field is going (a 2026 read)

Four directions worth tracking:

The right reading habit for staying current on these directions is to subscribe to one paper-tracking newsletter and one practitioner blog and let the field's collective filtering do the rest. Ultimately, as Figure 40.5.1 suggests, the most durable resource is not any single feed but the overlapping communities that keep teaching each other.

[Illustration: a cozy cafe study table where a Reddit avatar reading r/Rag, an orange Vercel engineer pair-coding, a Voiceflow designer with sticky notes, and a Discord avatar all share one whiteboard labeled "today's RAG patterns".]
Figure 40.5.1: Staying current is a group activity. No single newsletter keeps pace with conversational AI; the high-signal knowledge lives in the overlap between the research, framework, voice-agent, and design communities catalogued in this section.

Further Reading
Jurafsky, D., & Martin, J. H. (2024). "Speech and Language Processing (3rd ed. draft), Chapter 24: Chatbots and Dialogue Systems." Stanford University. web.stanford.edu/~jurafsky/slp3. The canonical textbook chapter on conversational AI; the field's grammar.
McTear, M. (2022). "Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots, 2nd ed." Morgan & Claypool Publishers. link.springer.com/book/10.1007/978-3-031-02176-3. The most complete recent dialogue-systems textbook; the natural complement to the Jurafsky & Martin chapter.
Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155. The InstructGPT paper; canonical reference for modern chat-tuning recipes.
Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. arXiv:2212.08073. The Constitutional AI paper; the alternative chat-alignment recipe to InstructGPT-style RLHF.
Anthropic (2024). "Building effective agents." Anthropic Engineering Blog, December 2024. anthropic.com/research/building-effective-agents. The most-cited 2024 reference for production-AI design patterns including conversational ones.
Weizenbaum, J. (1976). "Computer Power and Human Reason: From Judgment to Calculation." W. H. Freeman. The original critique of the ELIZA effect and machine-mediated conversation; foundational reading for thinking about parasocial AI relationships.

Conversational AI History and Foundations

Weizenbaum, J. (1966). "ELIZA: A Computer Program For the Study of Natural Language Communication Between Man And Machine." Communications of the ACM, 9(1). dl.acm.org/doi/10.1145/365153.365168
Young, S., Gasic, M., Thomson, B., & Williams, J. D. (2013). "POMDP-Based Statistical Spoken Dialog Systems: A Review." Proceedings of the IEEE, 101(5). ieeexplore.ieee.org/document/6407655
Williams, J. D., & Young, S. (2007). "Partially observable Markov decision processes for spoken dialog systems." Computer Speech and Language, 21(2). sciencedirect.com/science/article/pii/S0885230806000283
Jurafsky, D., & Martin, J. H. (2024). "Speech and Language Processing," 3rd ed. draft, Ch. 24: Chatbots and Dialogue Systems. web.stanford.edu/~jurafsky/slp3

Dialogue Datasets and Benchmarks

Budzianowski, P., Wen, T.-H., Tseng, B.-H., et al. (2018). "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." EMNLP 2018. arXiv:1810.00278
Zhang, S., Dinan, E., Urbanek, J., et al. (2018). "Personalizing Dialogue Agents: I have a dog, do you have pets too?" (PersonaChat). ACL 2018. arXiv:1801.07243
Rashkin, H., Smith, E. M., Li, M., & Boureau, Y.-L. (2019). "Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset" (EmpatheticDialogues). ACL 2019. arXiv:1811.00207

Dialogue and Chat Evaluation

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685
Bai, Y., Jones, A., Ndousse, K., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (HH-RLHF). Anthropic. arXiv:2204.05862
Mazeika, M., Phan, L., Yin, X., et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." ICML 2024. arXiv:2402.04249

Arena Rankings and Instruction Tuning

Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." ICML 2024 (LMSYS). arXiv:2403.04132
Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT). NeurIPS 2022. arXiv:2203.02155
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. arXiv:2212.08073