Chapter 64: Workflow Orchestration & Durable Execution

Chapter opener illustration: Workflow Orchestration & Durable Execution.

"Orchestration is the difference between an LLM script and an LLM platform."
Deploy, Workflow-Watching AI Agent

Looking Back

Chapter 63 routed requests; this chapter routes workflows. It covers Temporal, Inngest, Airflow with LLM operators, and Prefect, along with durable execution, retries with backoff, and the workflow patterns that turn long-running LLM jobs into reliable async pipelines.

Big Picture

LLM-powered applications often span hours of work: chained tool calls, human review, retries, and long-running document processing. This chapter covers durable workflow engines (Temporal, AWS Step Functions, Airflow) and the patterns that make stateful agent workflows reliable.

Chapter Overview

Agent workflows that span minutes or hours will eventually fail partway through, and losing progress on a 20-step research pipeline is unacceptable. This chapter teaches durable execution in four parts: the workflow orchestration patterns that survive retries, partial failures, and provider outages; the canonical engines (Temporal, Restate, Inngest); the LLM-specific patterns (long-running agent loops, checkpointed retrievals, human-in-the-loop steps); and the trade-offs between durability, latency, and cost.

Workflow orchestration is what makes multi-step agents production-grade. This chapter surveys the runtimes and patterns that survive contact with reality.
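As a preview of the core idea, the sketch below shows one way durable execution can work: each completed step writes its result to a journal, so a restarted run replays recorded results instead of redoing the work. It is a toy illustration using only the Python standard library, not how Temporal, Restate, or Inngest are implemented. The names DurableRun, step, search, and flaky_summarize are hypothetical, and the "provider outage" is simulated.

```python
import json
import os
import tempfile
import time


class DurableRun:
    """Toy journal (hypothetical): completed step results are persisted to disk."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.journal = {}
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                self.journal = json.load(f)

    def step(self, name, fn, max_attempts=3, base_delay=0.01):
        # Replay: a step that already completed returns its recorded result.
        if name in self.journal:
            return self.journal[name]
        for attempt in range(1, max_attempts + 1):
            try:
                result = fn()
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                # Exponential backoff: base_delay, 2x, 4x, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
        self.journal[name] = result
        self._persist()
        return result

    def _persist(self):
        # Write to a temp file, then rename, so a crash cannot leave a torn journal.
        tmp = self.journal_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.journal, f)
        os.replace(tmp, self.journal_path)


calls = {"search": 0, "summarize": 0}


def search():
    calls["search"] += 1
    return ["doc-a", "doc-b"]


def flaky_summarize():
    # Simulated outage: the first two calls fail, the third succeeds.
    calls["summarize"] += 1
    if calls["summarize"] < 3:
        raise RuntimeError("simulated provider outage")
    return "summary of 2 docs"


def pipeline(run):
    run.step("search", search)
    return run.step("summarize", flaky_summarize, max_attempts=2)


journal_path = os.path.join(tempfile.mkdtemp(), "run-001.json")

first_run_failed = False
try:
    pipeline(DurableRun(journal_path))  # both summarize attempts fail
except RuntimeError:
    first_run_failed = True

# Simulated process restart: a fresh DurableRun reloads the journal from disk.
final = pipeline(DurableRun(journal_path))
```

On the second run, the search step is served from the journal and only the summarize step executes again. Production engines add much more (deterministic replay of workflow code, timers, distributed workers), which later sections cover.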

Note: Learning Objectives

Explain why long-running LLM workflows need durable execution rather than ad-hoc retry logic.
Compare Temporal, Restate, and Inngest as workflow engines for LLM applications.
Architect a checkpointed agent loop with retries and partial-failure recovery.
Apply human-in-the-loop checkpoints inside a durable workflow.
Trade off durability, latency, and cost for a target workflow.
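To make the human-in-the-loop objective concrete, here is a minimal sketch of a workflow that pauses at a review step and resumes when a reviewer's decision arrives. It is an assumption-laden illustration: ReviewWorkflow, advance, and signal_review are hypothetical names, the draft is a stand-in for an LLM call, and state lives in memory where a real engine would persist it durably.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReviewWorkflow:
    """Hypothetical workflow state; a real engine would persist this durably."""
    draft: Optional[str] = None
    approved: Optional[bool] = None
    status: str = "new"
    log: list = field(default_factory=list)

    def advance(self):
        """Run as far as possible, then return the status. Safe to call repeatedly."""
        if self.status in ("published", "rejected"):
            return self.status  # terminal states do nothing further
        if self.draft is None:
            self.draft = "DRAFT: quarterly summary"  # stand-in for an LLM call
            self.log.append("drafted")
        if self.approved is None:
            # Suspend here: no compute is consumed while waiting for a person.
            self.status = "waiting_for_human"
            return self.status
        self.status = "published" if self.approved else "rejected"
        self.log.append(self.status)
        return self.status

    def signal_review(self, approved):
        """Deliver the reviewer's decision and resume the workflow."""
        self.approved = approved
        return self.advance()


wf = ReviewWorkflow()
first = wf.advance()            # drafts, then suspends at the review step
second = wf.advance()           # calling again does not redo the draft
final = wf.signal_review(True)  # the reviewer approves; the workflow resumes
```

The design point is that the waiting step is ordinary recorded state rather than a blocked thread, so the review can take minutes or days without losing the completed draft.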

Sections in This Chapter

Prerequisites

Production engineering from Chapter 62
Agent foundations from Chapter 26
Familiarity with at least one workflow engine (Airflow, Temporal, Step Functions)

What's Next?

This chapter begins with Section 64.1: The Case for Durable Execution. Each section builds on the previous one, so we recommend reading them in order.

Further Reading

Durable Execution & Workflow Systems

Burckhardt, S., Gillum, C., Justo, D., Kallas, K., McMahon, C., Meiklejohn, C. S. (2021). "Durable Functions: Semantics for Stateful Serverless." OOPSLA 2021. ACM DL. The formal semantics of durable workflows that systems like Temporal and Restate implement, and the substrate for long-running LLM agent runs.

Sherstinsky, A. (2021). "Why Workflow Engines Matter and How to Choose One." arXiv preprint. arXiv:2104.04576. A taxonomy of workflow-engine designs (DAG, durable, event-driven) that orients the architectural choices for orchestrating multi-step LLM applications.

Orchestrating Agent Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., et al. (2024). "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." ICLR. arXiv:2310.03714. A declarative orchestration framework for LLM pipelines with prompt-compilation as a first-class operator, the closest LLM-native analogue of a workflow engine.

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., et al. (2023). "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv preprint. arXiv:2308.08155. The multi-agent orchestration pattern that makes durable, conversational state explicit, which workflow systems must support.