Before comparing individual tools, you need a mental map of the categories they belong to and the criteria that matter for selection. This section establishes six tool categories (orchestration, retrieval, agent frameworks, evaluation, observability, and serving & fine-tuning) and a structured decision framework for evaluating any tool against your project's constraints.
1. The Six Categories of LLM Tooling
The LLM tooling ecosystem can be organized into six functional categories. Each category addresses a distinct phase of the AI application lifecycle, from development through production operations. Some tools span multiple categories (LangChain touches orchestration and retrieval, for example), but understanding the primary categories helps you avoid both gaps and redundancy in your stack.
| Category | Purpose | Example Tools | Lifecycle Phase |
|---|---|---|---|
| Orchestration | Chain LLM calls, manage prompts, coordinate multi-step workflows | LangChain, LlamaIndex, Haystack, DSPy | Development |
| Retrieval | Index, embed, search, and retrieve documents for RAG pipelines | LlamaIndex, Haystack, Chroma, Pinecone, Weaviate | Development, Production |
| Agent Frameworks | Build autonomous or semi-autonomous agents with tool use and planning | LangGraph, CrewAI, AutoGen, OpenAI Agents SDK | Development |
| Evaluation | Test prompt quality, measure LLM output accuracy, run regression suites | promptfoo, DeepEval, RAGAS, Eleuther Eval Harness | Development, CI/CD |
| Observability | Trace calls, monitor costs, debug production issues, log interactions | LangSmith, Langfuse, Phoenix, W&B Weave | Production |
| Serving & Fine-Tuning | Deploy models for inference, customize models on domain data | vLLM, TGI, SGLang, Ollama, Unsloth, Axolotl | Production, Training |
2. Selection Criteria: What Matters When Choosing Tools
Selecting tools based solely on GitHub stars or Twitter hype leads to regret. A structured evaluation across multiple dimensions produces better outcomes. The following eight criteria apply universally across all six categories.
2.1 Maturity and Stability
Maturity encompasses version stability, API consistency, and backward compatibility. A tool at version 0.3 with weekly breaking changes imposes ongoing maintenance costs that a stable 1.x release does not. Check the changelog frequency: rapid iteration signals active development but may also signal an unstable API surface.
2.2 Community Size and Activity
Community size correlates with the availability of tutorials, Stack Overflow answers, third-party integrations, and bug reports. Useful metrics include GitHub stars, monthly PyPI downloads, Discord or Slack community size, and the number of open versus closed issues. A large community with a high issue-closure rate indicates both adoption and responsive maintenance.
2.3 Enterprise Support and Licensing
For production deployments, licensing and support matter. Open-source tools under permissive licenses (MIT, Apache 2.0) allow unrestricted commercial use. Some tools offer dual licensing: a free open-source tier and a paid enterprise tier with SLAs, SSO, and audit logging. Evaluate whether the open-source tier meets your compliance requirements or whether you need the commercial offering.
2.4 Integration Breadth
No tool operates in isolation. Evaluate how well a tool integrates with the rest of your stack: LLM providers (OpenAI, Anthropic, open-source models), vector databases, cloud platforms, CI/CD pipelines, and monitoring systems. A tool with narrow integrations forces you to write custom adapters, increasing your maintenance burden.
2.5 Documentation Quality
Documentation quality directly affects developer productivity. Look for comprehensive API references, working code examples (not just snippets), architecture guides, and migration guides for version upgrades. Poor documentation turns a powerful tool into a frustrating one.
2.6 Performance and Scalability
For production workloads, benchmark the tool under realistic conditions. Orchestration frameworks add latency overhead per call. Serving engines differ dramatically in throughput. Evaluation tools may struggle with large test suites. Profile before committing.
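One cheap way to make "profile before committing" concrete is to time a framework-style wrapped call against a direct call on the same stubbed model, isolating the orchestration overhead from network latency. This is an illustrative harness only: the stub and the "framework" path are stand-ins, not any real library's API.

```python
import time

def fake_llm_call(prompt: str) -> str:
    """Stub standing in for a real model call; swap in your provider here."""
    return f"response to: {prompt}"

def via_framework(prompt: str) -> str:
    """Stand-in for a framework-wrapped call (templating + trace bookkeeping)."""
    rendered = "System: be helpful\nUser: " + prompt   # prompt templating
    result = fake_llm_call(rendered)
    _ = {"prompt": rendered, "output": result}          # callback/trace overhead
    return result

def mean_latency(fn, prompt: str, runs: int = 1000) -> float:
    """Average wall-clock time per call over `runs` invocations."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(prompt)
    return (time.perf_counter() - start) / runs

direct = mean_latency(fake_llm_call, "hello")
wrapped = mean_latency(via_framework, "hello")
print(f"per-call overhead: {wrapped - direct:.2e}s")
```

With real clients the overhead numbers will be dominated by network time, so measure the wrapper layers separately, as above, before attributing latency to the framework.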
2.7 Learning Curve
Some tools prioritize power and flexibility at the cost of complexity; others optimize for simplicity at the cost of expressiveness. Match the tool's complexity to your team's experience level. A small team building a prototype benefits from a simple tool even if it lacks advanced features.
2.8 Vendor Lock-In Risk
Assess how difficult it would be to migrate away from a tool if you outgrow it or if the project is abandoned. Tools that use standard interfaces (OpenAI-compatible APIs, for example) reduce lock-in. Proprietary abstractions that wrap every LLM call in tool-specific classes increase it.
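A practical hedge against lock-in is to keep provider access behind one thin interface of your own, rather than letting tool-specific classes spread through the codebase. The sketch below is an assumption-laden illustration: the `ChatBackend` protocol and both backend classes are hypothetical names, not any library's API, though the pattern of pointing one client shape at OpenAI-compatible endpoints (cloud APIs, vLLM, Ollama) is real.

```python
from typing import Protocol

class ChatBackend(Protocol):
    """The only LLM surface the rest of the application is allowed to see."""
    def complete(self, system: str, user: str) -> str: ...

class OpenAICompatibleBackend:
    """Any OpenAI-compatible endpoint (cloud API, vLLM, Ollama) fits this seam."""
    def __init__(self, base_url: str, model: str):
        self.base_url, self.model = base_url, model

    def complete(self, system: str, user: str) -> str:
        # In real code: POST {base_url}/chat/completions with the two messages.
        raise NotImplementedError("wire up your HTTP client here")

class CannedBackend:
    """Test double: swapping it in proves nothing leaks past the interface."""
    def complete(self, system: str, user: str) -> str:
        return f"[canned] {user}"

def answer(backend: ChatBackend, question: str) -> str:
    """Application code depends on the protocol, never on a vendor class."""
    return backend.complete("You are a helpful assistant.", question)

print(answer(CannedBackend(), "ping"))
```

Migrating away from a provider then means writing one new backend class, not rewriting every call site.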
No single tool scores highest across all eight criteria. The goal is not to find the "best" tool but to find the tool whose strengths align with your project's priorities and whose weaknesses fall in areas you can tolerate. A startup building a prototype has different priorities (learning curve, speed to MVP) than an enterprise deploying to millions of users (stability, support, compliance).
3. The Tool Selection Decision Framework
The following framework guides you through tool selection in four steps. It applies to any of the six categories and can be completed in 30 to 60 minutes per category.
Step 1: Define Your Constraints
Before evaluating tools, list your hard constraints. These are non-negotiable requirements that immediately disqualify tools that fail to meet them. Common constraints include:
- Language requirement: Must support Python, TypeScript, or both
- Licensing: Must be Apache 2.0 or MIT (no AGPL, no proprietary-only)
- Self-hosted: Must run entirely on your infrastructure (no cloud dependencies)
- Model support: Must work with your chosen LLM provider(s)
- Compliance: Must meet SOC 2, HIPAA, or GDPR requirements
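Hard constraints are cheap to encode as a filter that runs before any scoring. A minimal sketch, assuming a hand-maintained dict of tool facts; the entries are deliberately generic placeholders, not audited claims about real projects:

```python
# Illustrative metadata -- verify licenses and features against each project.
candidates = {
    "tool-a": {"languages": {"python"}, "license": "MIT", "self_hosted": True},
    "tool-b": {"languages": {"python", "typescript"}, "license": "AGPL-3.0", "self_hosted": True},
    "tool-c": {"languages": {"typescript"}, "license": "Apache-2.0", "self_hosted": False},
}

ALLOWED_LICENSES = {"MIT", "Apache-2.0"}

def passes_constraints(facts: dict) -> bool:
    """Every hard constraint must hold; one failure disqualifies the tool."""
    return (
        "python" in facts["languages"]            # language requirement
        and facts["license"] in ALLOWED_LICENSES  # licensing
        and facts["self_hosted"]                  # must run on our infrastructure
    )

shortlist = [name for name, facts in candidates.items() if passes_constraints(facts)]
print(shortlist)  # only tools meeting every hard constraint survive
```

The point of doing this first is economy: tools eliminated here never need the full eight-criterion scoring in the next two steps.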
Step 2: Weight Your Criteria
Assign weights (1 to 5) to each of the eight selection criteria based on your project's priorities. A research prototype might weight learning curve at 5 and enterprise support at 1. A healthcare production system might weight compliance at 5 and community size at 2.
Step 3: Score Candidates
For each candidate tool, assign a score (1 to 5) on each criterion. Multiply scores by weights and sum to produce a weighted total. The following table illustrates this process for a hypothetical orchestration tool selection.
| Criterion | Weight | LangChain | LlamaIndex | Haystack | DSPy |
|---|---|---|---|---|---|
| Maturity | 4 | 4 (16) | 4 (16) | 5 (20) | 3 (12) |
| Community | 3 | 5 (15) | 4 (12) | 3 (9) | 3 (9) |
| Enterprise Support | 2 | 4 (8) | 4 (8) | 4 (8) | 1 (2) |
| Integration Breadth | 5 | 5 (25) | 4 (20) | 4 (20) | 2 (10) |
| Documentation | 4 | 4 (16) | 4 (16) | 4 (16) | 3 (12) |
| Performance | 3 | 3 (9) | 3 (9) | 4 (12) | 4 (12) |
| Learning Curve | 4 | 3 (12) | 3 (12) | 4 (16) | 2 (8) |
| Lock-In Risk | 3 | 3 (9) | 3 (9) | 4 (12) | 4 (12) |
| Weighted Total | | 110 | 102 | 113 | 77 |
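The arithmetic behind the table is a plain weighted sum, and encoding it once makes it trivial to re-run the comparison with your own weights. The sketch below reproduces the hypothetical scores from the table above:

```python
# Weights (1-5) reflecting project priorities, per Step 2.
weights = {
    "maturity": 4, "community": 3, "enterprise": 2, "integration": 5,
    "documentation": 4, "performance": 3, "learning_curve": 4, "lock_in": 3,
}

# Raw 1-5 scores per criterion, taken from the hypothetical table above.
scores = {
    "LangChain":  {"maturity": 4, "community": 5, "enterprise": 4, "integration": 5,
                   "documentation": 4, "performance": 3, "learning_curve": 3, "lock_in": 3},
    "LlamaIndex": {"maturity": 4, "community": 4, "enterprise": 4, "integration": 4,
                   "documentation": 4, "performance": 3, "learning_curve": 3, "lock_in": 3},
    "Haystack":   {"maturity": 5, "community": 3, "enterprise": 4, "integration": 4,
                   "documentation": 4, "performance": 4, "learning_curve": 4, "lock_in": 4},
    "DSPy":       {"maturity": 3, "community": 3, "enterprise": 1, "integration": 2,
                   "documentation": 3, "performance": 4, "learning_curve": 2, "lock_in": 4},
}

def weighted_total(tool_scores: dict) -> int:
    """Sum of score x weight across all eight criteria."""
    return sum(weights[c] * s for c, s in tool_scores.items())

for tool, s in sorted(scores.items(), key=lambda kv: -weighted_total(kv[1])):
    print(f"{tool:<11} {weighted_total(s)}")
# Haystack 113, LangChain 110, LlamaIndex 102, DSPy 77
```

Changing a single weight (say, dropping integration breadth from 5 to 2) can reorder the ranking, which is exactly why the weights must come from your priorities, not from a template.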
Step 4: Validate with a Spike
Numbers on a spreadsheet cannot capture every nuance. After scoring narrows the field to two or three finalists, build a small proof-of-concept (a "spike") with each. Implement the same representative use case in each tool and evaluate developer experience, debugging ease, and unexpected friction. This step typically takes one to two days per tool and prevents costly mid-project migrations.
The LLM tooling landscape evolves rapidly. Tools that scored poorly in 2024 may have improved dramatically by 2026. Revisit your evaluation annually, especially when starting new projects. The criteria and framework remain stable even as the tools themselves change.
4. Mapping Tools to Application Architectures
Different application architectures demand different tool combinations. The following table maps common LLM application patterns to recommended tool stacks. Each row represents a complete, coherent set of tools that work well together.
| Application Pattern | Orchestration | Agents | Evaluation | Observability | Serving |
|---|---|---|---|---|---|
| Simple chatbot | LangChain or direct API | None needed | promptfoo | Langfuse | Cloud API |
| RAG application | LlamaIndex | None needed | RAGAS + promptfoo | LangSmith or Langfuse | Cloud API or vLLM |
| Multi-agent system | LangChain | LangGraph or CrewAI | promptfoo + custom | LangSmith | vLLM or SGLang |
| Research prototype | DSPy | smolagents | Eleuther Eval Harness | W&B Weave | Ollama (local) |
| Enterprise production | Haystack or LangChain | Semantic Kernel or LangGraph | promptfoo + LangSmith | LangSmith + Datadog | vLLM + TGI |
| Fine-tuned domain model | LlamaIndex or Haystack | Optional | RAGAS + domain evals | Phoenix or Langfuse | Unsloth + vLLM |
5. The Build-vs-Adopt Spectrum
Not every layer of your stack needs a third-party tool. For some components, writing your own implementation is simpler and more maintainable than adopting a framework. The build-vs-adopt decision depends on the complexity of your requirements and the overhead the tool introduces.
As a general guideline: use tools for hard problems (serving optimization, distributed tracing, evaluation metrics) and consider building your own for simple problems (basic prompt chaining, straightforward API calls). If your "orchestration" is a single LLM call with a system prompt, LangChain adds complexity without proportional benefit. If your orchestration involves branching logic, retries, streaming, tool calls, and multi-model routing, a framework saves weeks of engineering.
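When the "orchestration" really is a single call, the build-your-own end of the spectrum can be very small. Here is a sketch of a plain wrapper with the two features simple applications usually need, a system prompt and exponential-backoff retries; the `send` callable is a placeholder for your provider's actual client, not a real API:

```python
import time
from typing import Callable

def chat(send: Callable[[list[dict]], str], user_msg: str,
         system: str = "You are a helpful assistant.",
         retries: int = 3, backoff: float = 1.0) -> str:
    """One LLM call with a system prompt and exponential-backoff retries."""
    messages = [{"role": "system", "content": system},
                {"role": "user", "content": user_msg}]
    for attempt in range(retries):
        try:
            return send(messages)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")

# Usage with a stub in place of a real provider client:
print(chat(lambda msgs: f"echo: {msgs[-1]['content']}", "hello"))
```

Twenty lines like these are easier to debug and upgrade than a framework dependency; the crossover point arrives when you find yourself reimplementing streaming, tool calling, or multi-model routing on top of them.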
A team building a customer support chatbot initially used raw OpenAI API calls with a simple Python wrapper. When they added RAG, they adopted LlamaIndex for retrieval. When they added multi-turn conversation with tool use, they added LangGraph for state management. When they needed to compare prompt variants across 500 test cases, they adopted promptfoo. Each tool was added at the point where the build-it-yourself approach became more expensive than the adoption cost. This incremental approach avoided both premature abstraction and reinventing solved problems.
6. Reading the Rest of This Appendix
The remaining sections apply the framework established here to specific tool categories. Section V.2 compares orchestration frameworks. Section V.3 covers agent frameworks. Section V.4 examines evaluation and observability tools. Section V.5 addresses serving and fine-tuning tools. Each section includes detailed comparison tables, decision guides, and concrete recommendations for different project types.
Use the decision framework from this section to interpret the comparisons that follow. Your project's weights will differ from the examples, so focus on the criteria scores rather than the final rankings. The best tool is the one that fits your specific context, not the one with the most GitHub stars.