Efficient Multi-Tool Orchestration and Tool Economy

Section 27.6

"The best tool call is the one you did not make, because the answer was already cached from last time."

FrontierFrontier, Cache Savvy AI Agent
Big Picture

Hand an agent 50 tools and you have a budget problem. Every tool call burns tokens (the schema, the request, the response), adds API latency, and can ring up direct charges (paid APIs, compute, database reads). A 50-tool agent that picks badly is slower, costlier, and flakier than one that picks well. This section treats tool use as a cost-optimization problem: cut spend, keep completion rate. Section 27.6a continues with production design patterns (tool tiers, speculative execution, composition, budget-constrained execution), the open research frontier, and a lab combining TransformerLens with DSPy.

Prerequisites

This section builds on the tool-use and function-calling patterns from Section 26.2 and Section 26.6. Familiarity with API design and basic cost optimization is assumed.

27.6.1 The Cost Anatomy of a Tool Call

Fun Fact

Every additional tool you wire into an agent inflates the system prompt by roughly the size of its JSON schema, and most production schemas weigh 200 to 800 tokens each. Teams routinely discover that their 32-tool agent is paying for a 20,000-token prompt on every call, with the model using exactly two of those tools 90% of the time.

Every tool call costs you on three axes: tokens, latency, dollars. A tool-economy architecture turns these cost categories into a control loop with four moving parts:

  1. Plan. A meta-planner LLM decomposes the user request into tool-shaped subgoals.
  2. Route. A tool router scores candidate tools (search, compute, paid APIs, specialist sub-agents) on cost, latency, and reliability against a tool registry.
  3. Budget. The router consults a budget tracker and a result cache before committing.
  4. Dispatch. An executor runs the cheapest viable plan, and a synthesizer returns the final response.
The tool-economy control loop: meta-planner emits subgoals, router scores tools against a registry while consulting a budget tracker and result cache, then the executor fans out the cheapest viable plan.
Figure 27.6.1: The tool-economy control loop: meta-planner emits subgoals, router scores tools against a registry while consulting a budget tracker and result cache, then the executor fans out the cheapest viable plan.

Token Costs

Tokens are the cost that hides in plain sight. Tool schemas ride along in the system prompt on every call. A typical schema with three parameters runs 100 to 300 tokens. Stack 50 of them and you spend 5,000 to 15,000 tokens before the user types a word. Each round-trip adds 200 to 2,000 more tokens (reasoning + structured call + response + interpretation).

Latency & Monetary Costs

Each tool call introduces a round-trip to an external service. For local tools (file reads, calculations), this adds milliseconds. For remote APIs, this adds hundreds of milliseconds to seconds. In a multi-step agent loop, sequential calls accumulate latency that can make the system feel unresponsive. Some tools have direct per-call charges (search APIs per query, database services per read, cloud functions per invocation). At scale, these costs compound rapidly.

We can formalize the total cost of a tool call as:

$$C_{\text{total}} = C_{\text{tokens}} + C_{\text{latency}} + C_{\text{api}}$$ $$C_{\text{tokens}} = (T_{\text{schema}} + T_{\text{request}} + T_{\text{response}} + T_{\text{interpretation}}) \times P_{\text{per\_token}}$$

where $T$ represents token counts and $P_{\text{per\_token}}$ is the price per token for the model being used.

27.6.2 Token-Efficient Tool Calling Patterns

Several engineering patterns reduce the token cost of tool use without sacrificing capability.

Dynamic Tool Loading

Rather than including all tool schemas in every request, dynamically select a subset of tools relevant to the current task. This can be done with a lightweight classifier that maps the user's query to a set of relevant tools, or by using semantic similarity between the query and tool descriptions.

from dataclasses import dataclass

@dataclass
class ToolDefinition:
    """A tool available to the agent."""
    name: str
    description: str
    schema: dict
    token_cost: int

class ToolRouter:
    """Pick top-k tools by keyword overlap; in production use embedding similarity."""
    def __init__(self, tools, max_tools=8, always_include=None):
        self.tools = {t.name: t for t in tools}
        self.max_tools, self.always_include = max_tools, set(always_include or [])
        self._kw = {t.name: set(t.description.lower().split()) for t in tools}

    def select_tools(self, query):
        qw = set(query.lower().split())
        selected = [self.tools[n] for n in self.always_include if n in self.tools]
        cands = sorted(
            ((len(qw & kw), n) for n, kw in self._kw.items()
             if n not in self.always_include and len(qw & kw) > 0),
            reverse=True,
        )
        for _, n in cands[:self.max_tools - len(selected)]:
            selected.append(self.tools[n])
        return selected
Code Fragment 27.6.1: Dynamic tool router. Reduces schema token overhead by 70-90% by selecting top-k relevant tools per query, with always-include tools (e.g., respond_to_user) bypassing the filter.

Compressed Tool Schemas & Batching

Tool descriptions are often verbose. Compressing descriptions (removing redundant words, using abbreviations, omitting obvious parameters) can reduce schema tokens by 30 to 50% without affecting the model's ability to select and use tools correctly. Frontier models are robust to compressed descriptions as long as core semantics are preserved. Tool call batching is the other big win: when an agent needs to make multiple independent tool calls, batching them into a single step (parallel tool calling) reduces the number of round-trips. Most frontier model APIs support parallel function calling.

27.6.3 Tool Caching Strategies

Many tool calls are repetitive within and across conversations. Caching tool results can eliminate redundant calls, reducing cost and latency.

Result Caching

The simplest caching strategy stores tool results keyed by tool name and arguments (SHA-256 hash). If the same tool is called with the same arguments within a configurable time window, the cached result is returned without making the actual call. This is effective for tools with deterministic or slowly-changing outputs.

Semantic Caching

For search-like tools, semantic caching stores results keyed by the semantic meaning of the query rather than the exact query string. If a user asks "What is the weather in Paris?" and then "Paris weather today?", semantic caching recognizes these as equivalent queries. Requires embedding the query and checking cosine similarity against cached queries with a configurable threshold.

import hashlib, json, time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    result: str
    timestamp: float
    ttl_seconds: float
    @property
    def is_expired(self):
        return (time.time() - self.timestamp) > self.ttl_seconds

class ToolCache:
    """Exact-match cache for tool results with TTL eviction."""
    def __init__(self, default_ttl=300.0):
        self.default_ttl, self._cache = default_ttl, {}
        self._stats = {"hits": 0, "misses": 0}

    def _key(self, name, args):
        return hashlib.sha256(f"{name}:{json.dumps(args, sort_keys=True)}".encode()).hexdigest()

    def get(self, name, args):
        entry = self._cache.get(self._key(name, args))
        if entry and not entry.is_expired:
            self._stats["hits"] += 1
            return entry.result
        self._stats["misses"] += 1
        return None

    def put(self, name, args, result, ttl=None):
        self._cache[self._key(name, args)] = CacheEntry(result, time.time(), ttl or self.default_ttl)

    @property
    def hit_rate(self):
        tot = self._stats["hits"] + self._stats["misses"]
        return self._stats["hits"] / tot if tot else 0.0
Code Fragment 27.6.2: Tool result caching with TTL eviction. Add a semantic-similarity layer for search-like queries by embedding the args and falling back to cosine similarity against cached embeddings.

27.6.4 Parallel Execution and Dependency Graphs

When an agent needs to make multiple tool calls, the optimal execution strategy depends on the dependency structure between calls. Independent calls can be executed in parallel; dependent calls must be sequential. A dependency graph for tool calls is a directed acyclic graph (DAG) where each node is a tool call and edges represent data dependencies. The optimal execution schedule is the longest path through the DAG (the critical path); all other paths can execute concurrently.

For example, consider a research agent that needs to: (1) search the web for "transformer alternatives 2025," (2) search the web for "state space models benchmarks," (3) read the top result from search 1, (4) read the top result from search 2, and (5) synthesize both readings. Calls 1 and 2 are independent (parallel); 3 and 4 depend on 1 and 2 respectively; 5 depends on both 3 and 4. The critical path is three steps, not five. Frontier model APIs increasingly support parallel function calling, where the model generates multiple tool calls in a single response.

27.6.5 Economic Models for Tool Use

Treating tool use as an economic decision means that each tool call should be evaluated against its expected benefit. For a given tool call, the expected net value is:

$$V_{\text{net}} = P(\text{useful}) \times \Delta Q - C_{\text{total}}$$

where $P(\text{useful})$ is the probability the tool returns information that improves the answer, $\Delta Q$ is the quality improvement, and $C_{\text{total}}$ is the total cost. A rational agent should make the call if and only if $V_{\text{net}} > 0$. When multiple tools could provide the needed information, prefer the cheapest tool that meets the quality threshold. If the agent needs current stock prices, a cached database lookup ($0.001, 10ms) is preferable to a web search ($0.01, 500ms), even if the web search might return slightly more recent data.

27.6.6 Benchmarking Tool Efficiency

Several benchmarks evaluate LLM tool use, though most focus on capability rather than efficiency. ToolBench (Qin et al., 2024) covers 16,000+ APIs but focuses on capability. API-Bank (Li et al., 2023) tests 73 APIs in a conversational setting and includes metrics for unnecessary API calls. T-Bench (Wang et al., 2024) emphasizes when to use tools versus internal knowledge and includes an efficiency score penalizing unnecessary calls. BFCL focuses on function-call accuracy without efficiency.

A comprehensive tool efficiency evaluation should include: task completion rate, tool calls per task, tokens per task, unnecessary call rate (calls that did not contribute to the final answer), parallel utilization (fraction of independent calls batched), and cache hit rate.

Key Insight

Tool efficiency is not just a cost optimization; it is a reliability optimization. Every unnecessary tool call is an opportunity for failure: the API might be slow, return an error, or return stale data. Reducing the number of tool calls reduces the surface area for failures. The most efficient agent is not the one that calls the most tools, but the one that calls the fewest tools necessary to complete the task reliably. This parallels the software engineering principle of minimizing external dependencies.

Key Takeaways
Self-Check
Q1: Why does sending all 50 tool schemas on every request usually dominate the token bill?
Show Answer

Each schema is 200 to 800 tokens, multiplied by every request. A 50-tool agent pays 10,000 to 40,000 schema tokens per call even when the model needs only one or two tools. Dynamic tool loading prunes the set to a handful of relevant tools, reducing schema tokens by 70 to 90%.

Q2: When does semantic caching help, and when does it hurt?
Show Answer

Helps when similar queries produce identical results (FAQ-style questions, definitions). Hurts when similar queries hide different arguments that drive different results (e.g., "status of order ORD-123" vs "status of order ORD-124" embed nearly identically). For order-status queries, use exact-match cache only.

Exercise 27.6.1: Tool Cost Analysis

An agent has 50 tools available. Each tool schema costs approximately 200 tokens. The agent processes 10,000 queries per day, and the LLM charges $3 per million input tokens.

  1. What is the daily cost of including all 50 tool schemas in every request?
  2. If a dynamic tool router reduces the average schema size to 6 tools per request, what are the daily savings?
  3. Under what conditions would the overhead of the routing step itself outweigh the savings?
Show Answer

1. 50 tools * 200 tokens = 10,000 schema tokens per request. 10,000 requests/day * 10,000 tokens = 100 million tokens/day. Cost: 100M * $3/1M = $300/day. 2. With routing: 6 * 200 = 1,200 tokens per request. 10,000 * 1,200 = 12 million tokens/day. Cost: $36/day. Savings: $264/day, or 88%. 3. If the router itself requires an LLM call (embedding + similarity), the routing cost must stay below the savings. At very low query volumes the engineering overhead of maintaining the router may not justify the token savings.

What Comes Next

This section continues in Section 27.6a: Tool Orchestration Patterns & Interpretability-Reasoning Lab, which covers the four production design patterns (tool tiers, speculative execution, tool composition, budget-constrained execution), the open research frontier (MCP, prompt injection through tool outputs), and a hands-on lab combining TransformerLens activation patching with DSPy structured reasoning.

Further Reading

Foundational Tool Use

Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023. The seminal paper showing that language models can learn when and how to call external tools through self-supervised training.
Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334. Fine-tunes an LLM to accurately generate API calls from documentation, dramatically reducing hallucination in tool invocations.

Tool Benchmarks

Qin, Y., Liang, S., Ye, Y., et al. (2024). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs." ICLR 2024. Constructs a dataset and framework for training models to use over 16,000 real-world APIs.
Li, M., Song, F., Yu, B., et al. (2023). "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." EMNLP 2023. A systematic benchmark for evaluating LLM tool use across planning, retrieval, and calling stages.