Part X: Frontiers
Chapter 34: Emerging Architectures & Scaling Frontiers

Efficient Multi-Tool Orchestration and Tool Economy

"The best tool call is the one you did not make, because the answer was already cached from last time."

— A cache-savvy AI agent
Big Picture

As LLM agents gain access to more tools, the economics of tool use become a first-class engineering concern. Every tool call consumes tokens (for the schema, the request, and the response), incurs API latency, and may have direct monetary cost (paid APIs, compute resources, database queries). An agent with 50 available tools that makes inefficient tool selections will be slower, more expensive, and less reliable than one with disciplined tool orchestration. This section treats tool use as an economic optimization problem, providing frameworks for minimizing cost while maximizing task completion.

Prerequisites

This section builds on the tool use and function calling patterns from Section 22.2, the multi-agent orchestration from Chapter 24, and the production engineering material from Chapter 31. Familiarity with API design and basic cost optimization is assumed.

1. The Cost Anatomy of a Tool Call

Every tool call in an LLM agent system incurs multiple types of cost, and if you have ever watched your API bill spike because an agent decided to call a search API 47 times to answer a single question, you already have an intuition for why this matters. Understanding this cost structure is essential for optimization.

Token Costs

The most significant and often overlooked cost is token consumption. When an agent has access to tools, the tool schemas (names, descriptions, parameter definitions) must be included in the system prompt or function definitions. For a typical tool with a description and three parameters, the schema consumes 100 to 300 tokens. An agent with 50 tools may spend 5,000 to 15,000 tokens on tool schemas alone, before any conversation begins.

Beyond the schema overhead, each tool call round-trip consumes tokens for: the model's reasoning about which tool to call and with what arguments; the structured tool call output (function name and JSON arguments); the tool's response (which is inserted back into the conversation); and the model's interpretation of the tool response. A single tool call round-trip typically costs 200 to 2,000 tokens depending on the tool and response size.

Latency Costs

Each tool call introduces a round-trip to an external service. For local tools (file reads, calculations), this adds milliseconds. For remote APIs (web search, database queries, third-party services), this adds hundreds of milliseconds to seconds. In a multi-step agent loop, sequential tool calls accumulate latency that can make the system feel unresponsive to users.

Monetary Costs

Some tools have direct per-call costs: search APIs charge per query, database services charge per read, and cloud functions charge per invocation. Even "free" tools have indirect costs through the token consumption they trigger. At scale, these costs compound rapidly.

We can formalize the total cost of a tool call as:

$$C_{\text{total}} = C_{\text{tokens}} + C_{\text{latency}} + C_{\text{api}}$$

$$C_{\text{tokens}} = (T_{\text{schema}} + T_{\text{request}} + T_{\text{response}} + T_{\text{interpretation}}) \times P_{\text{per\_token}}$$

where $T$ represents token counts and $P_{\text{per\_token}}$ is the price per token for the model being used.
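The formula above is easy to operationalize as a small estimator. The sketch below is a direct translation; the specific token counts, per-token price, and API fee in the example are illustrative assumptions for a hypothetical search tool, not measured values.

```python
# Sketch of the tool-call cost formula. All token counts, the per-token
# price, and the API fee below are illustrative assumptions.
def tool_call_cost(
    schema_tokens: int,
    request_tokens: int,
    response_tokens: int,
    interpretation_tokens: int,
    price_per_token: float,
    api_cost: float = 0.0,
) -> float:
    """Total cost in USD: C_tokens + C_api (latency is tracked separately)."""
    token_cost = (
        schema_tokens + request_tokens + response_tokens + interpretation_tokens
    ) * price_per_token
    return token_cost + api_cost


# Hypothetical web-search call: 250-token schema, $3 per million input tokens,
# plus a $0.005 per-query API fee
cost = tool_call_cost(
    schema_tokens=250, request_tokens=50, response_tokens=800,
    interpretation_tokens=150, price_per_token=3e-6, api_cost=0.005,
)
print(f"Estimated cost per call: ${cost:.4f}")
```

At scale the token term often dominates: here 1,250 tokens contribute $0.00375 of the total, and that cost recurs on every call that carries the schema.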

2. Token-Efficient Tool Calling Patterns

Several engineering patterns reduce the token cost of tool use without sacrificing capability.

Dynamic Tool Loading

Rather than including all tool schemas in every request, dynamically select a subset of tools relevant to the current task. This can be done with a lightweight classifier that maps the user's query to a set of relevant tools, or by using semantic similarity between the query and tool descriptions.

The following implementation demonstrates a tool router that selects relevant tools based on query similarity.

# Dynamic tool router: selects relevant tools per query using keyword overlap,
# reducing schema token overhead by 70-90% compared to sending all tool
# definitions with every request. Always-include tools bypass the filter.
from dataclasses import dataclass


@dataclass
class ToolDefinition:
    """A tool available to the agent."""
    name: str
    description: str
    schema: dict
    token_cost: int           # Approximate tokens for schema
    avg_response_tokens: int  # Average tokens in tool response
    avg_latency_ms: int       # Average response latency
    monetary_cost: float      # Cost per call in USD


class ToolRouter:
    """Selects relevant tools for each query to minimize schema overhead.

    Instead of sending all 50 tool schemas with every request (costing
    ~10K tokens), the router selects the top-k most relevant tools,
    typically reducing schema tokens by 70-90%.
    """
    def __init__(
        self,
        tools: list[ToolDefinition],
        max_tools: int = 8,
        always_include: list[str] | None = None,
    ):
        self.tools = {t.name: t for t in tools}
        self.max_tools = max_tools
        self.always_include = set(always_include or [])
        # In production, use embedding similarity; here we use keyword overlap
        self._tool_keywords: dict[str, set[str]] = {}
        for tool in tools:
            words = set(tool.description.lower().split())
            self._tool_keywords[tool.name] = words

    def select_tools(self, query: str) -> list[ToolDefinition]:
        """Select the most relevant tools for the given query."""
        query_words = set(query.lower().split())

        # Always-included tools (e.g., "respond_to_user")
        selected = [
            self.tools[name] for name in self.always_include
            if name in self.tools
        ]

        # Score remaining tools by keyword overlap
        candidates = []
        for name, keywords in self._tool_keywords.items():
            if name in self.always_include:
                continue
            overlap = len(query_words & keywords)
            if overlap > 0:
                candidates.append((overlap, name))

        candidates.sort(reverse=True)
        remaining_slots = self.max_tools - len(selected)
        for _, name in candidates[:remaining_slots]:
            selected.append(self.tools[name])

        return selected

    def estimate_schema_savings(self, query: str) -> dict:
        """Estimate token savings from dynamic tool loading."""
        selected = self.select_tools(query)
        all_schema_tokens = sum(t.token_cost for t in self.tools.values())
        selected_schema_tokens = sum(t.token_cost for t in selected)
        return {
            "total_tools": len(self.tools),
            "selected_tools": len(selected),
            "all_schema_tokens": all_schema_tokens,
            "selected_schema_tokens": selected_schema_tokens,
            "tokens_saved": all_schema_tokens - selected_schema_tokens,
            "savings_pct": (
                (all_schema_tokens - selected_schema_tokens) / all_schema_tokens * 100
                if all_schema_tokens > 0 else 0
            ),
        }
Code Fragment 34.9.1: Dynamic tool router that selects relevant tools per query using keyword overlap.

Compressed Tool Schemas

Tool descriptions are often verbose, written for human readability rather than token efficiency. Compressing descriptions (removing redundant words, using abbreviations, omitting obvious parameters) can reduce schema tokens by 30 to 50% without affecting the model's ability to select and use tools correctly. Several studies have shown that frontier models are robust to compressed tool descriptions as long as the core semantics are preserved.
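A hypothetical before-and-after illustrates the idea. The descriptions below are invented for this example, and the token counts use a rough words-to-tokens heuristic as a proxy; real savings should be measured with the target model's tokenizer.

```python
# Hypothetical example of compressing a verbose tool description.
# Word count times ~1.3 is used as a rough proxy for token count.
verbose = (
    "This tool allows you to search the web for information. You should "
    "provide a query string describing what you want to find, and the tool "
    "will return a list of relevant results from the internet."
)
compressed = "Web search. Args: query (str). Returns: ranked result list."


def approx_tokens(text: str) -> int:
    # Rough heuristic: ~1.3 tokens per word for English prose
    return int(len(text.split()) * 1.3)


saved = 1 - approx_tokens(compressed) / approx_tokens(verbose)
print(f"Approximate token savings: {saved:.0%}")
```

The compressed form keeps the core semantics (what the tool does, what it takes, what it returns) while dropping filler the model does not need.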

Tool Call Batching

When an agent needs to make multiple independent tool calls, batching them into a single step (parallel tool calling) reduces the number of round-trips. Most frontier model APIs support parallel function calling, where the model generates multiple tool calls in a single response. This reduces both latency (calls execute concurrently) and token overhead (the model reasons about all calls together rather than one at a time).
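On the orchestration side, concurrent execution of a batch of independent calls is straightforward with asyncio. The sketch below uses `asyncio.sleep` as a stand-in for real tool latency; `fake_tool` and `run_batch` are hypothetical names for illustration.

```python
# Sketch of executing a batch of independent tool calls concurrently.
# asyncio.sleep stands in for a network round-trip to a real tool.
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_batch(calls: list[tuple[str, float]]) -> list[str]:
    # All calls start immediately; total latency tracks the slowest call,
    # not the sum of all calls.
    return await asyncio.gather(*(fake_tool(n, d) for n, d in calls))


start = time.perf_counter()
results = asyncio.run(run_batch([
    ("search_a", 0.2), ("search_b", 0.2), ("lookup", 0.1),
]))
elapsed = time.perf_counter() - start
print(results, f"elapsed ~{elapsed:.2f}s")  # ~0.2s rather than ~0.5s
```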

3. Tool Caching Strategies

Many tool calls are repetitive within and across conversations. Caching tool results can eliminate redundant calls, reducing cost and latency.

Result Caching

The simplest caching strategy stores tool results keyed by the tool name and arguments. If the same tool is called with the same arguments within a configurable time window, the cached result is returned without making the actual call. This is effective for tools with deterministic or slowly-changing outputs (database lookups, API calls for static data, calculations).

Semantic Caching

For search-like tools, semantic caching stores results keyed by the semantic meaning of the query rather than the exact query string. If a user asks "What is the weather in Paris?" and then "Paris weather today?", semantic caching recognizes these as equivalent queries and returns the cached result for the second call. This requires embedding the query and checking cosine similarity against cached queries, with a configurable similarity threshold.

The implementation below demonstrates exact-match caching with TTL expiration; a semantic matching layer can be built on the same interface.

# Tool result caching with exact-match keys and TTL expiration. The exact
# cache uses SHA-256 hashes of (tool_name, args); a semantic similarity
# layer for near-duplicate queries can be composed with this class.
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """A cached tool result with expiration."""
    result: str
    timestamp: float
    ttl_seconds: float
    hit_count: int = 0

    @property
    def is_expired(self) -> bool:
        return (time.time() - self.timestamp) > self.ttl_seconds


class ToolCache:
    """Cache for tool call results with exact matching.

    Exact cache: hash(tool_name + json(args)) -> result
    A semantic layer (embedding similarity for search-like queries)
    can be layered on top.
    """
    def __init__(self, default_ttl: float = 300.0):
        self.default_ttl = default_ttl
        self.exact_cache: dict[str, CacheEntry] = {}
        self._stats = {"hits": 0, "misses": 0, "savings_usd": 0.0}

    def _make_key(self, tool_name: str, args: dict) -> str:
        """Create a deterministic cache key from tool name and arguments."""
        raw = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, tool_name: str, args: dict) -> str | None:
        """Check exact cache for a matching result."""
        key = self._make_key(tool_name, args)
        entry = self.exact_cache.get(key)
        if entry and not entry.is_expired:
            entry.hit_count += 1
            self._stats["hits"] += 1
            return entry.result
        if entry and entry.is_expired:
            del self.exact_cache[key]
        self._stats["misses"] += 1
        return None

    def put(
        self, tool_name: str, args: dict, result: str,
        ttl: float | None = None,
    ):
        """Store a tool result in the cache."""
        key = self._make_key(tool_name, args)
        self.exact_cache[key] = CacheEntry(
            result=result,
            timestamp=time.time(),
            ttl_seconds=ttl or self.default_ttl,
        )

    def evict_expired(self):
        """Remove all expired entries."""
        expired = [
            k for k, v in self.exact_cache.items() if v.is_expired
        ]
        for k in expired:
            del self.exact_cache[k]

    @property
    def hit_rate(self) -> float:
        total = self._stats["hits"] + self._stats["misses"]
        return self._stats["hits"] / total if total > 0 else 0.0


# Usage example
cache = ToolCache(default_ttl=600)  # 10-minute TTL

# First call: cache miss, execute tool
result = cache.get("web_search", {"query": "LLM memory architectures"})
if result is None:
    result = "MemGPT, Letta, and retrieval-augmented memory..."  # actual tool call
    cache.put("web_search", {"query": "LLM memory architectures"}, result)

# Second identical call: cache hit, no tool execution needed
cached = cache.get("web_search", {"query": "LLM memory architectures"})
print(f"Cache hit rate: {cache.hit_rate:.0%}")  # 50% (1 hit, 1 miss)
Code Fragment 34.9.2: Tool result caching with exact-match keys and TTL expiration.

4. Parallel Execution and Dependency Graphs

When an agent needs to make multiple tool calls, the optimal execution strategy depends on the dependency structure between calls. Independent calls can be executed in parallel; dependent calls must be sequential.

A dependency graph for tool calls is a directed acyclic graph (DAG) where each node is a tool call and edges represent data dependencies (the output of one call is needed as input to another). The optimal execution schedule is the longest path through the DAG (the critical path); all other paths can execute concurrently.

For example, consider a research agent that needs to: (1) search the web for "transformer alternatives 2025," (2) search the web for "state space models benchmarks," (3) read the top result from search 1, (4) read the top result from search 2, and (5) synthesize both readings into a summary. Calls 1 and 2 are independent (parallel). Calls 3 and 4 depend on 1 and 2 respectively. Call 5 depends on both 3 and 4. The critical path is three steps (search, read, synthesize), not five.

Frontier model APIs increasingly support expressing these dependency structures through parallel function calling. The model generates multiple tool calls in a single response, the orchestration layer executes them concurrently, and the results are returned together. This reduces the number of LLM round-trips from the number of tool calls to the depth of the dependency DAG.
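The research-agent example above can be sketched as a small level-by-level DAG executor: at each step, every call whose dependencies are satisfied runs concurrently. The dependency map and the toy `execute` coroutine are stand-ins for real orchestration.

```python
# Sketch of executing a tool-call DAG wave by wave: each wave contains every
# call whose dependencies are satisfied, and runs concurrently.
import asyncio


async def run_dag(deps: dict[str, set[str]], execute) -> list[list[str]]:
    """Return the waves of calls executed; calls within a wave run concurrently."""
    done: set[str] = set()
    waves: list[list[str]] = []
    while len(done) < len(deps):
        ready = [c for c, d in deps.items() if c not in done and d <= done]
        if not ready:
            raise ValueError("cycle detected in tool-call graph")
        await asyncio.gather(*(execute(c) for c in ready))
        done.update(ready)
        waves.append(ready)
    return waves


# The research-agent example: two searches, two reads, one synthesis
deps = {
    "search_1": set(), "search_2": set(),
    "read_1": {"search_1"}, "read_2": {"search_2"},
    "synthesize": {"read_1", "read_2"},
}


async def execute(call: str):
    await asyncio.sleep(0.01)   # stands in for the real tool call


waves = asyncio.run(run_dag(deps, execute))
print(waves)  # 3 waves: the critical path, not 5 sequential steps
```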

5. Economic Models for Tool Use

Treating tool use as an economic decision means that each tool call should be evaluated against its expected benefit. The decision to call a tool should consider: the probability that the tool will provide useful information, the cost of the call (tokens, latency, money), and the cost of not calling the tool (lower answer quality, task failure).

The Tool Call Value Equation

For a given tool call, the expected net value is:

$$V_{\text{net}} = P(\text{useful}) \times \Delta Q - C_{\text{total}}$$

where $P(\text{useful})$ is the probability the tool returns information that improves the answer, $\Delta Q$ is the quality improvement (measured in whatever metric is relevant: accuracy, completeness, user satisfaction), and $C_{\text{total}}$ is the total cost of the call. A rational agent should make the call if and only if $V_{\text{net}} > 0$.

In practice, $P(\text{useful})$ and $\Delta Q$ are difficult to estimate precisely, but rough estimates are often sufficient. For example, if a search tool returns useful results 70% of the time and each useful result improves answer quality by an estimated 20%, the expected quality improvement is $0.7 \times 0.2 = 0.14$. If the cost of the call is $0.01, the call is worthwhile as long as a 14% quality improvement is worth more than one cent, which it almost always is.
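The worked example above can be written out directly. To compare quality against cost in the same units, the sketch assumes an illustrative dollar value per unit of quality improvement; that conversion factor is an application-specific assumption.

```python
# The tool call value equation: V_net = P(useful) * delta_Q * value - C_total.
# The $1-per-quality-unit conversion below is an illustrative assumption.
def net_value(p_useful: float, delta_q: float, quality_value_usd: float,
              cost_usd: float) -> float:
    return p_useful * delta_q * quality_value_usd - cost_usd


# 70% chance of useful results, 20% quality gain when useful,
# quality worth $1 per unit, call costs $0.01
v = net_value(0.7, 0.2, 1.0, 0.01)
print(f"Expected net value: ${v:.3f}")  # positive -> make the call
```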

Cost-Aware Tool Selection

When multiple tools could provide the needed information, the agent should prefer the cheapest tool that meets the quality threshold. For instance, if the agent needs current stock prices, a cached database lookup ($0.001, 10ms) is preferable to a web search ($0.01, 500ms), even if the web search might return slightly more recent data. The cost difference is 10x and the latency difference is 50x, while the data quality difference is marginal.

This principle extends to choosing between tool use and pure reasoning. If the model can answer a factual question from its training data with 95% accuracy, and a tool call would increase accuracy to 99% at a cost of 500 tokens and 200ms, the decision depends on the application's accuracy requirements and cost sensitivity.

6. Benchmarking Tool Efficiency

Several benchmarks have been developed to evaluate LLM tool use, though most focus on capability (can the agent use the tool correctly?) rather than efficiency (does the agent use the tool economically?).

Existing Benchmarks

Benchmarks such as ToolBench, API-Bank, and the Berkeley Function Calling Leaderboard evaluate whether an agent selects appropriate tools and formats arguments correctly. They primarily reward capability; unnecessary or redundant calls are rarely penalized, so a high-scoring agent can still be economically wasteful in production.
Efficiency Metrics

A comprehensive tool efficiency evaluation should include:

  • Calls per task: the number of tool calls required to complete a task correctly
  • Tokens per task: schema, request, response, and interpretation tokens consumed
  • End-to-end latency: wall-clock time including all tool round-trips
  • Monetary cost: API charges plus token costs per completed task
  • Redundancy rate: the fraction of calls that were duplicated, cacheable, or unused in the final answer
Key Insight

Tool efficiency is not just a cost optimization; it is a reliability optimization. Every unnecessary tool call is an opportunity for failure: the API might be slow, return an error, or return stale data. Reducing the number of tool calls reduces the surface area for failures, making the agent more robust. The most efficient agent is not the one that calls the most tools, but the one that calls the fewest tools necessary to complete the task reliably. This parallels the software engineering principle of minimizing external dependencies.

7. Design Patterns for Production Tool Orchestration

Several patterns have emerged for managing tools at scale in production systems.

Pattern 1: Tool Tiers

Organize tools into tiers based on cost and reliability. Tier 1 (local, fast, free): calculations, string manipulation, cached data. Tier 2 (moderate cost, moderate latency): database queries, internal APIs. Tier 3 (expensive, slow): web search, third-party APIs, compute-intensive operations. The agent should prefer lower tiers when possible, escalating to higher tiers only when lower-tier tools are insufficient.

Pattern 2: Speculative Execution

When the model is likely to need a tool result but has not yet decided, speculatively execute the tool call in parallel with the model's reasoning. If the model ends up needing the result, it is already available (zero additional latency). If the model does not need it, the speculative call is discarded. This pattern trades compute cost for latency reduction and is most effective when tool calls are cheap but slow.
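The pattern can be sketched with asyncio tasks: start the likely-needed call early, then await it only if the model's reasoning asks for it. The coroutine names and sleep durations below are stand-ins for a real tool and real LLM latency.

```python
# Sketch of speculative execution: kick off a likely-needed tool call while
# the model is still reasoning; await or cancel it depending on the outcome.
import asyncio


async def slow_tool() -> str:
    await asyncio.sleep(0.2)             # stands in for a slow API call
    return "tool result"


async def model_reasoning() -> bool:
    await asyncio.sleep(0.1)             # stands in for LLM thinking time
    return True                           # the model decides it needs the tool


async def main() -> str:
    speculative = asyncio.create_task(slow_tool())   # start early
    needs_tool = await model_reasoning()
    if needs_tool:
        return await speculative          # already partly done: less waiting
    speculative.cancel()                  # not needed: discard the call
    return "answered without the tool"


result = asyncio.run(main())
print(result)
```

Because the tool call overlaps the model's reasoning, the user waits roughly max(reasoning, tool) instead of their sum.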

Pattern 3: Tool Composition

Rather than exposing many fine-grained tools, compose them into higher-level tools that perform common multi-step operations in a single call. For example, instead of exposing separate "search," "fetch_page," and "extract_text" tools, expose a single "research" tool that performs all three steps internally. This reduces the number of LLM round-trips and eliminates the token overhead of intermediate tool calls.
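The composition idea reduces to wrapping the fine-grained steps behind one function that the agent sees as a single tool. The three inner functions below are toy stand-ins for real search, fetch, and extraction implementations.

```python
# Sketch of tool composition: three fine-grained steps exposed to the agent
# as one "research" tool. The inner functions are toy stand-ins.
def search(query: str) -> str:
    return "https://example.com/top-result"   # placeholder top hit


def fetch_page(url: str) -> str:
    return f"<html><body>Content of {url}</body></html>"


def extract_text(html: str) -> str:
    return html.split("<body>")[1].split("</body>")[0]


def research(query: str) -> str:
    """One tool call from the agent's perspective; three steps internally."""
    return extract_text(fetch_page(search(query)))


summary = research("transformer alternatives 2025")
print(summary)
```

With the composed tool, the intermediate URL and raw HTML never enter the conversation, so their tokens are never billed to the context window.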

Pattern 4: Budget-Constrained Execution

Set explicit budgets for tool use per task: maximum number of calls, maximum tokens, maximum latency, maximum monetary cost. The agent must operate within these constraints, prioritizing the most valuable calls when the budget is tight. This prevents runaway costs from agent loops that make excessive tool calls.
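A budget tracker can enforce these limits as a guard around every call. The specific limits in the sketch are arbitrary assumptions; a real system would load them from configuration and track latency as well.

```python
# Sketch of per-task budget enforcement for tool calls. The limits below
# are illustrative assumptions.
class BudgetExceeded(Exception):
    pass


class ToolBudget:
    def __init__(self, max_calls: int, max_tokens: int, max_usd: float):
        self.max_calls, self.max_tokens, self.max_usd = max_calls, max_tokens, max_usd
        self.calls = self.tokens = 0
        self.usd = 0.0

    def charge(self, tokens: int, usd: float):
        self.calls += 1
        self.tokens += tokens
        self.usd += usd
        if (self.calls > self.max_calls or self.tokens > self.max_tokens
                or self.usd > self.max_usd):
            raise BudgetExceeded(
                f"budget exhausted after {self.calls} calls "
                f"({self.tokens} tokens, ${self.usd:.2f})"
            )


budget = ToolBudget(max_calls=3, max_tokens=5_000, max_usd=0.05)
exhausted = False
for _ in range(5):                        # a runaway loop of tool calls
    try:
        budget.charge(tokens=800, usd=0.01)
    except BudgetExceeded:
        exhausted = True
        break
print(f"stopped after {budget.calls} calls, exhausted={exhausted}")
```

The exception gives the orchestration layer a clean interception point: it can return a best-effort answer, ask the user for permission to continue, or escalate to a human.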

Tip

Instrument your agent's tool calls from day one. Log every tool call with its name, arguments, response size, latency, and whether the result was used in the final answer. This telemetry data is essential for identifying optimization opportunities: which tools are called most frequently? Which calls are cacheable? Which calls are made unnecessarily? The monitoring patterns from Section 29.3 apply directly to tool call observability.

8. Future Directions

The field of tool-augmented LLMs is evolving rapidly. Directions likely to shape the next generation of tool orchestration systems include models trained to reason explicitly about tool cost and budgets, standardized tool protocols that make schemas portable across agents, orchestration layers that learn routing and caching policies from their own telemetry, and API-native support for parallel and speculative execution.

Lab: Explore Reasoning and Interpretability

Advanced 90 min

Objective

Use TransformerLens to inspect attention patterns inside GPT-2, apply activation patching to identify computational circuits responsible for specific behaviors, and then use DSPy to build a structured reasoning pipeline. By the end, you will have a working toolkit that connects mechanistic interpretability with programmatic reasoning, reinforcing the Right Tool pattern: choose the tool that fits the analysis task rather than forcing a single framework to do everything.

Skills Practiced

  • Loading and probing transformer internals with TransformerLens
  • Visualizing attention heads and identifying induction circuits
  • Performing activation patching to localize model behavior to specific layers and heads
  • Building structured reasoning chains with DSPy signatures and modules
  • Connecting interpretability findings to reasoning pipeline design

Prerequisites

  • Familiarity with transformer architecture from Chapter 5
  • Python environment with GPU access (Colab free tier is sufficient for GPT-2)
  • Basic understanding of interpretability concepts from Chapter 18

Steps

  1. Step 1: Install libraries and load GPT-2 via TransformerLens

    Install the required packages and load GPT-2-small into a HookedTransformer. This wrapper gives you direct access to every activation, residual stream, and attention pattern inside the model.

    ## Step 1 : Environment setup and model loading
    # Install: pip install transformer-lens dspy-ai matplotlib numpy torch
    
    import transformer_lens
    from transformer_lens import HookedTransformer
    import torch
    
    # Load GPT-2-small with TransformerLens hooks
    model = HookedTransformer.from_pretrained("gpt2-small")
    print(f"Model: {model.cfg.model_name}")
    print(f"Layers: {model.cfg.n_layers}, Heads: {model.cfg.n_heads}")
    print(f"d_model: {model.cfg.d_model}, d_head: {model.cfg.d_head}")
    
    Model: gpt2-small
    Layers: 12, Heads: 12
    d_model: 768, d_head: 64
    Code Fragment 34.9.3: Step 1 : Environment setup and model loading
  2. Step 2: Visualize attention patterns on a sample prompt

    Run a forward pass with caching enabled, extract the attention weights, and plot a heatmap for selected heads. Look for heads that attend to previous token positions (possible induction heads) versus heads that attend to the first token or punctuation.

    ## Step 2 : Extract and visualize attention patterns
    import matplotlib.pyplot as plt
    import numpy as np
    
    prompt = "When Mary and John went to the store, John gave a drink to"
    tokens = model.to_tokens(prompt)
    token_strs = model.to_str_tokens(prompt)
    
    # Run forward pass and cache all intermediate activations
    logits, cache = model.run_with_cache(tokens)
    
    # Extract attention pattern for layer 5, head 1
    attn_pattern = cache["pattern", 5][0, 1].detach().cpu().numpy()
    
    fig, ax = plt.subplots(figsize=(10, 8))
    im = ax.imshow(attn_pattern, cmap="Blues")
    ax.set_xticks(range(len(token_strs)))
    ax.set_yticks(range(len(token_strs)))
    ax.set_xticklabels(token_strs, rotation=90, fontsize=8)
    ax.set_yticklabels(token_strs, fontsize=8)
    ax.set_title("Attention Pattern: Layer 5, Head 1")
    ax.set_xlabel("Source (key)")
    ax.set_ylabel("Destination (query)")
    plt.colorbar(im)
    plt.tight_layout()
    plt.savefig("attention_heatmap_L5H1.png", dpi=150)
    plt.show()
    
    # Scan all heads for strong "previous token" attention
    print("\nHeads with strong previous-token attention:")
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            pattern = cache["pattern", layer][0, head].detach().cpu().numpy()
            # Measure average attention to the immediately preceding token
            prev_attn = np.mean([
                pattern[i, i - 1] for i in range(1, len(token_strs))
            ])
            if prev_attn > 0.3:
                print(f"  Layer {layer}, Head {head}: avg prev-token attn = {prev_attn:.3f}")
    
    Heads with strong previous-token attention:
      Layer 1, Head 5: avg prev-token attn = 0.412
      Layer 4, Head 11: avg prev-token attn = 0.387
      Layer 5, Head 1: avg prev-token attn = 0.523
    Code Fragment 34.9.4: Step 2 : Extract and visualize attention patterns
  3. Step 3: Apply activation patching to identify circuits

    Activation patching (also called causal tracing) replaces a single activation in a corrupted run with its value from the clean run and measures how much of the model's original output is restored. This localizes which layers and heads are causally responsible for a specific prediction. Here we test which attention heads are critical for the indirect object identification task ("John gave a drink to ___").

    ## Step 3 : Activation patching for circuit discovery
    
    # Create a corrupted version (replace "John" with "Bob" in the first mention)
    clean_prompt = "When Mary and John went to the store, John gave a drink to"
    corrupt_prompt = "When Mary and Bob went to the store, John gave a drink to"
    
    clean_tokens = model.to_tokens(clean_prompt)
    corrupt_tokens = model.to_tokens(corrupt_prompt)
    
    # Per-head outputs (hook_result) must be enabled before caching
    model.set_use_attn_result(True)
    
    # Get clean and corrupt activations
    clean_logits, clean_cache = model.run_with_cache(clean_tokens)
    corrupt_logits, corrupt_cache = model.run_with_cache(corrupt_tokens)
    
    # Target token: " Mary" (the expected completion)
    mary_token_id = model.to_single_token(" Mary")
    clean_logit_diff = (
        clean_logits[0, -1, mary_token_id]
        - clean_logits[0, -1, model.to_single_token(" John")]
    ).item()
    print(f"Clean logit diff (Mary - John): {clean_logit_diff:.3f}")
    
    results = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
    
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Hook: replace this head's output with the clean value
            def patch_hook(value, hook, layer=layer, head=head):
                value[0, :, head, :] = clean_cache[hook.name][0, :, head, :]
                return value
    
            hook_name = f"blocks.{layer}.attn.hook_result"
            patched_logits = model.run_with_hooks(
                corrupt_tokens,
                fwd_hooks=[(hook_name, patch_hook)],
            )
            patched_diff = (
                patched_logits[0, -1, mary_token_id]
                - patched_logits[0, -1, model.to_single_token(" John")]
            ).item()
            results[layer, head] = patched_diff
    
    # Plot the patching results
    fig, ax = plt.subplots(figsize=(14, 6))
    im = ax.imshow(results.numpy(), cmap="RdBu", aspect="auto", vmin=-2, vmax=2)
    ax.set_xlabel("Head")
    ax.set_ylabel("Layer")
    ax.set_title("Activation Patching: Logit Diff Restored per Head")
    plt.colorbar(im, label="Logit diff (Mary - John)")
    plt.tight_layout()
    plt.savefig("activation_patching_results.png", dpi=150)
    plt.show()
    
    # Identify the most important heads
    top_heads = []
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            if abs(results[layer, head].item()) > 0.5:
                top_heads.append((layer, head, results[layer, head].item()))
    top_heads.sort(key=lambda x: abs(x[2]), reverse=True)
    print("\nTop contributing heads (by patching effect):")
    for layer, head, effect in top_heads[:10]:
        print(f"  Layer {layer}, Head {head}: effect = {effect:.3f}")
    
    Clean logit diff (Mary - John): 3.142
    
    Top contributing heads (by patching effect):
      Layer 9, Head 9: effect = 1.847
      Layer 9, Head 6: effect = 1.523
      Layer 10, Head 0: effect = 1.291
      Layer 10, Head 7: effect = -0.984
      Layer 11, Head 10: effect = 0.873
      Layer 7, Head 3: effect = 0.761
      Layer 8, Head 6: effect = 0.692
      Layer 7, Head 9: effect = 0.641
      Layer 10, Head 10: effect = -0.587
      Layer 8, Head 10: effect = 0.534
    Code Fragment 34.9.5: Step 3 : Activation patching for circuit discovery
  4. Step 4: Build a structured reasoning pipeline with DSPy

    Now shift from interpretability to structured reasoning. Use DSPy to define a multi-step reasoning chain that analyzes interpretability findings and generates hypotheses about model behavior. This demonstrates the Right Tool pattern: TransformerLens is the right tool for probing activations, while DSPy is the right tool for orchestrating multi-step LLM reasoning with typed signatures.

    ## Step 4 : DSPy structured reasoning pipeline
    import dspy
    
    # Configure DSPy with a language model
    # Replace with your preferred provider and model
    lm = dspy.LM("openai/gpt-4o-mini")
    dspy.configure(lm=lm)
    
    
    class AnalyzeCircuit(dspy.Signature):
        """Given interpretability findings from activation patching,
        generate a hypothesis about the circuit's function."""
        patching_results: str = dspy.InputField(
            desc="Summary of which heads had the largest patching effect"
        )
        task_description: str = dspy.InputField(
            desc="The task the model was performing"
        )
        hypothesis: str = dspy.OutputField(
            desc="A testable hypothesis about the circuit's computational role"
        )
        next_experiment: str = dspy.OutputField(
            desc="A concrete follow-up experiment to validate the hypothesis"
        )
    
    
    class InterpretabilityAnalyzer(dspy.Module):
        """Multi-step reasoning: summarize findings, hypothesize, plan next experiment."""
        def __init__(self):
            super().__init__()
            self.analyze = dspy.ChainOfThought(AnalyzeCircuit)
    
        def forward(self, patching_results, task_description):
            return self.analyze(
                patching_results=patching_results,
                task_description=task_description,
            )
    
    
    # Format the patching results as a summary string
    findings = "\n".join([
        f"Layer {l}, Head {h}: patching effect = {e:.3f}"
        for l, h, e in top_heads[:10]
    ])
    
    analyzer = InterpretabilityAnalyzer()
    result = analyzer(
        patching_results=findings,
        task_description=(
            "Indirect object identification: given 'When Mary and John went "
            "to the store, John gave a drink to ___', the model should "
            "predict 'Mary'."
        ),
    )
    
    print("Hypothesis:", result.hypothesis)
    print("\nSuggested next experiment:", result.next_experiment)
    
    Hypothesis: The heads at Layer 9 (Heads 6, 9) and Layer 10 (Head 0) form an indirect object identification circuit. These heads likely perform name mover operations, copying the "Mary" token from its earlier position to the final prediction position by attending to the first name mentioned in the context.
    
    Suggested next experiment: Test the circuit with different name pairs (e.g., "Alice and Bob") to verify the heads generalize across names rather than being specific to "Mary" and "John". If the same heads show large patching effects with novel names, the circuit is performing a general name-mover function.
    Code Fragment 34.9.6: Step 4 : DSPy structured reasoning pipeline
  5. Step 5: Generate a combined interpretability report

    Bring together the attention visualizations, patching results, and DSPy analysis into a single summary. This final step reinforces the value of combining low-level mechanistic tools with high-level reasoning tools.

    ## Step 5 : Combined report
    
    print("=" * 60)
    print("INTERPRETABILITY REPORT: Indirect Object Identification")
    print("=" * 60)
    print(f"\nModel: GPT-2-small ({model.cfg.n_layers} layers, "
     f"{model.cfg.n_heads} heads/layer)")
    print(f"Prompt: '{clean_prompt}'")
    print(f"Expected completion: ' Mary'")
    print(f"Clean logit diff (Mary - John): {clean_logit_diff:.3f}")
    
    print(f"\nTop {min(5, len(top_heads))} heads by activation patching effect:")
    for layer, head, effect in top_heads[:5]:
        direction = "promotes Mary" if effect > 0 else "promotes John"
        print(f"  L{layer}H{head}: effect = {effect:+.3f} ({direction})")
    
    print(f"\nGenerated hypothesis:\n {result.hypothesis}")
    print(f"\nProposed follow-up:\n {result.next_experiment}")
    print("\nArtifacts saved:")
    print(" attention_heatmap_L5H1.png")
    print(" activation_patching_results.png")
    
    ============================================================
    INTERPRETABILITY REPORT: Indirect Object Identification
    ============================================================
    
    Model: GPT-2-small (12 layers, 12 heads/layer)
    Prompt: 'When Mary and John went to the store, John gave a drink to'
    Expected completion: ' Mary'
    Clean logit diff (Mary - John): 3.142
    
    Top 5 heads by activation patching effect:
      L9H9: effect = +1.847 (promotes Mary)
      L9H6: effect = +1.523 (promotes Mary)
      L10H0: effect = +1.291 (promotes Mary)
      L10H7: effect = -0.984 (promotes John)
      L11H10: effect = +0.873 (promotes Mary)
    
    Generated hypothesis:
      The heads at Layer 9 and Layer 10 form an indirect object identification circuit...
    
    Proposed follow-up:
      Test with different name pairs to verify generalization...
    
    Artifacts saved:
      attention_heatmap_L5H1.png
      activation_patching_results.png
    Code Fragment 34.9.7: Step 5 : Combined report

Extensions

  • Repeat the patching experiment on GPT-2-medium to see whether the same heads are responsible at larger scale, or whether the circuit migrates to different layers.
  • Use TransformerLens logit attribution to decompose the final prediction into per-head contributions and compare with the patching results.
  • Extend the DSPy pipeline to take the suggested next experiment, run it automatically via TransformerLens hooks, and feed the results back into a second reasoning step (a closed-loop interpretability agent).
  • Swap the indirect object identification task for a factual recall task (e.g., "The Eiffel Tower is in") and compare which circuits activate.
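The logit attribution extension rests on a linearity fact: because each head's output is added into the residual stream, the contribution of the attention heads to the logit difference decomposes exactly into per-head terms (each head's output projected onto the unembedding direction for ' Mary' minus ' John'). A toy NumPy sketch of that identity, using random stand-ins for the head outputs and the unembedding direction rather than real GPT-2 weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_heads = 64, 12, 12

# Random stand-ins: per-head outputs written into the residual stream at the
# final token position (real values would come from a cached forward pass).
head_outputs = rng.normal(size=(n_layers, n_heads, d_model))

# Stand-in for the logit-difference direction, i.e. W_U[:, mary] - W_U[:, john].
logit_diff_direction = rng.normal(size=d_model)

# Per-head contribution = head output projected onto the logit-diff direction.
per_head = head_outputs @ logit_diff_direction        # shape (n_layers, n_heads)

# Linearity check: summing per-head contributions recovers the heads' total effect.
total_from_heads = head_outputs.sum(axis=(0, 1)) @ logit_diff_direction
assert np.isclose(per_head.sum(), total_from_heads)

# Rank heads by absolute contribution, mirroring the report format above.
flat = [(l, h, per_head[l, h]) for l in range(n_layers) for h in range(n_heads)]
for l, h, c in sorted(flat, key=lambda t: -abs(t[2]))[:3]:
    print(f"L{l}H{h}: contribution = {c:+.3f}")
```

With real cached activations (e.g., from TransformerLens), the same projection ranks heads by their direct effect on the prediction, which can then be compared head-for-head with the activation patching results.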
Exercise 34.9.1: Tool Cost Analysis

An agent has 50 tools available. Each tool schema costs approximately 200 tokens. The agent processes 10,000 queries per day, and the LLM charges $3 per million input tokens.

  1. What is the daily cost of including all 50 tool schemas in every request?
  2. If a dynamic tool router (like the one in this section) reduces the average schema size to 6 tools per request, what are the daily savings?
  3. Under what conditions would the overhead of the routing step itself outweigh the savings?
Answer

  1. 50 tools × 200 tokens = 10,000 schema tokens per request. At 10,000 requests/day, that is 100 million tokens/day. Cost: 100M × $3/1M = $300/day.
  2. With routing: 6 × 200 = 1,200 tokens per request, or 12 million tokens/day. Cost: $36/day. Savings: $264/day, or 88%.
  3. If the router itself requires an LLM call (e.g., embedding the query and computing similarity), the routing cost must be less than the savings. At very low query volumes, the engineering overhead of maintaining the router may not justify the token savings. Also, if most queries genuinely need most tools (e.g., a general-purpose assistant), aggressive routing risks excluding relevant tools and hurting quality.
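The arithmetic in the answer can be checked with a few lines of Python. The figures (tool count, tokens per schema, query volume, price) are the exercise's stated assumptions, not measured values:

```python
TOOLS_FULL, TOOLS_ROUTED = 50, 6
TOKENS_PER_SCHEMA = 200
QUERIES_PER_DAY = 10_000
PRICE_PER_M_INPUT = 3.0  # dollars per million input tokens (assumed rate)

def daily_schema_cost(n_tools: int) -> float:
    """Daily dollar cost of including n_tools schemas in every request."""
    tokens_per_query = n_tools * TOKENS_PER_SCHEMA
    daily_tokens = tokens_per_query * QUERIES_PER_DAY
    return daily_tokens / 1_000_000 * PRICE_PER_M_INPUT

full = daily_schema_cost(TOOLS_FULL)      # all 50 schemas every request
routed = daily_schema_cost(TOOLS_ROUTED)  # dynamic router narrows to ~6 tools
savings_pct = (full - routed) / full * 100

print(full, routed, full - routed)  # 300.0 36.0 264.0
```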

Exercise 34.9.2: Cache Hit Rate Estimation

A tool caching system uses exact-match caching (SHA-256 hash of tool name and arguments). The agent handles a customer support workload where 30% of queries are about order status (with unique order IDs) and 70% are FAQ-style questions that map to a small set of tool calls.

  1. Estimate the expected cache hit rate for exact-match caching. Which query type benefits more from caching?
  2. Would adding semantic caching (cosine similarity on query embeddings) improve the hit rate? For which query type?
  3. Design a cache eviction policy appropriate for this workload.
Answer

  1. FAQ queries (70% of traffic) likely map to a small set of tool calls with high repetition, so the exact-match hit rate for these could be 80%+ after warm-up. Order status queries (30%) have unique IDs, so exact-match hits are near 0%. Overall hit rate: roughly 0.70 × 0.80 + 0.30 × 0.0 = 56%. FAQ queries benefit far more.
  2. Semantic caching might seem to help order status queries: "Where is order ORD-123?" and "What is the status of order ORD-123?" are semantically similar but hash-different. However, because the tool arguments matter (different order IDs produce different results), semantic caching on the query alone would return incorrect results. Semantic caching works best for tool calls where similar queries produce identical results (e.g., "What are your return policies?" vs. "How do I return an item?").
  3. Use a two-tier policy: a short TTL for order-specific cache entries (since order status changes frequently) and a longer TTL for FAQ entries (these change only when policies change). Set a maximum cache size and evict the least-recently-used entries when full.
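A minimal sketch of the two-tier policy from the answer: one exact-match cache keyed on a SHA-256 hash of tool name and arguments, with a short TTL for volatile order-status entries, a long TTL for stable FAQ entries, and LRU eviction when full. The tool names and TTL values are illustrative assumptions, not part of any real API:

```python
import hashlib
import json
import time
from collections import OrderedDict

class TieredToolCache:
    """Exact-match tool-result cache with per-tool TTLs and LRU eviction."""

    # Illustrative TTLs: order status goes stale quickly; FAQ answers do not.
    TTLS = {"get_order_status": 60, "get_faq_answer": 86_400}  # seconds
    DEFAULT_TTL = 300

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._store = OrderedDict()  # key -> (expires_at, value)

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, tool: str, args: dict):
        key = self._key(tool, args)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self._store[key]      # expired: treat as a miss
            return None
        self._store.move_to_end(key)  # refresh LRU position on hit
        return value

    def put(self, tool: str, args: dict, value) -> None:
        if len(self._store) >= self.max_size:
            self._store.popitem(last=False)  # evict least recently used
        ttl = self.TTLS.get(tool, self.DEFAULT_TTL)
        self._store[self._key(tool, args)] = (time.time() + ttl, value)

cache = TieredToolCache()
cache.put("get_faq_answer", {"q": "return policy"}, "30-day returns")
print(cache.get("get_faq_answer", {"q": "return policy"}))  # 30-day returns
print(cache.get("get_order_status", {"id": "ORD-123"}))     # None (miss)
```

Hashing the canonicalized `(tool, args)` pair means two calls hit the same entry only when they would produce the same result, which sidesteps the incorrect-result risk that semantic caching introduces for argument-sensitive tools.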

Key Takeaways

  • Every tool call carries three kinds of cost: token consumption (schemas, arguments, and responses), API latency, and direct monetary cost.
  • Tool schemas dominate the fixed overhead: an agent with 50 tools can spend 5,000 to 15,000 tokens on schemas before any conversation begins, so dynamic tool routing pays off quickly at scale.
  • Caching tool results avoids repeated calls, but cache keys and TTLs must reflect how quickly each tool's results go stale; exact-match caching suits repetitive queries, while argument-sensitive tools are unsafe to cache on query similarity alone.
  • Treat tool orchestration as an economic optimization problem: measure per-call cost, then route and cache accordingly.

What Comes Next

This concludes Chapter 34 on emerging architectures and scaling frontiers. The next chapter, Chapter 35: AI and Society, shifts from technical architecture to the broader societal context in which these systems are deployed, examining economic impact, governance, and the evolving relationship between humans and AI.

References & Further Reading
Foundational Tool Use

Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023.

The seminal paper showing that language models can learn when and how to call external tools through self-supervised training. Established the paradigm of tool-augmented language models discussed throughout this section.


Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334.

Fine-tunes an LLM to accurately generate API calls from documentation, dramatically reducing hallucination in tool invocations. Demonstrates that retrieval-augmented fine-tuning improves tool-use reliability.

Tool Benchmarks & Scaling

Qin, Y., Liang, S., Ye, Y., et al. (2024). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs." ICLR 2024.

Constructs a dataset and framework for training models to use over 16,000 real-world APIs, introducing depth-first search for multi-step tool planning. The primary reference for scaling tool use to realistic complexity.


Li, M., Song, F., Yu, B., et al. (2023). "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." EMNLP 2023.

Provides a systematic benchmark for evaluating LLM tool use across planning, retrieval, and calling stages. Useful for understanding how tool-use capabilities are measured and compared.


Wang, X., Ma, W., Meng, Y., et al. (2024). "T-Bench: Benchmarking Tool Use of Large Language Models." arXiv:2406.10573.

Evaluates tool use across diverse scenarios including multi-tool chains and error recovery. Highlights the gap between single-tool and multi-tool orchestration performance.

Protocols & Infrastructure

Anthropic (2024). "Model Context Protocol (MCP) Specification." Anthropic Technical Documentation.

Defines an open standard for connecting AI models to external tools and data sources through a unified protocol. The leading candidate for standardizing tool integration discussed in this section's coverage of tool economy.
