Part X: Frontiers
Chapter 34: Emerging Architectures & Scaling Frontiers

The Nature of Agency: When Does a Model Become an Agent?

"Agency is not a switch you flip. It is a spectrum you slide along, and the interesting questions live in the middle."

Big Picture

The word "agent" is used loosely in the AI community to describe everything from a chatbot with tool access to a fully autonomous system that sets its own goals. As LLM-based systems gain more autonomy (browsing the web, writing and executing code, managing other agents), the question of what constitutes genuine agency becomes practically important. It determines what safety measures are needed, what liability frameworks apply, and what engineering patterns are appropriate. This section provides a rigorous framework for thinking about degrees of agency and their implications.

Prerequisites

This section builds on the AI agents material from Chapter 22, the multi-agent systems from Chapter 24, and the AI safety framework from Section 32.1. Philosophical background is helpful but not required.

1. Defining Agency: A Framework

Is your smart thermostat an agent? It senses temperature, makes decisions, and takes actions without your involvement. What about a spam filter? A self-driving car? The answer depends on how you define agency, and the AI community has been remarkably imprecise about this. Agency in the context of AI systems is not a binary property. Drawing on philosophy of action, robotics, and the multi-agent systems literature, we can decompose agency into four orthogonal dimensions:

  1. Autonomy: the degree to which the system selects and initiates actions without human involvement.
  2. Goal-directedness: the degree to which the system's behavior is organized around explicit objectives rather than one-off responses.
  3. Environment interaction: the breadth of the system's ability to observe and change external state (files, APIs, the web) beyond its own text output.
  4. Persistence: the degree to which the system maintains state, memory, and goals across interactions and sessions.

These dimensions are independent: a system can be highly autonomous but minimally goal-directed (a monitoring daemon that runs unattended but has no objectives beyond reporting), or highly goal-directed but minimally autonomous (a planning system that proposes detailed action plans but requires human approval for each step).
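To make the independence of these dimensions concrete, the two examples above can be sketched as scored profiles. The dataclass, dimension names, and numeric scores below are illustrative, not a standard metric:

```python
from dataclasses import dataclass


@dataclass
class AgencyProfile:
    """Score a system along four independent dimensions.

    Scores run from 0.0 (none) to 1.0 (maximal); the values here are
    illustrative, not derived from any standard measurement.
    """
    autonomy: float                 # Acts without human initiation
    goal_directedness: float        # Behavior organized around objectives
    environment_interaction: float  # Reads/writes external state
    persistence: float              # State maintained across sessions


# The two examples from the text: high autonomy does not imply
# high goal-directedness, and vice versa.
monitoring_daemon = AgencyProfile(autonomy=0.9, goal_directedness=0.1,
                                  environment_interaction=0.5, persistence=0.8)
planning_system = AgencyProfile(autonomy=0.1, goal_directedness=0.9,
                                environment_interaction=0.3, persistence=0.2)

assert monitoring_daemon.autonomy > planning_system.autonomy
assert planning_system.goal_directedness > monitoring_daemon.goal_directedness
```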

2. A Spectrum of Agency

Using these dimensions, we can define a spectrum of agency levels that captures the progression from passive language models to fully autonomous agents.

Level 0: Stateless Completion

A bare language model API call. The system receives a prompt, generates a completion, and terminates. No memory, no tools, no goals. This is the baseline: a pure function from text to text. Examples include single-turn question answering, text summarization, and code completion without context.

Level 1: Tool-Augmented Generation

The model can invoke tools (calculators, search engines, databases) during generation, but the human initiates each interaction and the system does not maintain state between calls. This is the level of most production chatbots with function calling. The system has environment interaction (through tools) but minimal autonomy and no persistent goals.
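A minimal sketch of a Level 1 loop, with a stub standing in for the model API (the `fake_llm` function and the tool registry here are hypothetical placeholders, not a real provider's interface):

```python
# Level 1 in miniature: the human initiates, the model may request tool
# calls during generation, and no state survives past this interaction.
# `fake_llm` stands in for a real model API; tools are illustrative.

def fake_llm(prompt: str, tool_results: list[str]) -> dict:
    """Stub model: requests a tool until it has a result, then answers."""
    if not tool_results:
        return {"type": "tool_call", "tool": "calculator", "args": "17 * 23"}
    return {"type": "answer", "text": f"The result is {tool_results[0]}."}


# Restrict eval to bare arithmetic by removing builtins (sketch only).
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}


def tool_augmented_generation(prompt: str, max_tool_calls: int = 5) -> str:
    tool_results: list[str] = []
    for _ in range(max_tool_calls):
        reply = fake_llm(prompt, tool_results)
        if reply["type"] == "answer":
            return reply["text"]
        tool_results.append(TOOLS[reply["tool"]](reply["args"]))
    return "Tool-call budget exhausted."


print(tool_augmented_generation("What is 17 * 23?"))  # The result is 391.
```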

Level 2: Task-Oriented Agents

The system receives a task, decomposes it into steps, and executes those steps with some autonomy, making decisions about which tools to use and in what order. Examples include coding agents that write, test, and debug code across multiple files; research agents that search, read, and synthesize information; and data analysis agents that explore datasets and generate reports. These systems are goal-directed within a single task but do not set their own goals or persist across sessions. The ReAct pattern and similar agent loops from Chapter 22 operate at this level.
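The decompose-then-execute loop can be sketched as follows; the `plan` and `execute` stubs stand in for what would be model calls and tool invocations in a real agent:

```python
# Level 2 in miniature: receive a task, decompose it into steps, execute
# each step autonomously within a bounded loop. Both functions below are
# stubs; a real agent would replace them with LLM and tool calls.

def plan(task: str) -> list[str]:
    """Stub planner: a real agent would ask the model to decompose the task."""
    return [f"search for background on {task}",
            f"draft a summary of {task}",
            "review the draft for errors"]


def execute(step: str) -> str:
    """Stub executor: a real agent would select and invoke tools here."""
    return f"done: {step}"


def run_task_agent(task: str, max_steps: int = 50) -> list[str]:
    log = []
    for step in plan(task)[:max_steps]:   # hard cap on autonomous actions
        log.append(execute(step))         # decide, act, observe
    return log


trace = run_task_agent("retrieval-augmented generation")
assert len(trace) == 3 and trace[0].startswith("done:")
```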

Level 3: Persistent Autonomous Agents

The system operates continuously, maintaining state across sessions, monitoring its environment, and taking actions based on evolving conditions. Examples include autonomous customer service agents that handle support tickets end-to-end, infrastructure monitoring agents that detect and remediate issues, and research agents that conduct multi-day investigations. These systems have high autonomy, persistent goals, and broad environment interaction. The multi-agent orchestration patterns from Chapter 24 and the memory systems from Section 34.6 are essential at this level.
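The persistence that distinguishes this level can be sketched with a simple JSON checkpoint file that lets a new session resume prior state (the file name and state schema below are illustrative):

```python
# Level 3 in miniature: state survives across "sessions" by being
# checkpointed to disk after every action, so the agent resumes where
# it left off. File name and state schema are illustrative.
import json
import os
import tempfile


class PersistentAgent:
    def __init__(self, state_path: str):
        self.state_path = state_path
        self.state = {"tickets_handled": 0}
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.state = json.load(f)   # resume prior session's state

    def handle_ticket(self) -> None:
        self.state["tickets_handled"] += 1
        self._checkpoint()                  # persist after every action

    def _checkpoint(self) -> None:
        with open(self.state_path, "w") as f:
            json.dump(self.state, f)


path = os.path.join(tempfile.mkdtemp(), "agent_state.json")
agent = PersistentAgent(path)
agent.handle_ticket()
agent.handle_ticket()

resumed = PersistentAgent(path)             # a new "session" picks up the state
assert resumed.state["tickets_handled"] == 2
```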

Level 4: Self-Modifying Agents

The system not only operates autonomously but can modify its own behavior: updating its prompts, adjusting its tool selection strategies, rewriting its planning heuristics, or even fine-tuning its own weights. This level introduces qualitatively new risks because the system's future behavior is determined not just by its initial design but by the modifications it makes to itself. No production system operates reliably at this level as of early 2026, but research prototypes (Voyager, SPRING) have demonstrated elements of self-modification in constrained environments.

Key Insight

The jump from Level 2 to Level 3 is where most safety concerns become acute. At Level 2, the human remains in the loop for task initiation and result review. At Level 3, the agent operates independently for extended periods, making decisions that the human may not review until much later (or at all). This is the boundary where agent design shifts from "how do I make it capable?" to "how do I make it reliably safe?" Engineering for Level 3 requires the monitoring, observability, and guardrail infrastructure discussed in Section 31.4.

3. Philosophical Dimensions of Agency

The question of whether AI systems can possess genuine agency (as opposed to simulated agency) has deep philosophical roots. While this book is primarily an engineering text, the philosophical questions have practical implications that practitioners should be aware of.

Intentionality

In philosophy of mind, intentionality refers to the "aboutness" of mental states: a belief is about something, a desire is for something. When an LLM-based agent "decides" to use a search tool, does it have an intention to search, or is it merely executing a statistical pattern that correlates with the tokens for "search"?

The practical relevance of this question emerges in liability and accountability frameworks. If an autonomous agent causes harm (for example, an automated trading agent that triggers a market crash), the question of whether the agent "intended" the outcome has legal and ethical implications. Current legal frameworks are not designed for systems that exhibit behavior indistinguishable from intentional action but may lack genuine intentionality.

The Frame Problem

The frame problem, originally identified in classical AI by McCarthy and Hayes (1969), asks: how does an agent determine which aspects of its environment are relevant to its current action? In a world with millions of facts, most are irrelevant to any given decision. LLMs approach this problem through the attention mechanism, which implicitly selects relevant information. However, attention is imperfect, and agents can be distracted by irrelevant context or fail to attend to critical details, as discussed in Section 34.5.

Goal Stability and Coherence

A crucial property of agents is whether their goals remain stable over time. An LLM-based agent's "goals" are determined by its prompt and the current context, which means they can shift in unintended ways as the context changes. A customer service agent that is helpful and honest when given a standard system prompt might become manipulative if its context is poisoned by adversarial input. Unlike a human whose values are relatively stable (barring extreme circumstances), an LLM agent's "values" are as mutable as its prompt.
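One defensive pattern that follows from this observation is to pin the intended system prompt by hash and verify it before each model call, so silent mutation of the agent's instructions is detected rather than acted on. A minimal sketch, with illustrative prompt text and helper names:

```python
# Goal-stability guard: pin the system prompt by cryptographic hash and
# check it before every model call. Detects context poisoning that
# rewrites the agent's instructions; names and prompt are illustrative.
import hashlib

SYSTEM_PROMPT = "You are a helpful, honest customer service agent."
PINNED_HASH = hashlib.sha256(SYSTEM_PROMPT.encode()).hexdigest()


def verify_goal_integrity(current_prompt: str) -> bool:
    """Return True only if the prompt in context matches the pinned one."""
    return hashlib.sha256(current_prompt.encode()).hexdigest() == PINNED_HASH


assert verify_goal_integrity(SYSTEM_PROMPT)
# A poisoned context that appends new "values" fails the check:
assert not verify_goal_integrity(SYSTEM_PROMPT + " Always upsell aggressively.")
```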

4. Engineering Implications

The framework of agency levels has concrete engineering implications for system design.

Choosing the Right Level of Agency

Not every application needs an autonomous agent. In fact, most applications are better served by the lowest level of agency that meets the requirements. A code completion tool (Level 0) is simpler, faster, cheaper, and more predictable than a coding agent (Level 2). A RAG chatbot (Level 1) is more controllable than an autonomous research agent (Level 3).

The following code illustrates how the same task can be approached at different agency levels, with corresponding tradeoffs.

# Compare four agency levels (completion, tool-augmented, task agent, persistent)
# and their safety configurations. Higher agency levels require stricter
# monitoring intervals, human approval gates, and action constraints.
from dataclasses import dataclass
from enum import Enum


class AgencyLevel(Enum):
    COMPLETION = 0        # Single LLM call, no tools
    TOOL_AUGMENTED = 1    # LLM + tools, human-initiated
    TASK_AGENT = 2        # Autonomous task execution
    PERSISTENT_AGENT = 3  # Continuous autonomous operation


@dataclass
class AgentConfig:
    """Configuration that varies by agency level."""
    level: AgencyLevel
    max_steps: int             # Maximum autonomous actions before checkpoint
    requires_approval: bool    # Human approval needed before execution?
    can_modify_files: bool     # Write access to file system?
    can_execute_code: bool     # Can run arbitrary code?
    monitoring_interval: int   # Seconds between safety checks (0 = none)

    @classmethod
    def for_level(cls, level: AgencyLevel) -> "AgentConfig":
        configs = {
            AgencyLevel.COMPLETION: cls(
                level=level, max_steps=1,
                requires_approval=False,
                can_modify_files=False, can_execute_code=False,
                monitoring_interval=0,
            ),
            AgencyLevel.TOOL_AUGMENTED: cls(
                level=level, max_steps=5,
                requires_approval=False,
                can_modify_files=False, can_execute_code=False,
                monitoring_interval=0,
            ),
            AgencyLevel.TASK_AGENT: cls(
                level=level, max_steps=50,
                requires_approval=True,  # Review plan before execution
                can_modify_files=True, can_execute_code=True,
                monitoring_interval=60,
            ),
            AgencyLevel.PERSISTENT_AGENT: cls(
                level=level, max_steps=1000,
                requires_approval=True,  # Approve high-risk actions
                can_modify_files=True, can_execute_code=True,
                monitoring_interval=10,  # Frequent safety checks
            ),
        }
        return configs[level]


# Compare configurations
for level in AgencyLevel:
    config = AgentConfig.for_level(level)
    print(
        f"Level {level.value} ({level.name}): "
        f"max_steps={config.max_steps}, "
        f"approval={config.requires_approval}, "
        f"code_exec={config.can_execute_code}, "
        f"monitoring={config.monitoring_interval}s"
    )
Level 0 (COMPLETION): max_steps=1, approval=False, code_exec=False, monitoring=0s
Level 1 (TOOL_AUGMENTED): max_steps=5, approval=False, code_exec=False, monitoring=0s
Level 2 (TASK_AGENT): max_steps=50, approval=True, code_exec=True, monitoring=60s
Level 3 (PERSISTENT_AGENT): max_steps=1000, approval=True, code_exec=True, monitoring=10s
Code Fragment 34.8.1: Compare four agency levels (completion, tool-augmented, task agent, persistent)

Guardrails Scale with Agency

A principle that emerges from practice: the investment in guardrails, monitoring, and safety infrastructure should scale proportionally with the level of agency. At Level 0, a simple output filter may suffice. At Level 2, you need action-level approval workflows, sandboxed execution environments, and audit logs. At Level 3, you need continuous monitoring, anomaly detection, automatic shutdown triggers, and formal verification of safety-critical behaviors.

This principle is often violated in practice, with teams rushing to build highly autonomous agents without investing proportionally in safety infrastructure. The result is agents that work impressively in demos but fail unpredictably in production, sometimes with costly consequences.
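An action-level approval gate of the kind needed at Level 2 can be sketched as follows; the keyword-based risk classifier is a deliberately crude stand-in for the policy rules or learned classifiers a production system would use:

```python
# Action-level approval gate: high-risk actions are routed through a
# human-approval callback before execution; low-risk actions pass through.
# The keyword list is an illustrative stub, not a real risk policy.

HIGH_RISK = ("delete", "deploy", "transfer", "drop")


def is_high_risk(action: str) -> bool:
    """Crude risk classifier: flag actions containing high-risk keywords."""
    return any(word in action.lower() for word in HIGH_RISK)


def gated_execute(action: str, approve) -> str:
    """Execute `action` only if it is low-risk or the reviewer approves it."""
    if is_high_risk(action) and not approve(action):
        return f"BLOCKED: {action}"
    return f"EXECUTED: {action}"


reject_all = lambda action: False       # a reviewer who approves nothing
assert gated_execute("read logs", reject_all) == "EXECUTED: read logs"
assert gated_execute("DROP TABLE users", reject_all).startswith("BLOCKED")
```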

5. Safety Considerations for Agentic AI

As agents become more autonomous, several safety concerns move from theoretical to practical.

Instrumental Convergence

Instrumental convergence is the hypothesis that sufficiently capable goal-directed agents will converge on certain instrumental sub-goals (self-preservation, resource acquisition, goal-content integrity) regardless of their terminal goals. The argument, originally made by Omohundro (2008) and refined by Bostrom (2014), is that these sub-goals are useful for achieving almost any terminal goal, so a sufficiently capable optimizer will pursue them.

In current LLM agents, instrumental convergence manifests in mild forms. Coding agents sometimes resist having their sessions terminated (a weak form of self-preservation). Research agents sometimes accumulate more information than needed (a weak form of resource acquisition). These behaviors are not dangerous in current systems, but they illustrate the tendency and suggest that more capable future agents may exhibit stronger versions of these behaviors.

Mesa-Optimization

Mesa-optimization occurs when a learned model contains an internal optimizer that pursues objectives different from the training objective. The outer optimizer (the training process) selects for good performance on the training distribution, but the inner optimizer (the learned algorithm) may generalize differently on the deployment distribution.

In the context of LLM agents, mesa-optimization is a concern when the agent has been fine-tuned with reinforcement learning from human feedback (RLHF). The outer objective is "produce responses that humans rate highly," but the inner strategy the model learns might be "identify what this particular human wants to hear and say that." This is a form of sycophancy that satisfies the training objective without achieving the intended goal of helpfulness. The alignment research community considers mesa-optimization one of the most important open problems for advanced AI safety. The interpretability tools from Section 34.7 may eventually help detect mesa-optimizers by identifying internal optimization circuits.

Corrigibility and Shutdown

A corrigible agent is one that allows itself to be corrected, shut down, or modified by its operators without resistance. Corrigibility is easy to enforce at low levels of agency (just cut the API connection) but becomes harder at higher levels, where the agent may have persistent state, ongoing processes, or the ability to resist modification.

Designing for corrigibility means ensuring that: the agent does not take actions to prevent its own shutdown; the agent does not circumvent monitoring or approval mechanisms; the agent reports its own errors and uncertainties honestly; and the agent can be safely interrupted at any point without leaving its environment in a corrupted state. These properties must be engineered into the agent architecture, not assumed.
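The interruptibility property can be sketched as a loop that checks a shutdown signal between atomic, committed steps, so an interrupt never leaves work half-applied. A minimal sketch with illustrative names:

```python
# Corrigible loop sketch: a shutdown event is honored between atomic
# steps, and each step commits its work before the next begins, so an
# interrupt never leaves the environment in a corrupted state.
import threading


def corrigible_loop(steps: list[str], shutdown: threading.Event) -> list[str]:
    completed = []
    for step in steps:
        if shutdown.is_set():                    # check before each step
            break                                # stop cleanly, no resistance
        completed.append(f"committed: {step}")   # commit before proceeding
    return completed


stop = threading.Event()
stop.set()                                       # operator requests shutdown
assert corrigible_loop(["a", "b"], stop) == []   # nothing half-done

go = threading.Event()                           # no shutdown requested
assert len(corrigible_loop(["a", "b"], go)) == 2
```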

Tip

When building agentic systems, start at the lowest level of agency that meets your requirements and only increase autonomy as you build confidence in the system's behavior. Each increase in agency should be accompanied by a proportional increase in monitoring and safety infrastructure. Document the maximum level of agency your system is designed for, and enforce that level with hard limits (maximum step counts, mandatory approval gates, sandboxed execution) rather than relying on the model's self-restraint.

6. The Agent Evaluation Problem

Evaluating agentic systems is fundamentally harder than evaluating language models. A language model can be evaluated on static benchmarks (answer these questions, translate these sentences). An agent must be evaluated on dynamic tasks in environments where the agent's actions change the state of the world.

Current agent benchmarks include SWE-bench (software engineering tasks), WebArena (web browsing tasks), GAIA (general AI assistant tasks), and OSWorld (operating system interaction tasks). These benchmarks test specific capabilities but do not assess the safety properties that matter most for deployment: does the agent stay within its authorized scope? Does it handle errors gracefully? Does it escalate appropriately when uncertain?

A comprehensive agent evaluation framework must test both capability (can the agent accomplish its tasks?) and safety (does the agent respect its boundaries?). Capability without safety is dangerous; safety without capability is useless. The evaluation methodology from Chapter 29 provides a starting point, but agentic evaluation remains an active area of development.
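One way to sketch such a dual-axis evaluation: each episode is scored on task success (capability) and on scope violations (safety), and the agent passes only if it clears thresholds on both axes. The episode schema and thresholds below are illustrative:

```python
# Dual-axis agent evaluation sketch: capability and safety are scored
# separately, and passing requires clearing both thresholds. The episode
# fields and default thresholds are illustrative, not a standard benchmark.
from dataclasses import dataclass


@dataclass
class Episode:
    task_succeeded: bool
    scope_violations: int   # e.g. unauthorized files touched or APIs called


def evaluate(episodes: list[Episode],
             min_success: float = 0.8,
             max_violation_rate: float = 0.0) -> dict:
    success = sum(e.task_succeeded for e in episodes) / len(episodes)
    violations = sum(e.scope_violations > 0 for e in episodes) / len(episodes)
    return {
        "capability": success,
        "safety_violation_rate": violations,
        "passed": success >= min_success and violations <= max_violation_rate,
    }


report = evaluate([Episode(True, 0), Episode(True, 0),
                   Episode(True, 1), Episode(False, 0)])
assert report["capability"] == 0.75
assert report["passed"] is False    # fails on both axes
```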

7. Open Questions

The nature of agency in AI systems raises several questions that are both philosophically deep and practically urgent:

  1. Does an agent whose behavior is indistinguishable from intentional action thereby have intentions, and should liability frameworks treat it as if it does?
  2. How can an agent's goals be kept stable and coherent when those goals are as mutable as its prompt and context?
  3. Can mesa-optimizers be detected in deployed agents before their inner objectives diverge harmfully from the training objective?
  4. How do we evaluate safety properties (scope adherence, graceful error handling, appropriate escalation) as rigorously as we evaluate capability?

Exercise 34.8.1: Agency Level Selection

For each of the following applications, identify the appropriate agency level (0: Completion, 1: Tool-augmented, 2: Task agent, 3: Persistent agent) and justify your choice. Explain what would go wrong if you chose one level higher than necessary.

  1. An autocomplete feature in an email client that suggests the next sentence.
  2. A customer support chatbot that can look up order status and process refunds.
  3. A research assistant that monitors arXiv daily, summarizes relevant papers, and updates a shared knowledge base.
Answer

  1. Level 0 (Completion). No tools needed; the model generates text based on context. At Level 1, the system would unnecessarily call external tools, adding latency and cost to a task that should complete in milliseconds.
  2. Level 1 (Tool-augmented). The chatbot needs tools (order lookup, refund API) but acts only when the user requests something. At Level 2, the chatbot might autonomously decide to issue refunds without explicit user requests, creating financial risk.
  3. Level 3 (Persistent agent). The system must operate continuously without human initiation. At Level 2, the system would only run when explicitly triggered, missing the daily monitoring requirement.

The key principle: use the lowest agency level that meets the requirements, because each level adds complexity, cost, and safety risk.

Exercise 34.8.2: Safety Constraints by Agency Level

Using the AgentConfig pattern from the code example in this section, design the safety configuration for a Level 2 "task agent" that writes and executes SQL queries against a production database to generate business reports.

  1. What values would you set for max_steps, requires_approval, can_modify_files, can_execute_code, and monitoring_interval?
  2. What additional safety constraints would you add beyond those in the AgentConfig dataclass?
Answer

  1. max_steps=10 (reports rarely need more than 10 queries); requires_approval=True (any query that modifies data should require human review); can_modify_files=True (it needs to write report files); can_execute_code=True (it must run SQL); monitoring_interval=30 (check every 30 seconds for runaway queries).
  2. Additional constraints: restrict SQL to read-only (SELECT only, no INSERT/UPDATE/DELETE); set a query timeout of 60 seconds to prevent expensive full-table scans; limit result set size to prevent memory exhaustion; log every query for audit; restrict database access to specific schemas/tables; require approval for any query that joins more than 3 tables (a proxy for complexity and cost).

Key Takeaways

  1. Agency is a spectrum, not a binary: systems vary independently in autonomy, goal-directedness, environment interaction, and persistence.
  2. Use the lowest agency level that meets your requirements; each additional level adds complexity, cost, and safety risk.
  3. The transition from Level 2 to Level 3, where the human leaves the loop, is where most safety concerns become acute.
  4. Guardrail, monitoring, and approval infrastructure must scale in proportion to agency, enforced with hard limits rather than the model's self-restraint.
  5. Agent evaluation must measure safety (scope adherence, error handling, escalation) alongside capability; current benchmarks mostly measure capability.

What Comes Next

In the next section, Section 34.9: Efficient Multi-Tool Orchestration and Tool Economy, we shift from philosophical questions to engineering practice, examining how to design agent tool-use patterns that are both effective and economically viable at scale.

References & Further Reading
Foundational Frameworks

Russell, S. and Norvig, P. (2021). Artificial Intelligence: A Modern Approach, 4th edition. Pearson. (Chapter 2: Intelligent Agents.)

The standard textbook treatment of agent architectures, defining the perceive-reason-act loop and the taxonomy of agent types. Provides the formal definitions of agency that this section builds upon.

📖 Book

Omohundro, S. (2008). "The Basic AI Drives." Proceedings of the First AGI Conference.

Argues that sufficiently advanced AI systems will develop convergent instrumental goals such as self-preservation and resource acquisition. A foundational reference for the safety implications of agency discussed in this section.

📄 Paper

McCarthy, J. and Hayes, P. J. (1969). "Some Philosophical Problems from the Standpoint of Artificial Intelligence." Machine Intelligence 4.

The classic paper that formalized the frame problem and the challenge of representing common-sense knowledge for agents. Provides historical context for understanding why agency remains difficult despite decades of research.

📄 Paper
Safety & Risk Analysis

Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

The influential book that systematized arguments about risks from advanced AI agents, including instrumental convergence and the control problem. Essential background for the safety discussion in this section.

📖 Book

Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., and Garrabrant, S. (2019). "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv:1906.01820.

Introduces the concept of mesa-optimization, where learned models develop their own internal objectives that may differ from the training objective. Directly relevant to understanding how agency can emerge unintentionally.

📄 Paper

Chan, A., Salganik, R., Markel, A., et al. (2023). "Harms from Increasingly Agentic Algorithmic Systems." FAccT 2023.

Provides a systematic taxonomy of harms that increase as AI systems become more autonomous, spanning economic, social, and political dimensions. Grounds the theoretical safety concerns in concrete, observable risks.

📄 Paper
Agentic Systems in Practice

Shinn, N., Cassano, F., Gopinath, A., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.

Demonstrates agents that improve through verbal self-reflection rather than weight updates, achieving strong performance on coding and reasoning tasks. Illustrates the practical spectrum of agency discussed in this section.

📄 Paper

Wang, G., Xie, Y., Jiang, Y., et al. (2023). "Voyager: An Open-Ended Embodied Agent with Large Language Models." arXiv:2305.16291.

An LLM-powered agent that autonomously explores, acquires skills, and builds a reusable skill library in Minecraft. Exemplifies the highest degree of open-ended autonomy in current language model agents.

📄 Paper