Design patterns for long-horizon agents

A single-turn agent is straightforward to reason about: one prompt, one answer, one action. A long-horizon agent is a different creature entirely. It runs for minutes or hours, calls dozens of tools, forks into parallel subtasks, accumulates state across steps, and eventually does something irreversible: it sends the email, merges the branch, or charges the card. Every step it takes is an opportunity for small errors to compound into large ones.

The patterns below don’t require a specific framework. They are architectural disciplines that apply wherever an agent runs for more than a few steps.

Decompose before you act

The most common failure mode in long-horizon agents is ambition without structure. An agent given a large goal starts executing immediately and discovers halfway through that it misunderstood the scope, that step four depends on a resource that doesn’t exist, or that the goal was internally contradictory.

Decomposition inverts this. Before the agent touches any external system, it breaks the goal into a dependency-ordered plan: a directed acyclic graph of subtasks, each with a clear success criterion and an identified set of required capabilities. The plan is a first-class artifact, something that can be logged, reviewed, and amended, not an implicit mental model the agent carries in its context window.

A plan also surfaces irreversible steps before they are taken. When the agent builds the graph, it can annotate nodes by reversibility. Steps that can be undone (writing a draft, fetching data, creating a branch) can proceed freely. Steps that cannot be undone need a confirmation gate before execution, not after.
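As a minimal sketch of such a plan, the graph can be held as a list of step records and ordered with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The `Step` fields, the step names, and the helper functions are assumptions made for illustration; they do not come from any particular framework.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    """One node in the plan graph (field names are illustrative)."""
    name: str
    success_criterion: str
    depends_on: list[str] = field(default_factory=list)
    reversible: bool = True  # irreversible steps need a gate before execution


def order_plan(steps: list[Step]) -> list[Step]:
    """Return steps in dependency order.

    Raises ValueError for a dependency on an unknown step and
    graphlib.CycleError if the plan is not acyclic.
    """
    by_name = {s.name: s for s in steps}
    for s in steps:
        for dep in s.depends_on:
            if dep not in by_name:
                raise ValueError(f"{s.name} depends on unknown step {dep!r}")
    graph = {s.name: set(s.depends_on) for s in steps}
    return [by_name[n] for n in TopologicalSorter(graph).static_order()]


def gated_steps(steps: list[Step]) -> list[str]:
    """Names of the steps that must wait for confirmation."""
    return [s.name for s in steps if not s.reversible]


# Hypothetical three-step plan, deliberately listed out of order.
plan = [
    Step("send_email", "provider returns a message id", ["write_draft"], reversible=False),
    Step("write_draft", "draft saved", ["fetch_data"]),
    Step("fetch_data", "rows returned"),
]
ordered = order_plan(plan)
```

Because the missing-resource and cycle checks run before any step executes, a plan that cannot succeed is rejected while it is still only a data structure.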

Delegate to focused sub-agents

A single agent accumulating context across fifty tool calls is brittle. The context grows stale, earlier reasoning drifts out of scope, and token pressure eventually forces the model to summarize rather than recall.

The alternative is delegation: the orchestrating agent stays thin, managing the plan and routing results, while focused sub-agents handle bounded subtasks. Each sub-agent receives only the context it needs, operates within a narrow capability scope, and returns a typed result. It does not share memory with sibling agents; it communicates through the orchestrator.
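One way to picture this arrangement is the sketch below. It assumes a sub-agent can be modeled as a plain function from a task to a typed result; `SubTask`, `Result`, `require_tool`, `Orchestrator`, and the `summarize` stand-in are hypothetical names, and a real sub-agent would wrap a model call rather than truncate a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubTask:
    name: str
    context: dict             # only the slice of context this sub-agent needs
    allowed_tools: frozenset  # narrow capability scope


@dataclass(frozen=True)
class Result:
    task: str
    ok: bool
    output: str


def require_tool(task: SubTask, tool: str) -> None:
    """Refuse any tool outside the sub-task's declared scope."""
    if tool not in task.allowed_tools:
        raise PermissionError(f"{task.name} may not use {tool!r}")


def summarize(task: SubTask) -> Result:
    """Stand-in sub-agent: truncation plays the role of summarizing."""
    require_tool(task, "read_text")
    return Result(task.name, True, task.context["text"][:20])


class Orchestrator:
    """Stays thin: routes tasks to sub-agents and collects their typed results."""

    def __init__(self, agents: dict[str, Callable[[SubTask], Result]]):
        self.agents = agents
        self.results: dict[str, Result] = {}

    def delegate(self, agent_name: str, task: SubTask) -> Result:
        result = self.agents[agent_name](task)
        self.results[task.name] = result  # siblings see results only via the orchestrator
        return result


# The orchestrator holds the full context but hands over only the relevant part.
full_context = {"text": "Quarterly revenue rose on strong renewals.", "api_key": "secret"}
orchestrator = Orchestrator({"summarizer": summarize})
task = SubTask("summarize_report", {"text": full_context["text"]}, frozenset({"read_text"}))
result = orchestrator.delegate("summarizer", task)
```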

This mirrors how good engineering organizations work. The manager does not write every line of code; the specialist does not architect the whole system. Tight scope produces better outputs and makes failures easier to isolate.

Checkpoint and recover

Long-horizon agents fail. Networks time out, rate limits are hit, the model produces an unparseable response. Without checkpointing, a failure at step forty means restarting from step one and rerunning every earlier step, some of which may already have written to external systems.

Checkpointing means persisting the plan’s execution state at regular intervals: which steps completed, what their outputs were, what the current pending step is. On restart, the agent loads the checkpoint and resumes from the last verified position rather than the beginning.

Two details matter here. First, checkpoints should be written after a step is confirmed complete, not before it starts: a checkpoint written for a step that then fails records phantom progress. Second, idempotency is the complement of checkpointing: if a step can be safely re-executed without duplicating its side effects, recovery is trivial. Design tool calls to be idempotent wherever possible.
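The write-after-completion rule can be sketched as follows, assuming step outputs are JSON-serializable and the checkpoint lives in a single local file. The function names and the file layout are illustrative assumptions, not a prescribed format.

```python
import json
import os
from pathlib import Path
from typing import Callable


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or an empty state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": {}}


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temporary file and rename it, so a crash mid-write
    # cannot leave a half-written checkpoint behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_plan(steps: list[tuple[str, Callable[[], str]]], path: Path) -> dict:
    """Run steps in order, skipping any that the checkpoint marks complete."""
    state = load_checkpoint(path)
    for name, action in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run
        output = action()  # an exception here leaves the checkpoint untouched
        state["completed"][name] = output
        save_checkpoint(path, state)  # persisted only after the step succeeded
    return state
```

A step that crashes after its side effect but before `save_checkpoint` returns will run again on restart, which is the case idempotent tool calls are meant to cover.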

Verify intermediate results

Agents are optimistic. They tend to treat the output of step N as correct input for step N+1 without questioning it. In a ten-step chain, a small error at step three can invalidate every subsequent step while the agent proceeds confidently toward a wrong answer.

Self-verification adds a lightweight check after consequential steps: does this output match the expected schema? Does the claimed resource actually exist? Is this extraction consistent with the source data? The agent does not need to re-run the entire step; it needs enough evidence to decide whether to continue or flag for review.

Verification has a cost, so apply it selectively. Steps that feed irreversible actions downstream warrant stricter checks than steps that produce intermediate summaries. The risk model gives you a vocabulary for this: a step tagged critical downstream triggers a verification check upstream.
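A hedged sketch of selective verification: every output gets a cheap schema check, and only outputs that feed something consequential also get a grounding check against the source text. The `Verdict` type, the function names, and the invoice data are made up for illustration, and the substring match is a deliberately crude stand-in for a real consistency check.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Verdict:
    ok: bool
    reason: str = ""


def verify_schema(output: Any, required: dict[str, type]) -> Verdict:
    """Cheap check: required fields are present and have the expected types."""
    if not isinstance(output, dict):
        return Verdict(False, "output is not a mapping")
    for key, expected in required.items():
        if key not in output:
            return Verdict(False, f"missing field {key!r}")
        if not isinstance(output[key], expected):
            return Verdict(False, f"field {key!r} is not {expected.__name__}")
    return Verdict(True)


def verify_grounded(value: str, source: str) -> Verdict:
    """Stricter check: the extracted value appears verbatim in the source."""
    if value in source:
        return Verdict(True)
    return Verdict(False, f"{value!r} not found in source")


def check_step(output: Any, required: dict[str, type], source: str = "",
               grounded_field: Optional[str] = None) -> Verdict:
    """Always check the schema; check grounding only when a field is named."""
    verdict = verify_schema(output, required)
    if verdict.ok and grounded_field is not None:
        if grounded_field not in output:
            return Verdict(False, f"missing field {grounded_field!r}")
        verdict = verify_grounded(str(output[grounded_field]), source)
    return verdict


# Made-up example data.
source = "Invoice INV-1042 is due next month."
schema = {"invoice_id": str, "amount": float}
good = {"invoice_id": "INV-1042", "amount": 120.0}
transposed = {"invoice_id": "INV-1024", "amount": 120.0}
```

Here `transposed` passes the schema check alone and is caught only when `grounded_field="invoice_id"` is requested, which is the difference between the lighter and the stricter tier.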

Set explicit budgets

Without constraints, long-horizon agents expand to fill available resources. A task estimated at twenty tool calls becomes sixty. A task expected to finish in two minutes runs for twenty. The agent is not malicious; it is thorough, and thoroughness without limits is a reliability problem in production.

Budgets make constraints explicit: a maximum number of tool calls per plan, a wall-clock time limit per sub-agent, a ceiling on tokens consumed per step. When a budget is reached, the agent reports its current state and yields control rather than continuing silently.

Budget dimensions worth tracking

tool_calls         total actions across the plan      Medium
wall_time_s        elapsed seconds per sub-agent      Low
write_ops          mutations to external systems      High
tokens_consumed    context growth across steps        Low
irreversible_ops   actions that cannot be undone      Critical
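Enforcement, as opposed to declaration, might look like the sketch below. It assumes each action declares its cost per budget dimension up front; `Budget`, `BudgetExceeded`, `run_with_budget`, and the limits used are hypothetical, and wall-clock tracking is left out to keep the example deterministic.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    def __init__(self, dimension: str, limit: int):
        super().__init__(f"budget exceeded for {dimension} (limit {limit})")
        self.dimension = dimension
        self.limit = limit


@dataclass
class Budget:
    limits: dict[str, int]
    used: dict[str, int] = field(default_factory=dict)

    def charge(self, costs: dict[str, int]) -> None:
        """Check every dimension first, then record usage, so a refusal charges nothing."""
        for dim, amount in costs.items():
            limit = self.limits.get(dim)
            if limit is not None and self.used.get(dim, 0) + amount > limit:
                raise BudgetExceeded(dim, limit)
        for dim, amount in costs.items():
            self.used[dim] = self.used.get(dim, 0) + amount


def run_with_budget(actions: list[tuple[str, dict[str, int]]], budget: Budget) -> dict:
    """Charge before each action; on exhaustion, report state and yield control."""
    completed = []
    for name, costs in actions:
        try:
            budget.charge(costs)
        except BudgetExceeded as exc:
            return {"status": "budget_exceeded", "dimension": exc.dimension,
                    "completed": completed, "pending": name}
        completed.append(name)  # the real tool call would run here
    return {"status": "finished", "completed": completed, "pending": None}


budget = Budget(limits={"tool_calls": 3, "write_ops": 1})
report = run_with_budget(
    [
        ("fetch", {"tool_calls": 1}),
        ("update_row", {"tool_calls": 1, "write_ops": 1}),
        ("update_again", {"tool_calls": 1, "write_ops": 1}),
    ],
    budget,
)
```

In this run the third action is refused on `write_ops` even though a tool call is still available, and the returned report names the pending step instead of letting the agent continue silently.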

Insert human gates at irreversible steps

Not every step in a long-horizon plan should be fully automated. Some actions cross a threshold where the cost of a wrong decision (the data deleted, the message sent to ten thousand users, the infrastructure torn down) exceeds what any automated verification can guarantee. These steps need a human in the loop.

Human gates are not a failure of agent design; they are correct agent design. The agent’s role is to prepare the decision: gather context, validate preconditions, surface the plan and its consequences clearly. The human’s role is to authorize the irreversible step. The agent resumes after authorization.

The pattern for gates follows naturally from the workflow system: the agent reaches a node tagged requires_confirmation, emits a structured request with all relevant context, and suspends. The human reviews, approves or rejects, and the workflow either proceeds or terminates. The approval event is part of the audit trail.

Steps that warrant a confirmation gate

bulk.delete           record sets, storage objects              Critical
comms.send_external   email, SMS, webhooks to third parties     Critical
infra.deprovision     destroying compute or storage             Critical
payments.charge       any financial transaction                 Critical
iam.grant_elevated    escalating access rights                  Restricted
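The suspend-and-resume cycle can be sketched as a small in-memory state machine. The `Workflow` class, its fields, and the status strings are assumptions for illustration; a production system would persist the suspended state and deliver the request to a reviewer through some channel this sketch does not model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ConfirmationRequest:
    step: str
    summary: str  # what will happen and what cannot be undone


@dataclass
class Workflow:
    steps: list[dict]  # each: {"name", "summary", "requires_confirmation"}
    position: int = 0
    status: str = "running"
    approved: set = field(default_factory=set)
    executed: list = field(default_factory=list)
    audit: list = field(default_factory=list)

    def advance(self) -> Optional[ConfirmationRequest]:
        """Run until finished, or suspend at the next unapproved gated step."""
        if self.status == "terminated":
            return None
        while self.position < len(self.steps):
            step = self.steps[self.position]
            if step["requires_confirmation"] and step["name"] not in self.approved:
                self.status = "suspended"
                return ConfirmationRequest(step["name"], step["summary"])
            self.executed.append(step["name"])  # stand-in for the real tool call
            self.position += 1
        self.status = "finished"
        return None

    def resolve(self, step: str, approver: str, approved: bool) -> None:
        """Record the human decision, then resume or terminate."""
        self.audit.append({"step": step, "approver": approver, "approved": approved})
        if approved:
            self.approved.add(step)
            self.status = "running"
        else:
            self.status = "terminated"


def example_steps() -> list[dict]:
    return [
        {"name": "draft_notice", "summary": "write the draft", "requires_confirmation": False},
        {"name": "comms.send_external", "summary": "email all users", "requires_confirmation": True},
    ]


wf = Workflow(example_steps())
request = wf.advance()  # runs the draft, then suspends at the gated step
wf.resolve(request.step, approver="alice", approved=True)
wf.advance()            # resumes and finishes
```

The gate sits before execution: nothing is appended to `executed` for a gated step until an approval for that exact step name has been recorded.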

Keep a legible audit trail

A long-horizon agent that succeeds is useful. A long-horizon agent that fails in a way you can diagnose and learn from is valuable. The difference is the audit trail.

Every step should emit a structured event: the action taken, the tool called, the inputs and outputs, the risk level, the timestamp, the agent and sub-agent identities. The trail is not primarily for debugging; it is for accountability. When a consequential action is questioned later, the trail answers: who decided this, on what basis, at what time, and under what authorization.
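A structured event along these lines could be a frozen record serialized as one JSON line per step. The field names below follow the list in the paragraph above, but the exact schema, the `record` helper, and the `authorized_by` field are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    action: str
    tool: str
    inputs: dict
    outputs: dict
    risk_level: str
    agent_id: str
    sub_agent_id: Optional[str]
    authorized_by: Optional[str]  # set when a human approved the step
    timestamp: str                # ISO 8601, UTC


def record(log: list, *, action: str, tool: str, inputs: dict, outputs: dict,
           risk_level: str, agent_id: str, sub_agent_id: Optional[str] = None,
           authorized_by: Optional[str] = None) -> AuditEvent:
    """Build an event, append it to the log as one JSON line, and return it."""
    event = AuditEvent(
        action=action, tool=tool, inputs=inputs, outputs=outputs,
        risk_level=risk_level, agent_id=agent_id, sub_agent_id=sub_agent_id,
        authorized_by=authorized_by,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    log.append(json.dumps(asdict(event), sort_keys=True))
    return event


# Hypothetical usage: the list stands in for an append-only store.
trail: list = []
event = record(
    trail,
    action="send notice",
    tool="comms.send_external",
    inputs={"recipients": 3},
    outputs={"accepted": 3},
    risk_level="critical",
    agent_id="orchestrator-1",
    sub_agent_id="mailer-2",
    authorized_by="alice",
)
```

Keeping each event as a self-contained line means the trail can answer who, on what basis, when, and under what authorization without replaying the run.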

What this requires from your capability layer

These patterns are not free. Decomposition needs a plan representation. Checkpointing needs writable state. Budgets need enforcement, not just declaration. Human gates need a suspension mechanism and a resumption signal. Audit needs structured event emission at every tool call.

This is exactly what a machine-readable capability layer provides: typed tools with risk levels, policy-enforced confirmation gates, first-class audit events, and workflow nodes that can suspend and resume. The patterns above are sound architectural reasoning; the capability layer is the infrastructure that makes them operational.

An agent is only as reliable as the guardrails built into its capability layer.

Long-horizon reliability

The longer the horizon, the more consequential each individual step becomes, and the more every pattern above pays for itself.