Prompting for long-horizon reasoning: effort and self-checks
Techniques that make models reliable on long, multi-step tasks: decomposition, explicit effort budgets, scratchpads, and self-verification passes.
A single-turn prompt is a request. A long-horizon task is a contract. The model needs to hold an objective across many steps, avoid drifting, detect when something has gone wrong, and arrive at a result that actually satisfies the original intent. Most prompt failures on multi-step work aren’t model failures; they’re missing structure.
The techniques below are well-established, order-independent, and composable. Use any subset; each one narrows the gap between what you asked and what gets delivered.
Decompose before you act
A model that jumps straight to execution on a complex task will solve a simpler version of the problem than the one you gave it. Ask for a plan first.
Before taking any action, list the sub-goals required to complete this task.
Number them. Identify any dependencies between them. Then proceed step by step,
stating which sub-goal you are currently addressing.
This forces disambiguation upfront. Ambiguous requirements surface as gaps in the plan, not as silent wrong turns buried five steps in. If you’re running an agent with tools, the planning pass also gives you a natural confirmation gate before anything consequential executes.
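Once the model has returned a numbered plan with dependencies, the harness can check it before anything runs. The following is a minimal sketch, assuming the plan has already been parsed into a mapping from sub-goal number to the sub-goals it depends on; the function name `execution_order` and the plan shape are illustrative assumptions, not part of any agent framework.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(plan: dict[int, set[int]]) -> list[int]:
    """Return sub-goal numbers in an order that respects their dependencies.

    `plan` maps each sub-goal number to the set of sub-goals it depends on
    (a hypothetical shape chosen for this sketch). Raises ValueError when a
    dependency was never declared or when the plan is circular, which are
    the kinds of gaps a planning pass is meant to surface early.
    """
    declared = set(plan)
    for goal, deps in plan.items():
        missing = deps - declared
        if missing:
            raise ValueError(
                f"sub-goal {goal} depends on undeclared sub-goals {sorted(missing)}"
            )
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle.
        raise ValueError(f"circular dependency in plan: {exc.args[1]}") from exc


# Example plan: 2 and 3 both need 1; 4 needs 2 and 3.
order = execution_order({1: set(), 2: {1}, 3: {1}, 4: {2, 3}})
```

A plan that fails this check is a cheap place to stop and ask for clarification, before any tool call has had side effects.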
Give the model an effort budget
Open-ended instructions invite open-ended effort, which often means either stopping too early or rambling. An explicit budget anchors both depth and scope.
Effort budgets take different shapes depending on the task: a step count, a minimum number of sources to consult, a requirement to handle all items in a list before summarizing, or an explicit instruction to exhaust a search space before returning. The form matters less than the specificity.
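One way to keep budgets specific is to render them from structured fields rather than writing them ad hoc. This is a sketch only: the `EffortBudget` class, its field names, and the exact wording of the generated instructions are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EffortBudget:
    """Hypothetical container for the budget shapes described above."""

    max_steps: Optional[int] = None    # cap on reasoning or tool steps
    min_sources: Optional[int] = None  # minimum number of sources to consult
    cover_all_items: bool = False      # handle every list item before summarizing

    def to_prompt(self) -> str:
        """Render the budget as explicit instructions to append to a task prompt."""
        lines = []
        if self.max_steps is not None:
            lines.append(f"Complete the task in at most {self.max_steps} steps.")
        if self.min_sources is not None:
            lines.append(
                f"Consult at least {self.min_sources} distinct sources before answering."
            )
        if self.cover_all_items:
            lines.append("Handle every item in the list before writing the summary.")
        if not lines:
            # An empty budget is the open-ended instruction we are trying to avoid.
            raise ValueError("an effort budget needs at least one explicit limit")
        return "\n".join(lines)


budget_text = EffortBudget(max_steps=8, min_sources=3).to_prompt()
```

Refusing to render an empty budget is a deliberate choice in this sketch: it makes "no budget" a visible decision rather than a default.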
Use a scratchpad for intermediate state
Working memory inside a context window is real but fragile. On long tasks, early conclusions get diluted by everything written after them. Asking the model to maintain an explicit scratchpad (a running, structured record of what it has found, decided, and still needs to do) keeps critical state visible and addressable.
- Current sub-goal: which step is in progress right now
- Confirmed facts: what has been verified, not just retrieved
- Open questions: things that need resolution before finishing
- Decisions made: choices taken and the reasoning behind them
- Remaining steps: what is still outstanding
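The five fields above map naturally onto a small data structure that the harness can re-render into the prompt on every turn. This is a minimal sketch; the `Scratchpad` class and its method names are hypothetical, and the rendered layout is one option among many.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    """Hypothetical structured scratchpad mirroring the fields listed above."""

    current_subgoal: str = ""
    confirmed_facts: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    decisions_made: list[str] = field(default_factory=list)
    remaining_steps: list[str] = field(default_factory=list)

    def advance(self) -> None:
        """Make the next outstanding step the current sub-goal, if there is one."""
        self.current_subgoal = self.remaining_steps.pop(0) if self.remaining_steps else ""

    def render(self) -> str:
        """Return the scratchpad as plain text for inclusion in the next prompt."""

        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"{title}:\n{body}"

        return "\n\n".join(
            [
                f"Current sub-goal: {self.current_subgoal or '(not set)'}",
                section("Confirmed facts", self.confirmed_facts),
                section("Open questions", self.open_questions),
                section("Decisions made", self.decisions_made),
                section("Remaining steps", self.remaining_steps),
            ]
        )
```

Rendering empty sections as "(none)" rather than omitting them keeps the layout stable, so the model always finds each field in the same place.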
For agents that run across multiple turns or tool calls, externalizing this state to a tool (writing it to a resource the agent can read back) is more reliable than relying on in-context memory alone. That is the prompt-side argument for MCP-first workflows: durable procedures that store state where the agent can actually find it.
Add a self-verification pass
The most consistently underused technique is the self-check. Before the model finalizes output, ask it to read the result against the original requirements.
Before returning your final answer, re-read the original task requirements and
verify:
1. Every requirement is addressed.
2. No step was skipped or assumed without justification.
3. Any outputs that must be consistent with each other (IDs, totals, names) are.
List any discrepancies you find, then correct them.
This is not asking the model to “double-check” in a vague sense. It is giving it a concrete rubric. The model compares a specific thing against a specific list. The difference in output quality is measurable.
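The rubric becomes more concrete still when it is generated from the task's own requirement list instead of a generic template. The sketch below assumes the requirements are available as a list of strings; `build_self_check` and the PASS/FAIL wording are illustrative choices, not a fixed format.

```python
def build_self_check(requirements: list[str]) -> str:
    """Build a verification prompt that checks output against specific requirements.

    `requirements` is assumed to be the task's requirements, one per string.
    """
    if not requirements:
        raise ValueError("a self-check needs at least one requirement")
    checklist = "\n".join(
        f"{number}. {requirement}"
        for number, requirement in enumerate(requirements, start=1)
    )
    return (
        "Before returning your final answer, check the result against each "
        "requirement below. For each one, answer PASS or FAIL with a one-line reason.\n"
        f"{checklist}\n"
        "List any discrepancies you find, then correct them."
    )


self_check = build_self_check(
    [
        "Every invoice ID in the summary appears in the source data.",
        "The reported total equals the sum of the line items.",
    ]
)
```

Asking for a verdict per requirement makes skipped checks visible: a missing line in the response is itself a discrepancy.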
Externalize state rather than remember it
There is a limit to what models reliably carry across dozens of steps. The solution is not to push that limit further with better prompts; it is to stop requiring the model to hold things in memory that could live somewhere else.
For tasks that involve accumulating results (a list of processed items, a ledger of decisions, a set of constraints that must remain consistent), give the model a tool it can write to and read back from. Prompts that say “keep track of X” are weaker than architectures where X is a resource the agent can actually retrieve.
This is the core argument behind long-horizon agent patterns: state that lives outside the model is state that survives context resets, parallel branches, and handoffs between agents.
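As a rough illustration of state that lives outside the model, here is a small file-backed ledger that could sit behind a read tool and a write tool. Everything about it is an assumption made for the sketch: the `StateStore` name, the JSON layout, and the choice of a local file rather than a database or an MCP resource.

```python
import json
from pathlib import Path


class StateStore:
    """Hypothetical file-backed ledger an agent's tools could read and write."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def read(self) -> dict:
        """Return the stored state, or an empty ledger if nothing is saved yet."""
        if not self.path.exists():
            return {"processed": [], "decisions": []}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def _write(self, state: dict) -> None:
        # Write to a temporary file, then swap it in, so a crash mid-write
        # does not leave a half-written ledger behind.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
        tmp.replace(self.path)

    def mark_processed(self, item_id: str) -> None:
        """Record an item as processed; repeated calls for the same item are no-ops."""
        state = self.read()
        if item_id not in state["processed"]:
            state["processed"].append(item_id)
            self._write(state)

    def record_decision(self, choice: str, reason: str) -> None:
        """Append a decision together with the reasoning behind it."""
        state = self.read()
        state["decisions"].append({"choice": choice, "reason": reason})
        self._write(state)
```

Because every call re-reads the file, a fresh `StateStore` pointed at the same path sees what earlier turns wrote, which is the property that lets state outlive a context reset. This sketch does no locking, so parallel branches writing to the same file would need coordination that is not shown here.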
The compound effect
None of these techniques is exotic. Decomposition, explicit budgets, scratchpads, self-checks, external state: each one addresses a specific failure mode. Together they shift a model from opportunistic to systematic: it plans, it tracks, it verifies, it externalizes.
That shift is not a property of the model. It is a property of the prompt and the architecture around it.
A reliable long-horizon agent is not a smarter model; it is a model with a durable procedure and nowhere to cut corners.