Cut your LLM bill by 50 to 90%: caching, routing, right-sizing

Practical levers to shrink inference spend without hurting quality: prompt caching, model routing, context discipline, and capability-level budgets.

Most teams discover their LLM bill is a cost problem the same way they discover a water leak: slowly, then all at once. Inference spend that felt fine at prototype scale can grow tenfold when a handful of agents run continuously in production. The good news: the remedies are well-established, composable, and largely independent of which models you use.

This post covers the practical levers teams apply most successfully, roughly in order of impact.

Prompt caching: pay once, reuse many times

Every LLM API bills for every input token it processes, on every call. If you repeat the same system prompt, long document, or tool schema on every call, you are paying full price for input tokens that have not changed since the last request.

Most frontier providers now offer some form of prompt caching: you mark a stable prefix in your request (some providers detect it automatically), the provider stores the KV representation, and subsequent requests that share that prefix are billed at a steep discount, often in the range of 80 to 90% less for the cached portion, though the exact figure varies by provider. Caching works best on:

  • System prompts and persona instructions that never change between calls
  • Large reference documents or tool manifests included in every request
  • Few-shot examples attached to every invocation of a specific capability

The main cost is discipline: keep the stable prefix stable. Caches match on the prefix, so moving content in and out of the cached region defeats the mechanism. Treat your shared context block like a compiled artifact that you update deliberately, not per-request.
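To see why prefix stability matters, a rough cost estimate helps. The sketch below is illustrative only: the function name `input_cost`, the price per million tokens, and the 90% discount are assumptions, not any provider's actual rate card, and the model ignores any surcharge a provider may apply when a prefix is first written to the cache.

```python
def input_cost(prefix_tokens, dynamic_tokens, calls, hit_rate,
               price_per_mtok=3.00, cached_discount=0.90):
    """Estimate total input-token cost in dollars over `calls` requests.

    All prices are hypothetical placeholders. `hit_rate` is the fraction of
    calls whose stable prefix is served from the provider's cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    per_token = price_per_mtok / 1_000_000
    cached_calls = calls * hit_rate
    uncached_calls = calls - cached_calls
    # The prefix is billed at full price on a miss, at the discounted rate on a hit.
    prefix_cost = prefix_tokens * per_token * (
        uncached_calls + cached_calls * (1 - cached_discount)
    )
    # The per-request part of the prompt is never cached in this model.
    dynamic_cost = dynamic_tokens * per_token * calls
    return prefix_cost + dynamic_cost


no_cache = input_cost(8_000, 500, calls=10_000, hit_rate=0.0)
with_cache = input_cost(8_000, 500, calls=10_000, hit_rate=0.95)
print(f"without caching: ${no_cache:.2f}, with caching: ${with_cache:.2f}")
```

Under these assumed numbers, input cost falls by roughly 80%. Lowering `hit_rate` shows how quickly an unstable prefix erodes the saving.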

Model routing: not every task deserves a frontier model

Frontier models are powerful and expensive. A surprising share of production workload is not frontier-hard.

Classification, intent detection, routing decisions, simple summarisation, and structured extraction from clean text are tasks where a well-prompted smaller model often performs comparably to a much larger one, at a fraction of the cost. Teams that route these tasks deliberately typically see overall inference spend drop significantly even before touching caching.

A practical starting point: audit the last thousand calls your system made. For each, ask whether the task required the capability you paid for. Most teams find a sizeable cluster of calls that could shift to a cheaper tier immediately.
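The audit can start as a few lines of code over a call log. This is a minimal sketch under stated assumptions: the task labels, the two tier names, and the log shape (`task_type` and `tier` keys) are all hypothetical and would need to match whatever your system actually records.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    FRONTIER = "frontier"


# Hypothetical task labels, mirroring the task types listed above.
SMALL_MODEL_TASKS = {
    "classification",
    "intent_detection",
    "routing",
    "simple_summarisation",
    "structured_extraction",
}


def choose_tier(task_type: str) -> Tier:
    """Send known-simple task types to the cheaper tier; default to frontier."""
    return Tier.SMALL if task_type in SMALL_MODEL_TASKS else Tier.FRONTIER


def audit(call_log: list[dict]) -> float:
    """Return the fraction of logged calls that could move to the cheaper tier."""
    if not call_log:
        return 0.0
    movable = sum(
        1 for call in call_log
        if call["tier"] == Tier.FRONTIER.value
        and choose_tier(call["task_type"]) is Tier.SMALL
    )
    return movable / len(call_log)


log = [
    {"task_type": "classification", "tier": "frontier"},
    {"task_type": "draft_generation", "tier": "frontier"},
    {"task_type": "routing", "tier": "small"},
    {"task_type": "intent_detection", "tier": "frontier"},
]
print(f"share of calls that could move down a tier: {audit(log):.0%}")
```

Defaulting unknown task types to the frontier tier is a deliberately conservative choice: a misrouted call costs money instead of quality.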

The architecture discussion covers how capability contracts let you make this routing decision per tool, not per integration.

Context discipline: only retrieve what the model needs

Context windows keep growing. This is a trap.

Every token in context costs money. Longer context also makes models more prone to overlooking material in the middle of the window, which quietly erodes quality. Sending a 60,000-token document when the model needs four paragraphs is both wasteful and counterproductive.

The corrective approach is retrieval-scoped context:

  • Semantic retrieval: embed and search; send only the top-k relevant chunks
  • Tool-scoped context: each capability receives only the state it actually reads
  • Structured references: pass IDs or typed summaries instead of raw full documents
  • Rolling compression: summarise older turns rather than replaying the full transcript

None of this requires sophisticated infrastructure. Even a simple sliding-window truncation policy is better than no policy at all.
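A sliding-window policy of that kind fits in a dozen lines. The sketch below is one possible version, not a recommended implementation: `approx_tokens` counts whitespace-separated words as a stand-in for a real tokenizer, and both function names are made up for illustration.

```python
def approx_tokens(text: str) -> int:
    """Crude size estimate (whitespace-separated words), not a real tokenizer."""
    return len(text.split())


def sliding_window(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined estimated size fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = approx_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i"]
print(sliding_window(history, budget=6))
```

With a budget of 6 estimated tokens, the oldest turn is dropped and the two most recent are kept in order. Swapping `approx_tokens` for your provider's tokenizer makes the budget exact.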

Batching and async: remove the cost of urgency

Synchronous calls carry a latency premium. Many tasks in an agent workflow are not actually time-sensitive: nightly report generation, background classification, bulk enrichment, log analysis. Running these as batched async jobs often attracts significantly lower rates from providers and allows you to fill cheaper capacity windows.

The discipline here is separating "the user is waiting right now" from "this needs to happen eventually". Most workflows contain more of the second than teams initially assume.
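One way to make that separation explicit is to tag each job at submission time. The sketch below assumes a hypothetical `Job` record and `Dispatcher` class; it only decides where work goes and leaves the actual synchronous call or batch submission to the caller.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user_waiting: bool  # True when a person is blocked on the result


class Dispatcher:
    """Route urgent jobs to the synchronous path; park the rest for a later batch."""

    def __init__(self, batch_size: int = 100):
        self.batch_size = batch_size
        self.deferred: deque[Job] = deque()

    def submit(self, job: Job) -> str:
        if job.user_waiting:
            return "sync"  # caller would invoke the model right away
        self.deferred.append(job)
        return "deferred"

    def next_batch(self) -> list[Job]:
        """Pop up to batch_size deferred jobs, oldest first, for a batch endpoint."""
        count = min(self.batch_size, len(self.deferred))
        return [self.deferred.popleft() for _ in range(count)]


dispatcher = Dispatcher(batch_size=2)
print(dispatcher.submit(Job("chat_reply", user_waiting=True)))
print(dispatcher.submit(Job("nightly_report", user_waiting=False)))
```

Forcing every caller to set `user_waiting` turns an implicit assumption of urgency into an explicit decision.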

Caching deterministic results: skip the model entirely

Some calls to an LLM produce the same output for the same input every time: entity extraction from fixed templates, format normalisation, routing logic applied to a closed set of inputs. These are not model tasks; they are lookup tables waiting to be written.
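A lookup table of this kind can be a small memoisation layer in front of the model call. The sketch below is an in-memory illustration; `ResultCache` and its method names are hypothetical, and `compute` stands in for whatever function actually calls the model.

```python
import hashlib
import json


class ResultCache:
    """Memoise outputs for calls whose result depends only on their input."""

    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(capability: str, payload: dict) -> str:
        # Canonical JSON so logically equal payloads map to the same key.
        blob = json.dumps({"cap": capability, "in": payload}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_compute(self, capability, payload, compute):
        key = self._key(capability, payload)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(payload)  # the model call happens only on a miss
        self._store[key] = result
        return result


cache = ResultCache()
normalise = lambda p: p["date"].replace("/", "-")  # stand-in for a model call
cache.get_or_compute("normalise_date", {"date": "2024/01/31"}, normalise)
cache.get_or_compute("normalise_date", {"date": "2024/01/31"}, normalise)
print(cache.hits, cache.misses)
```

This only pays off when the output really is a pure function of the input, so it is worth confirming that on logged traffic before enabling it for a capability.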

Per-capability budgets: measure cost where work happens

Cost-per-token is the wrong unit for product decisions. The right unit is cost per task: what does it cost to complete one summarisation, one routing decision, one draft generation?

When budgets are tracked at the capability level, you can set limits that match business value. A low-value background task gets a strict token budget; a revenue-generating synthesis gets a generous one. Neither needs to borrow from the other.

This is where an MCP-first architecture pays a structural dividend. Because capabilities are named, typed, and independently invoked, every tool call carries an identity. You can attach model choice, context limits, caching policy, and spend caps directly to the capability definition, and no caller needs to know. Swap a frontier model for a cheaper one behind a capability boundary and every agent that uses it inherits the saving automatically.
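As a rough illustration of policy living on the capability, the sketch below attaches model tier, context limit, caching flag, and a spend cap to named capabilities. Everything here is an assumption made for the example: the capability names, the tier labels, the dollar figures, and the `CapabilityPolicy` and `SpendTracker` types are not part of MCP or of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityPolicy:
    model: str                # placeholder tier label, not a real model ID
    max_context_tokens: int
    cache_prefix: bool
    daily_spend_cap_usd: float


# Hypothetical capabilities: one low-value background task, one high-value synthesis.
POLICIES = {
    "classify_ticket": CapabilityPolicy("small-tier", 2_000, True, 5.00),
    "draft_proposal": CapabilityPolicy("frontier-tier", 60_000, True, 200.00),
}


class BudgetExceeded(Exception):
    pass


class SpendTracker:
    """Track spend per capability and refuse charges that would exceed its cap."""

    def __init__(self, policies):
        self.policies = policies
        self.spent = {name: 0.0 for name in policies}

    def charge(self, capability: str, cost_usd: float) -> None:
        policy = self.policies[capability]
        if self.spent[capability] + cost_usd > policy.daily_spend_cap_usd:
            raise BudgetExceeded(capability)
        self.spent[capability] += cost_usd


tracker = SpendTracker(POLICIES)
tracker.charge("classify_ticket", 4.00)
tracker.charge("draft_proposal", 50.00)
print(tracker.spent)
```

Because callers refer to a capability only by name, changing the `model` field in one policy entry changes the tier for every agent that uses it.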

This is described in more detail in the manifesto: the capability layer is the right place for operational policy, not just business logic.

What to track

Three metrics capture most of what you need:

  1. Cost per task type: which capabilities are expensive relative to their value?
  2. Cache hit rate: are your stable prefixes actually stable in practice?
  3. Model tier distribution: what fraction of calls are landing on each tier?

When these numbers are visible, optimisation becomes routine rather than reactive.
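All three metrics can be derived from a plain call log. The sketch below assumes a hypothetical log record with `task_type`, `tier`, `cost_usd`, `prompt_tokens`, and `cached_tokens` fields, and defines cache hit rate as the token-weighted share of prompt tokens served from cache, which is one reasonable definition among several.

```python
from collections import Counter, defaultdict


def summarise(calls: list[dict]) -> dict:
    """Compute cost per task type, cache hit rate, and tier distribution."""
    cost_by_task = defaultdict(float)
    count_by_task = Counter()
    tiers = Counter()
    cached = prompt = 0
    for c in calls:
        cost_by_task[c["task_type"]] += c["cost_usd"]
        count_by_task[c["task_type"]] += 1
        tiers[c["tier"]] += 1
        cached += c["cached_tokens"]
        prompt += c["prompt_tokens"]
    total = len(calls)
    return {
        # Average cost of completing one task of each type.
        "cost_per_task": {t: cost_by_task[t] / count_by_task[t] for t in cost_by_task},
        # Share of prompt tokens that were served from the provider's cache.
        "cache_hit_rate": cached / prompt if prompt else 0.0,
        # Fraction of calls landing on each model tier.
        "tier_share": {t: n / total for t, n in tiers.items()},
    }


sample = [
    {"task_type": "routing", "tier": "small", "cost_usd": 0.001,
     "prompt_tokens": 1000, "cached_tokens": 800},
    {"task_type": "routing", "tier": "small", "cost_usd": 0.003,
     "prompt_tokens": 1000, "cached_tokens": 0},
    {"task_type": "synthesis", "tier": "frontier", "cost_usd": 0.25,
     "prompt_tokens": 2000, "cached_tokens": 1200},
]
report = summarise(sample)
print(report)
```

Running this over a day of logs per capability is usually enough to show which of the levers above is worth pulling first.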

Right-sizing LLM spend is not about using worse models; it’s about using the right model, with the right context, exactly once per task.

The practical conclusion