Choosing and upgrading LLM models without the hype

A new model family drops every few weeks. Each announcement comes with a chart where its line curves upward in the top-right corner, and a list of benchmarks you have never heard of. Meanwhile your production system still runs on the one you picked nine months ago, because nobody has had the time to evaluate anything new, and nobody is confident the switch won’t break something.

This is the real model selection problem. Not “which is smartest” but “how do we decide, safely, with limited time, without betting the system on a benchmark somebody else designed for somebody else’s task?”

Start with your tasks, not someone else’s leaderboard

Public benchmarks are built by labs to evaluate labs. They measure aggregate reasoning ability, multilingual fluency, code completion, and a dozen other properties that may have nothing to do with what your agent actually does.

Your agent summarises support tickets. Or drafts contract clauses. Or routes incoming requests to the right tool. Those tasks have concrete, testable expectations, and nobody has a leaderboard for them.

The first step in model selection is to write down your three to five most critical tasks, then build a small private eval set for each: twenty to fifty representative inputs, with at least one rubric per input that a non-expert could apply. That set becomes your ground truth. It doesn’t need to be exhaustive; it needs to be yours.
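
One way to make that eval set concrete is sketched below. It is a minimal illustration, not a prescribed format: the `EvalCase` fields, the `pass_rate` helper, the sample cases, and the `fake_model` stand-in are all hypothetical names introduced here. It assumes each rubric can be approximated by a simple automated check; in practice some rubrics will need a human reviewer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    task: str                     # one of your critical tasks, e.g. "ticket_summary"
    input_text: str               # a representative input
    rubric: str                   # plain-language criterion a non-expert could apply
    check: Callable[[str], bool]  # automated approximation of the rubric (assumption)


def pass_rate(cases: list[EvalCase], generate: Callable[[str], str]) -> dict[str, float]:
    """Run every case through `generate` and return the pass fraction per task."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for case in cases:
        output = generate(case.input_text)
        total[case.task] = total.get(case.task, 0) + 1
        passed[case.task] = passed.get(case.task, 0) + int(case.check(output))
    return {task: passed[task] / total[task] for task in total}


# Illustrative cases only; a real set would hold twenty to fifty per task.
CASES = [
    EvalCase("ticket_summary", "Customer cannot log in after password reset.",
             "Summary mentions the login problem.",
             lambda out: "log in" in out.lower() or "login" in out.lower()),
    EvalCase("ticket_summary", "Invoice 1042 was charged twice.",
             "Summary mentions the duplicate charge.",
             lambda out: "twice" in out.lower() or "duplicate" in out.lower()),
    EvalCase("routing", "Please refund my last order.",
             "Request is routed to the billing tool.",
             lambda out: out.strip() == "billing"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    if "refund" in prompt.lower():
        return "billing"
    return "Summary: " + prompt


if __name__ == "__main__":
    print(pass_rate(CASES, fake_model))
```

Because `generate` is just a function from input to output, the same cases can be re-run unchanged against every candidate model.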

The axes that actually matter

Once you have your eval tasks, you can score candidate models on the dimensions that your system cares about. Those dimensions are rarely just “quality.”

Decision criteria for model selection

Quality on your tasks: score on your private eval set, not published benchmarks
Latency profile: p50 and p95 response times under realistic load
Cost at your scale: per-call and monthly at projected token volumes
Context window fit: does it reliably handle your largest documents?
Tool-call reliability: does it follow your schemas consistently, without hallucinated arguments?
Data and compliance terms: where does inference run, is input retained, what does the DPA say?
Upgrade stability: does the vendor version-lock or silently update the model?
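
The latency and cost criteria above reduce to simple arithmetic. The sketch below shows one way to compute them, using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values). The function names, the sample latencies, and the per-million-token prices are made up for illustration; substitute your own measurements and your vendor's actual price list.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def monthly_cost(calls_per_day: int, input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float, days: int = 30) -> float:
    """Projected monthly spend for an average call shape, with prices quoted per million tokens."""
    per_call = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return per_call * calls_per_day * days


if __name__ == "__main__":
    latencies_ms = [120, 150, 180, 200, 220, 250, 300, 400, 900, 2500]  # hypothetical measurements
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p95:", percentile(latencies_ms, 95), "ms")
    # Hypothetical prices: 1.00 USD per million input tokens, 4.00 USD per million output tokens.
    print("monthly USD:", round(monthly_cost(10_000, 1_500, 300, 1.0, 4.0), 2))
```

In this made-up sample the median looks healthy while the p95 is dominated by a single slow call, which is why the table asks for both figures.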

Not every axis carries equal weight. A customer-facing chat feature cares about latency far more than a nightly batch summariser. A legal workflow cares about data residency in ways a public-facing content tool does not. Weight the axes before you score them.
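
A weighted sum is one simple way to apply those weights. The sketch below assumes every axis has already been normalised to a 0 to 1 score where higher is better (so a slow or expensive model gets a low latency or cost score). The weights, the scores, and the names `weighted_score`, `MODEL_A`, and `MODEL_B` are invented for illustration.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-axis scores; every weighted axis must have a score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for axes: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[axis] * scores[axis] for axis in weights) / total_weight


# Hypothetical weightings for two different products.
CHAT_WEIGHTS = {"quality": 0.4, "latency": 0.5, "cost": 0.1}     # customer-facing chat
BATCH_WEIGHTS = {"quality": 0.8, "latency": 0.05, "cost": 0.15}  # nightly batch summariser

# Hypothetical normalised scores (1.0 is best on every axis).
MODEL_A = {"quality": 0.95, "latency": 0.4, "cost": 0.5}  # stronger but slower and pricier
MODEL_B = {"quality": 0.75, "latency": 0.9, "cost": 0.8}  # weaker but fast and cheap

if __name__ == "__main__":
    for label, weights in (("chat", CHAT_WEIGHTS), ("batch", BATCH_WEIGHTS)):
        a = weighted_score(MODEL_A, weights)
        b = weighted_score(MODEL_B, weights)
        print(f"{label}: model A {a:.3f}, model B {b:.3f}")
```

With these invented numbers the ranking flips between the two products: the same pair of models gives a different winner once the weights reflect the context.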

Why blind trust in public leaderboards is dangerous

Leaderboards have two structural problems. First, they are easy to overfit: a model trained with knowledge of the benchmark’s format and topic distribution will score higher without becoming more useful. Second, they measure general ability, which aggregates across tasks in ways that can mask weakness on the specific things you need.

A frontier model that tops a public coding benchmark may still produce invalid JSON tool calls when your schema has optional nested arrays. A small, fine-tuned, open-weight model may outperform it on your exact classification task at a fraction of the operating cost.

Neither of those facts will appear on the leaderboard chart.
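
Tool-call reliability is something you can measure yourself. The sketch below checks raw model output against a hypothetical `create_ticket` tool whose `attachments` argument is an optional array of objects, each with its own optional `tags` array. The tool, its fields, and the validator are assumptions made for illustration; it uses only the standard library, and a real system might prefer a JSON Schema validator.

```python
import json

ALLOWED_ARGS = {"title", "priority", "attachments"}
ALLOWED_PRIORITIES = {"low", "normal", "high"}


def validate_create_ticket(raw: str) -> list[str]:
    """Return a list of problems with a raw tool call; an empty list means it is valid."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    unknown = set(args) - ALLOWED_ARGS  # arguments the schema never defined
    if unknown:
        problems.append(f"unexpected arguments: {sorted(unknown)}")
    if not isinstance(args.get("title"), str):
        problems.append("title must be a string")
    priority = args.get("priority")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        problems.append("priority must be one of low, normal, high")

    attachments = args.get("attachments", [])  # optional: absent means empty
    if not isinstance(attachments, list):
        problems.append("attachments must be an array")
        return problems
    for i, item in enumerate(attachments):
        if not isinstance(item, dict) or not isinstance(item.get("url"), str):
            problems.append(f"attachments[{i}] needs a string url")
            continue
        tags = item.get("tags", [])  # optional nested array
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            problems.append(f"attachments[{i}].tags must be an array of strings")
    return problems


def failure_rate(raw_calls: list[str]) -> float:
    """Fraction of raw tool calls that fail validation."""
    if not raw_calls:
        return 0.0
    return sum(1 for raw in raw_calls if validate_create_ticket(raw)) / len(raw_calls)
```

Running `failure_rate` over the tool calls each candidate produces on your eval inputs gives a per-model reliability figure that no public chart reports.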

Piloting, not wholesale switching

When a candidate model scores well on your eval set, resist the temptation to immediately cut over your entire system. Instead, shadow-route a slice of real traffic, perhaps five to ten percent, to the new model while keeping the existing one live. Compare outcomes on metrics you already track: task completion rate, downstream error rate, user escalation rate, whatever your system uses as a signal of quality.

Two to four weeks of shadow traffic at modest volume will surface edge cases your eval set missed. It will also confirm whether the latency you measured in staging holds under production patterns.
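
The shadow slice can be sketched as below. This is one interpretation of shadow routing, assumed here for illustration: the existing model always serves the user, and for a deterministic share of request ids the candidate also receives the same prompt so the two answers can be compared offline. `in_shadow_slice`, `handle`, and the log format are hypothetical names, and a production version would call the candidate asynchronously so it cannot add latency.

```python
import hashlib
from typing import Callable


def in_shadow_slice(request_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of request ids in the shadow slice."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


def handle(request_id: str, prompt: str,
           primary: Callable[[str], str], candidate: Callable[[str], str],
           log: list[dict], percent: float = 5.0) -> str:
    """Serve the primary model's answer; record the candidate's answer for sampled requests."""
    answer = primary(prompt)
    if in_shadow_slice(request_id, percent):
        try:
            shadow = candidate(prompt)
            log.append({"request_id": request_id, "primary": answer,
                        "candidate": shadow, "match": answer == shadow})
        except Exception as exc:  # a candidate failure must never reach the user
            log.append({"request_id": request_id, "candidate_error": repr(exc)})
    return answer
```

Hashing the request id rather than drawing a random number keeps the slice reproducible: the same request always lands on the same side, which makes discrepancies easier to investigate later.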

Only after that pilot produces stable numbers should you widen the rollout. See LLM cost optimisation for how to structure tiered routing once you have multiple validated models in play.

The architectural punchline: put models behind your capability layer

None of the above matters as much as it should unless your architecture makes model swaps cheap. If your agent’s business logic (which tools it calls, which policies it checks, which data it reads) is tangled with a specific model’s client library or prompt format, then every model upgrade becomes a refactoring project.

The MCP-first answer is to keep models behind a clean capability layer. Your system exposes tools, resources, and actions as structured, model-agnostic contracts. The model is a caller of those contracts, not a load-bearing part of their definition. Swapping the model is then a configuration change: point to the new endpoint, update the system prompt if needed, re-run evals. The capability definitions, the policy checks, and the audit trail don’t move.

This is the same principle that makes the architecture resilient to infrastructure changes generally. The model is infrastructure. Treat it like infrastructure.
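
A minimal sketch of that separation follows. It is plain Python rather than any particular MCP SDK, and every name in it (`ToolCall`, `Model`, `CapabilityLayer`, the two toy models, the `MODELS` registry) is hypothetical. The point it illustrates is that tools, the policy check, and the audit trail live in one layer, while the model is selected by a configuration string.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


class Model(Protocol):
    """The only thing the system requires of a model: turn a request into a tool call."""
    def choose_tool(self, request: str, tool_names: list[str]) -> ToolCall: ...


class CapabilityLayer:
    """Owns tools, policy checks, and the audit trail; knows nothing about any model."""
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., str]] = {}
        self.audit_log: list[ToolCall] = []

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def tool_names(self) -> list[str]:
        return sorted(self._tools)

    def execute(self, call: ToolCall) -> str:
        if call.name not in self._tools:  # policy check: only registered tools may run
            raise PermissionError(f"unknown tool: {call.name}")
        self.audit_log.append(call)       # audit trail is independent of the caller
        return self._tools[call.name](**call.arguments)


class KeywordModel:
    """Toy stand-in for one vendor's model."""
    def choose_tool(self, request: str, tool_names: list[str]) -> ToolCall:
        name = "refund_order" if "refund" in request.lower() else "lookup_order"
        return ToolCall(name, {"order_id": "A1"})


class AlwaysLookupModel:
    """Toy stand-in for a different model with different behaviour."""
    def choose_tool(self, request: str, tool_names: list[str]) -> ToolCall:
        return ToolCall("lookup_order", {"order_id": "A1"})


MODELS: dict[str, Callable[[], Model]] = {"model-a": KeywordModel, "model-b": AlwaysLookupModel}


def run_agent(request: str, layer: CapabilityLayer, model_name: str) -> str:
    """`model_name` would come from configuration; nothing else changes on a swap."""
    model = MODELS[model_name]()
    return layer.execute(model.choose_tool(request, layer.tool_names()))


def build_layer() -> CapabilityLayer:
    layer = CapabilityLayer()
    layer.register("lookup_order", lambda order_id: f"status of {order_id}: shipped")
    layer.register("refund_order", lambda order_id: f"refunded {order_id}")
    return layer
```

Calling `run_agent("Please refund order A1", build_layer(), "model-a")` and then the same request with `"model-b"` exercises two different models against an identical capability layer, so the eval set can compare them without touching tool code.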

The architectural rule: if swapping your model requires a code change, your model is too deeply embedded in your system. Capability definitions should outlive any individual model.

Model selection is not a one-time decision. New families will keep appearing, costs will keep falling, and the task that required a frontier model today may run adequately on a smaller one six months from now. Build the eval discipline, weight the axes that fit your context, pilot before committing, and above all, build a system where the model is always replaceable.
