On-device or cloud? Choosing where your models run
A decision framework for splitting AI workloads between local and cloud models: privacy, latency, cost, and capability, plus how sensitivity should route the data.
Every AI workload carries an implicit question that most teams answer by accident: where does the inference actually happen? The model provider’s data centre, a self-hosted server, or a chip sitting inside the device that holds the data. Each answer carries a different bundle of tradeoffs, and confusing them is a source of both unnecessary cost and unnecessary risk.
The four axes that matter
Privacy and data residency. When you send a prompt to a cloud model, the payload leaves your trust boundary. For most content that is fine. For payroll records, health notes, legal drafts, or any data governed by a residency regulation, it may not be. On-device or self-hosted inference keeps the bytes where they started.
Latency. Network round-trips add time. A cloud call typically adds tens to hundreds of milliseconds before a token appears: acceptable for a scheduled report, noticeable for a streaming chat interface, and a hard blocker for anything embedded in a synchronous user action. Local inference, once the model is loaded, has no network floor.
Offline capability. Devices in hospitals, aircraft, industrial floor environments, or simply bad network conditions cannot depend on reachability. An on-device model runs whether or not there is a signal. A cloud model does not.
Raw capability ceiling. Frontier models running on large cloud infrastructure still substantially outperform what fits on a consumer device at comparable speed. That gap is narrowing (capable, quantised small models are a real and growing category), but it has not closed. Complex reasoning, long-context synthesis, and nuanced instruction-following still favour the cloud end of the spectrum.
Hybrid routing: spend inference where it earns its cost
The practical answer for most systems is not “on-device” or “cloud” but a routing policy that sends each request to the right tier. Cheap, well-defined, high-volume tasks (classification, extraction, summarisation of short text, formatting) are strong candidates for a smaller local model. Rare, high-stakes, or genuinely hard tasks justify the latency and cost of a frontier call.
Cost compounds quickly in agent systems. An orchestrator that fires a frontier model call for every sub-step in a workflow pays the premium on every hop, not just the final answer. Assuming local inference is far cheaper per call, routing half of those calls to a capable local tier roughly halves the inference bill, and if nine in ten sub-steps are routine the saving approaches an order of magnitude, without touching output quality for the routine steps.
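To make that arithmetic concrete, here is a minimal sketch of the blended cost per call when a share of sub-steps moves to a local tier. The prices are made-up placeholders, not quotes from any provider, and `savings_factor` is a hypothetical helper name.

```python
def savings_factor(frontier_price: float, local_price: float, local_share: float) -> float:
    """Return how many times cheaper the blended workload is than all-frontier.

    frontier_price / local_price: cost per call on each tier (hypothetical units).
    local_share: fraction of calls routed to the local tier, between 0 and 1.
    """
    if frontier_price <= 0 or local_price <= 0:
        raise ValueError("prices must be positive")
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    # Average cost of one call after routing.
    blended = (1.0 - local_share) * frontier_price + local_share * local_price
    return frontier_price / blended


if __name__ == "__main__":
    # ASSUMPTION: illustrative prices only; local is 20x cheaper per call.
    FRONTIER, LOCAL = 0.01, 0.0005
    for share in (0.5, 0.9, 0.95):
        print(f"{share:.0%} local -> {savings_factor(FRONTIER, LOCAL, share):.1f}x cheaper")
```

With these placeholder prices, sending half the calls to the local tier saves a little under 2x, and the saving only passes 10x once about 95% of calls are local. The share of routine calls matters far more than the exact local price.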
Sensitivity should determine routing, not just cost
This is where most teams leave value on the table. They treat on-device as a cost lever but not a privacy lever. In an MCP-first system (one built around the Model Context Protocol), every resource and action carries risk metadata. That metadata should actively drive where inference happens, not as a suggestion, but as a policy.
The routing table looks something like this:
- public.content (marketing copy, product descriptions): Low
- user.preferences (UI settings, display options): Low
- analytics.aggregate (anonymised usage stats): Medium
- user.email_history (personal correspondence): High
- finance.payroll (salary and compensation data): Restricted
- health.records (medical notes, diagnoses): Restricted
- legal.privileged (attorney-client communications): Forbidden for AI
Resources marked Low or Medium can flow to any model that produces good results, including external cloud endpoints. Resources marked Restricted must stay within a trusted boundary: a local model, a self-hosted endpoint, or an agreed private deployment. Resources marked Forbidden for AI should never be included in a model prompt at all; they are surfaced to humans, not to inference engines.
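One way to express these rules is a lookup from sensitivity level to the set of tiers a resource may reach. The sketch below is illustrative only: the names `Sensitivity`, `Tier`, `route`, and `RoutingDenied` are hypothetical and not part of any MCP SDK. The text does not say how High should be handled, so this sketch assumes it is treated like Restricted, and it assumes unknown resources are denied.

```python
from __future__ import annotations

from enum import Enum


class Sensitivity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    RESTRICTED = "restricted"
    FORBIDDEN = "forbidden_for_ai"


class Tier(Enum):
    EXTERNAL_CLOUD = "external_cloud"
    TRUSTED = "trusted"  # local model, self-hosted endpoint, or private deployment


class RoutingDenied(Exception):
    """Raised when a resource must not be sent to any model."""


# Classification from the routing table above.
RESOURCE_SENSITIVITY = {
    "public.content": Sensitivity.LOW,
    "user.preferences": Sensitivity.LOW,
    "analytics.aggregate": Sensitivity.MEDIUM,
    "user.email_history": Sensitivity.HIGH,
    "finance.payroll": Sensitivity.RESTRICTED,
    "health.records": Sensitivity.RESTRICTED,
    "legal.privileged": Sensitivity.FORBIDDEN,
}

ANY_TIER = frozenset({Tier.EXTERNAL_CLOUD, Tier.TRUSTED})
ALLOWED_TIERS = {
    Sensitivity.LOW: ANY_TIER,
    Sensitivity.MEDIUM: ANY_TIER,
    Sensitivity.HIGH: frozenset({Tier.TRUSTED}),  # ASSUMPTION: not specified in the text
    Sensitivity.RESTRICTED: frozenset({Tier.TRUSTED}),
    Sensitivity.FORBIDDEN: frozenset(),  # never placed in a prompt
}


def route(resource: str, preferred: Tier) -> Tier:
    """Return the tier to use for `resource`, honouring `preferred` when allowed."""
    level = RESOURCE_SENSITIVITY.get(resource)
    # ASSUMPTION: unclassified resources fail closed rather than defaulting to cloud.
    allowed = ALLOWED_TIERS[level] if level is not None else frozenset()
    if not allowed:
        raise RoutingDenied(f"{resource} may not be sent to any model")
    if preferred in allowed:
        return preferred
    # The caller's preference is not permitted; fall back to the trusted boundary.
    return Tier.TRUSTED
```

Here `route("finance.payroll", Tier.EXTERNAL_CLOUD)` falls back to `Tier.TRUSTED`, while `route("legal.privileged", ...)` raises `RoutingDenied` regardless of the preferred tier.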
This is the difference between treating model choice as an operational concern and treating it as part of your risk model. One is tuned for cost. The other is tuned for trust.
What this means in practice
A few principles that hold across architectures:
- Classify before you route. If your capability layer annotates resources with sensitivity levels at definition time, routing becomes a mechanical policy, not a case-by-case judgment made at runtime by an agent.
- Keep the routing rule in the server, not the model. An agent that decides for itself which model to call based on a system prompt cannot be audited or enforced. A routing policy in your MCP server can.
- Test the boundaries explicitly. Run a restricted resource through your routing layer in a test harness and assert it never reaches an external endpoint. This is a compliance test, not just a unit test.
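These three principles can be sketched together: a resource annotated at definition time, a router that owns the rule, and a test that asserts a restricted payload never reaches the external endpoint. Everything here is a hypothetical stand-in (`Resource`, `RecordingEndpoint`, `InferenceRouter`); a real harness would wrap your actual model clients with a similar recording fake.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Resource:
    name: str
    sensitivity: str  # annotated once, when the resource is defined


@dataclass
class RecordingEndpoint:
    """Fake model endpoint that records every prompt it receives."""
    name: str
    seen: List[str] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        self.seen.append(prompt)
        return f"[{self.name}] ok"


class InferenceRouter:
    """The routing rule lives here, in server code, not in a system prompt."""
    EXTERNAL_OK = {"low", "medium"}

    def __init__(self, external: RecordingEndpoint, trusted: RecordingEndpoint):
        self.external = external
        self.trusted = trusted

    def run(self, resource: Resource, payload: str) -> str:
        if resource.sensitivity == "forbidden":
            raise PermissionError(f"{resource.name} must not be sent to a model")
        # Anything not explicitly cleared for external use stays in the trusted tier.
        if resource.sensitivity in self.EXTERNAL_OK:
            return self.external.complete(payload)
        return self.trusted.complete(payload)


def test_restricted_never_reaches_external() -> None:
    external = RecordingEndpoint("external")
    trusted = RecordingEndpoint("trusted")
    router = InferenceRouter(external, trusted)

    payroll = Resource("finance.payroll", "restricted")
    router.run(payroll, "summarise Q3 salaries")

    assert external.seen == []  # the compliance assertion
    assert trusted.seen == ["summarise Q3 salaries"]


if __name__ == "__main__":
    test_restricted_never_reaches_external()
    print("restricted payload stayed inside the trusted boundary")
```

Because the endpoints are injected, the same router code runs in production and in the harness, so the assertion exercises the real rule rather than a copy of it.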
The decision in one sentence
Let capability ceiling pull workloads toward the cloud; let sensitivity push them back to the edge.
The on-device versus cloud question is ultimately a tension between what a model can do and what it should see. Getting that balance right is less about picking the best model and more about building the infrastructure that routes data to the right one. The resources section has more on capability classification and how to model sensitivity in a compliant system.