Self-hosting open models without a GPU farm
When local inference makes sense, and how quantization and right-sizing let capable open-weight models run on modest hardware, with the tradeoffs spelled out.
The phrase “self-hosted AI” used to conjure images of a server room full of expensive accelerators and a dedicated ML infrastructure team to keep them alive. That picture has quietly become outdated. Open-weight model families have matured, quantization tooling has improved dramatically, and inference runtimes designed for consumer hardware have moved from research curiosity to production-grade software. Running a capable model locally is now an engineering decision, not a capital expenditure.
That does not mean it is always the right decision. But when it is, understanding the mechanics, and the honest limits, is the difference between a system that works and one that disappoints.
Why run locally at all
Before reaching for a self-hosted model, it helps to be clear-eyed about which pressures actually motivate it. There are four worth taking seriously.
-
Data residencysome data legally or contractually cannot leave your infrastructure -
Cost at volumehigh-frequency, low-complexity tasks get cheaper as you scale inference yourself -
Offline operationfactory floors, aircraft, medical devices, and remote sites can't depend on connectivity -
Operational controlno upstream API changes, rate limits, or deprecations to react to
If none of these apply strongly, a managed cloud model is almost certainly the simpler path. The operational burden of running your own inference server, model management, hardware provisioning, updates, monitoring, is real and should not be understated. Self-hosting earns its keep when the alternative forces you to compromise on trust, cost, or reliability. Not before.
Quantization: the key lever
Most open-weight models are released with full-precision weights, typically 16-bit floating point. These weights are accurate but large. A model with several billion parameters can require tens of gigabytes of memory just to hold the weights, before any inference overhead. For consumer GPUs or CPU-only hardware, that is a hard wall.
Quantization is the technique that moves the wall. Rather than storing each weight as a 16-bit float, quantized representations use 8-bit integers, 4-bit integers, or increasingly creative mixed-precision schemes that reduce memory footprint by a factor of two to four or more. The model is compressed; the tradeoff is that some numerical precision is lost in the process.
In practice, lightly quantized weights, in the 8-bit range, often produce output that is perceptually indistinguishable from full precision on well-defined tasks: structured extraction, classification, short summarisation, code generation with clear specifications. Heavily quantized weights are more likely to show degradation on tasks requiring nuanced reasoning, long-range coherence, or reliable instruction-following under complex constraints.
The implication is that quantization only makes sense in context of the task. The same quantized model that works excellently for invoice extraction may produce inconsistent results when asked to reason through an ambiguous multi-step problem. Knowing where the capability falls off is not optional, it is part of using these tools responsibly.
Right-sizing: the model should fit the task, not your ambition
Open-weight model families span a wide range of scales. Smaller models start inference faster, consume less memory, and produce output more quickly, at the cost of raw capability. Larger models within the same family handle harder tasks but require significantly more memory and are slower per token on the same hardware.
The mistake most teams make is defaulting to the largest model they can load, on the assumption that larger is always better. For many workloads, this is simply wrong.
A smaller model that fits comfortably in available VRAM with room to spare will run inference at full throughput and handle concurrent requests without contention. A larger model that barely fits will thrash. And a task that does not require sophisticated reasoning, format normalisation, entity tagging, routing signals, short Q&A on structured data, is not improved by throwing more parameters at it. It just costs more compute and runs slower.
The practical heuristic is to start with the smallest model that produces acceptable output on your actual task distribution, measured against real examples, not synthetic benchmarks. Scale up only when evaluation shows a gap that matters. This is the same logic as right-sizing cloud model calls, applied to local infrastructure.
Inference runtimes and serving
A quantized model file is not a running service. Between the weights and a callable API sits an inference runtime: software that loads the model, manages memory, batches requests, and exposes an interface your application can call.
Several runtimes have matured into solid options for different hardware profiles. Some are optimized for NVIDIA GPU acceleration, others run efficiently on CPU alone or on Apple Silicon unified memory, and others handle mixed GPU/CPU offloading when a model does not fully fit in VRAM. The ecosystem is active and the tooling has improved substantially.
What to look for when evaluating a runtime:
-
Hardware compatibilityGPU vendor, CPU-only fallback, unified memory support -
Quantization format supportnot all runtimes support all quantization schemes -
Concurrency modelhow does it handle simultaneous requests at your expected load? -
API surfaceOpenAI-compatible endpoints make swapping models transparent to callers -
Observabilitytoken throughput, queue depth, memory usage, you need these in production
For teams running an MCP-first architecture, the clearest win is wrapping the local inference endpoint behind a capability boundary. The caller, an agent, an orchestrator, a tool, invokes a named capability and never needs to know whether the model underneath is local or cloud. Swap a cloud endpoint for a local one behind that boundary and the change is invisible to every upstream system. This is the right layer to enforce which requests may go local and which must not, connecting directly to data protection classes and the residency policies those classes define.
The honest tradeoffs
Self-hosting open-weight models is not a free lunch. The capability ceiling is real: frontier cloud models still meaningfully outperform what you can run locally on modest hardware, particularly on tasks requiring extended reasoning chains, nuanced judgment, or reliable performance across diverse edge cases. Quantization adds headroom but does not close this gap.
The operational burden is also non-trivial. Provisioning hardware, managing model versions, monitoring for degradation, handling out-of-memory conditions under load, and updating runtimes as the ecosystem evolves, these are engineering costs that a managed API offloads entirely. Teams without dedicated infrastructure capacity should factor this in honestly before committing to self-hosting.
That said, when the combination of data residency requirements, volume economics, and task fit lines up correctly, a well-configured local model running quantized weights on reasonably modern hardware is a serious production option. The tooling has matured enough that the question is no longer whether it can be done, it is whether the constraints of your specific situation make it worth doing.
Self-hosting open models earns its complexity when the data is too sensitive for the cloud, the volume is too high for cloud pricing, or the network is too unreliable for a cloud dependency, and not otherwise.
The path from curiosity to production starts the same way as any capability decision: understand the task, measure what good output looks like, and let the constraints point to the right tier. For a deeper look at how sensitivity should drive those routing decisions end-to-end, the on-device vs cloud decision framework covers the full spectrum.