RAG: giving models the right context, not all of it
Retrieval-augmented generation explained: chunking, embedding, retrieval, and the discipline of feeding a model only the context a task actually needs.
A language model knows what it was trained on. It does not know what happened in your database last Tuesday. Retrieval-augmented generation (RAG) is how you close that gap without turning every prompt into an encyclopedia dump.
The idea is simple: before asking the model a question, fetch the pieces of your data that are actually relevant to that question, and include only those pieces in the prompt. The model reasons over real information rather than confabulating from stale weights. Done well, RAG makes answers more accurate and more grounded, and it is usually far cheaper than alternatives like full fine-tuning or filling an ever-larger context window.
A model doesn’t need all your data. It needs the right slice of it, right now.
The four-step pipeline
Every RAG implementation, regardless of tooling, follows the same shape.
Chunk. Raw documents are split into smaller units: paragraphs, sections, or sliding windows of sentences. The unit size matters more than most teams expect. Chunks that are too large pull in irrelevant sentences that dilute the signal. Chunks that are too small lose the surrounding context that gives a sentence meaning. The right size depends on the content type: dense technical prose often needs larger chunks than FAQ entries.
Embed. Each chunk is passed through an embedding model to produce a vector, a compact numerical representation of the chunk’s meaning. Semantically similar chunks end up close together in this vector space, which is what makes retrieval by meaning rather than keyword possible. The embedding model you choose should match the domain; a model trained on general web text may not capture the vocabulary of your internal engineering docs.
Store. The vectors (and the original text they came from) live in a vector store. Indexes are built over the vectors so that nearest-neighbour queries can run fast at scale. For a deeper look at how these stores work, see the vector databases and embeddings guide.
Retrieve and ground. At inference time, the user’s query is embedded with the same model, and the store returns the top-k most similar chunks. Those chunks are injected into the prompt as context. The model reads them and produces an answer grounded in that material rather than in vague training-time associations.
- Chunking: signal density; too big loses precision, too small loses meaning
- Embedding: semantic resolution; domain match matters
- Vector store: retrieval speed and freshness
- Top-k retrieval: how many chunks reach the prompt
- Prompt assembly: how context is ordered and framed for the model
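To make the shape of the pipeline concrete, here is a minimal sketch in Python. It is an illustration under loud assumptions, not a production design: `embed` is a toy word-count stand-in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory list rather than a real vector database, and the sample documents are invented.

```python
import math
import re
from collections import Counter


def chunk(text: str) -> list[str]:
    """Chunk: split on blank lines so each paragraph becomes one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text: str) -> Counter:
    """Embed: toy stand-in for an embedding model (lowercase word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Store: keeps (vector, original text) pairs in memory (hypothetical)."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.rows.append((embed(text), text))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        """Retrieve: embed the query the same way, return the k closest chunks."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Ground: inject only the retrieved chunks into the prompt."""
    joined = "\n\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# Invented sample documents, for illustration only.
DOCS = """Refunds are issued within 14 days of a return request.

Shipping to EU countries takes 3 to 5 business days.

Support is available by email on weekdays."""

store = ToyVectorStore()
for piece in chunk(DOCS):
    store.add(piece)

question = "How long does shipping take?"
hits = store.top_k(question, k=1)
prompt = build_prompt(question, hits)
```

With `k=1`, only the shipping paragraph reaches the prompt; the refund and support paragraphs are never sent to the model. A real embedding model would match on meaning rather than shared words, but the flow of data is the same.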
Where RAG goes wrong
The pipeline looks straightforward. In practice, three failure modes account for most degraded RAG outputs.
Bad chunking. Splitting at arbitrary token counts, rather than at semantic boundaries like headings or paragraphs, creates chunks that start mid-thought or end before the key fact. The retrieved chunk looks relevant by embedding distance but is missing the sentence that actually answers the question.
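The difference is easy to see on a small example. The sketch below compares a fixed-size splitter with one that only breaks at paragraph boundaries. Both functions, the `max_words` budget, and the sample text are hypothetical, and word counts stand in for token counts.

```python
import re


def chunk_fixed(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` words, ignoring document structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_words` words.

    A paragraph is never split, so a single paragraph longer than
    `max_words` is kept whole as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Invented sample text, for illustration only.
SAMPLE = """Rotating API keys

Keys expire after 90 days. Rotate them from the admin console.

Rate limits

Each key is limited to 100 requests per minute."""

fixed = chunk_fixed(SAMPLE, size=8)
semantic = chunk_by_paragraph(SAMPLE, max_words=14)
```

Here `fixed[1]` is "Rotate them from the admin console. Rate limits": it starts mid-thought and ends with a heading from the next section. The paragraph-aware version returns two chunks, each a heading together with the text it introduces.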
Retrieving too much. There is a temptation to raise k and throw everything at the model “just in case.” The result is a bloated prompt where the genuinely relevant passage is buried in marginally related context. Models attend unevenly over long contexts, so the useful signal gets diluted. Tight, targeted retrieval tends to outperform broad retrieval, both in quality and in cost.
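One way to keep retrieval tight is to treat k as a ceiling rather than a quota. The sketch below is one possible approach, not a standard recipe: it assumes the store has already returned `(score, text)` pairs, and the `min_score` and `max_words` values are hypothetical defaults that would need tuning against real queries.

```python
def select_context(
    scored_chunks: list[tuple[float, str]],
    k: int = 3,
    min_score: float = 0.3,
    max_words: int = 200,
) -> list[str]:
    """Keep at most k chunks, and stop early at weak matches or a full budget."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    selected: list[str] = []
    used = 0
    for score, text in ranked[:k]:
        # Everything after the first weak match is weaker still.
        if score < min_score:
            break
        words = len(text.split())
        # Word count is a rough stand-in for a token budget.
        if used + words > max_words:
            break
        selected.append(text)
        used += words
    return selected


# Invented similarity scores, for illustration only.
candidates = [
    (0.12, "Our office is in Lisbon."),
    (0.82, "Refunds are issued within 14 days."),
    (0.08, "We were founded in 2011."),
    (0.35, "Returns need a receipt."),
]

context = select_context(candidates, k=3, min_score=0.3)
```

Even though `k=3`, only the two chunks above the threshold are kept. The third-ranked chunk would have filled the slot with noise.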
Stale indexes. If the underlying data changes and the index is not updated, retrieval returns answers grounded in yesterday’s state of the world. For live operational data (prices, inventory, user records), index freshness is as important as retrieval accuracy. Either re-index on write or accept that some queries need a real-time lookup rather than a vector search.
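A minimal sketch of the re-index-on-write option, assuming each document has a stable ID: store a content hash next to each vector, re-embed only when the hash changes, and use the same hashes to detect drift. The `FreshIndex` class and its methods are hypothetical, and the `embed` argument is whatever embedding function you use (the example passes `len` as a placeholder).

```python
import hashlib
from typing import Callable


def fingerprint(text: str) -> str:
    """Content hash used to tell whether a document changed since indexing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class FreshIndex:
    """Hypothetical index wrapper that tracks what text each vector came from."""

    def __init__(self, embed: Callable[[str], object]) -> None:
        self._embed = embed
        self._entries: dict[str, tuple[str, object]] = {}  # id -> (hash, vector)

    def upsert(self, doc_id: str, text: str) -> bool:
        """Call on every write. Returns True only if re-embedding was needed."""
        digest = fingerprint(text)
        existing = self._entries.get(doc_id)
        if existing is not None and existing[0] == digest:
            return False  # unchanged, skip the embedding call
        self._entries[doc_id] = (digest, self._embed(text))
        return True

    def delete(self, doc_id: str) -> None:
        """Call when the source record is removed."""
        self._entries.pop(doc_id, None)

    def stale_ids(self, source: dict[str, str]) -> list[str]:
        """IDs whose current source text is missing from or differs from the index."""
        return sorted(
            doc_id
            for doc_id, text in source.items()
            if doc_id not in self._entries
            or self._entries[doc_id][0] != fingerprint(text)
        )


# Invented records; `len` is a placeholder for a real embedding function.
source = {"sku-1": "Price: 10 EUR", "sku-2": "Price: 25 EUR"}
index = FreshIndex(embed=len)
for doc_id, text in source.items():
    index.upsert(doc_id, text)

source["sku-1"] = "Price: 12 EUR"        # the data changes...
drifted = index.stale_ids(source)        # ...and the index no longer matches
reembedded = index.upsert("sku-1", source["sku-1"])
```

`stale_ids` can also run as a periodic check when writes cannot be hooked directly. For data that changes faster than you can re-embed it, a direct lookup remains the safer path.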
The discipline of relevant, not exhaustive
The failure modes above share a root cause: treating retrieval as “get as much as possible” rather than “get exactly what is needed.” RAG is most powerful when it is treated as a filtering problem. The vector store is not a shortcut to dump your whole knowledge base into the prompt; it is a precision instrument for isolating the fragment of your data that makes a particular answer trustworthy.
This discipline extends beyond accuracy. Retrieving only the relevant context also means retrieving only the context the caller is entitled to see. A well-structured retrieval layer can enforce access scoping at query time: a user in one team retrieves only chunks tagged to their namespace. The model never sees data it shouldn’t, because that data was never fetched.
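Query-time scoping can be sketched as a filter applied before ranking. The `Chunk` type, the `namespace` tag, and the precomputed `score` field are all assumptions for illustration; many vector stores offer their own metadata filters, and their exact APIs differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    namespace: str   # hypothetical tag: which team owns this chunk
    score: float     # similarity to the current query, precomputed for brevity


def retrieve_scoped(
    chunks: list[Chunk], allowed_namespaces: set[str], k: int = 3
) -> list[str]:
    """Filter by entitlement first, then rank, so hidden chunks never compete."""
    visible = [c for c in chunks if c.namespace in allowed_namespaces]
    visible.sort(key=lambda c: c.score, reverse=True)
    return [c.text for c in visible[:k]]


# Invented chunks and scores, for illustration only.
corpus = [
    Chunk("Q3 salary bands were revised.", namespace="hr", score=0.91),
    Chunk("Deploys are frozen on Fridays.", namespace="platform", score=0.74),
    Chunk("On-call rotates weekly.", namespace="platform", score=0.52),
]

platform_view = retrieve_scoped(corpus, allowed_namespaces={"platform"}, k=2)
```

The HR chunk has the highest score, but a caller scoped to `platform` never retrieves it. Filtering before ranking matters: filtering after taking the top k could return fewer results than requested, or none.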
RAG and MCP-first: context as a first-class resource
If you’re designing an agent system, RAG fits naturally into the MCP-first model as a resource: a named, permissioned, agent-callable source of grounded context. An agent that needs to answer a question about a customer contract calls a contract-context resource, not a raw database. The resource layer handles retrieval, filtering, and redaction; the agent receives clean, scoped context.
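As a rough sketch of what such a resource might do internally, the function below checks permission, filters to one customer, ranks, and redacts before returning anything. It is plain Python, not an MCP SDK call: `contract_context`, `Passage`, the permission model, and the email-only redaction rule are all hypothetical simplifications.

```python
import re
from dataclasses import dataclass

# Simplified pattern for illustration; real redaction needs broader rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class Passage:
    customer_id: str
    text: str
    score: float  # similarity to the agent's question, precomputed for brevity


def redact(text: str) -> str:
    """Mask email addresses before the text leaves the resource layer."""
    return EMAIL.sub("[redacted email]", text)


def contract_context(
    passages: list[Passage],
    caller_customers: set[str],
    customer_id: str,
    k: int = 2,
) -> list[str]:
    """Hypothetical resource: returns scoped, redacted context for one customer."""
    # Permission check: refuse before any retrieval happens.
    if customer_id not in caller_customers:
        raise PermissionError(f"caller may not read contracts for {customer_id}")
    # Retrieval and filtering: only this customer's passages, best matches first.
    relevant = [p for p in passages if p.customer_id == customer_id]
    relevant.sort(key=lambda p: p.score, reverse=True)
    # Redaction: the agent only ever sees the cleaned text.
    return [redact(p.text) for p in relevant[:k]]


# Invented passages, for illustration only.
passages = [
    Passage("acme", "Renewal contact is jane@example.com until 2026", 0.88),
    Passage("acme", "The contract renews annually on 1 March", 0.81),
    Passage("globex", "Termination requires 90 days notice", 0.93),
]

context = contract_context(passages, caller_customers={"acme"}, customer_id="acme")
```

The agent calling this resource never touches the underlying passage table, never sees another customer's contract, and never sees the raw email address.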
This is the “context over raw data” principle in practice. An agent shouldn’t receive a full CRM export any more than a human should receive a printout of the entire database to answer one question. The right architecture delivers the relevant slice, already filtered to what the task needs and what the caller is allowed to read.
For the broader argument about how resources fit into an agent-controllable system, the /manifest.ai spec defines what it means for a resource to be safe, scoped, and auditable.
RAG isn’t about giving models more context. It’s about giving them the right context, and nothing else.