Vector databases and embeddings: a practical primer
What embeddings are, how similarity search works, and how to choose and operate a vector store without over-engineering it.
Most retrieval problems feel complicated until you understand what a vector database actually does. Strip away the marketing and you’re left with a surprisingly simple idea: encode meaning as a point in high-dimensional space, then find nearby points fast. Everything else (indexes, filters, re-ranking) is engineering built on top of that primitive.
What embeddings are
An embedding model takes a chunk of text (a sentence, a paragraph, a document) and maps it to a dense numeric vector, typically hundreds to a few thousand floating-point values. The model learns this mapping during training, so semantically similar text ends up geometrically close. “Invoice overdue” and “payment not received” land near each other; “invoice overdue” and “quarterly roadmap” do not.
The key property isn’t that the numbers mean anything in isolation. It’s that distances mean something. Once you have that, retrieval becomes arithmetic.
How similarity search works
Given a query, you embed it with the same model you used for your documents, then compare the query vector against the stored document vectors. The standard similarity measures are cosine similarity (the angle between vectors, insensitive to magnitude) and dot product (magnitude-aware, and cheaper to compute because it skips the normalization step). When all vectors are normalized to unit length, the two produce identical rankings, so for most text retrieval tasks the choice makes little practical difference.
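To make the relationship between the two measures concrete, here is a minimal sketch using NumPy. The vectors are made-up three-dimensional toy values chosen for illustration, not output from a real embedding model, and the helper names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

# Toy vectors (hypothetical values, for illustration only).
query = np.array([1.0, 2.0, 0.0])
doc_aligned = np.array([2.0, 4.0, 0.0])   # same direction as the query
doc_large = np.array([12.0, 0.0, 0.0])    # different direction, larger magnitude

# The raw dot product is magnitude-aware, so the larger vector wins.
raw_aligned = float(np.dot(query, doc_aligned))
raw_large = float(np.dot(query, doc_large))

# Cosine similarity only looks at direction, so the aligned vector wins.
cos_aligned = cosine_similarity(query, doc_aligned)
cos_large = cosine_similarity(query, doc_large)

# After normalizing both sides, the dot product equals cosine similarity.
unit_dot_large = float(np.dot(normalize(query), normalize(doc_large)))
```

On these toy values the two measures disagree about which document is the better match until the vectors are normalized, at which point the dot product and cosine similarity coincide.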
The naive approach (score every stored vector against the query) is exact but O(n) per query. At millions of documents that becomes unacceptable. This is where approximate nearest-neighbor (ANN) indexes come in.
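The exact, score-everything approach fits in a few lines. This is a sketch assuming the corpus fits in memory as a NumPy matrix; the function name and the toy corpus are illustrative.

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Return (row index, cosine score) for the k best matches, best first."""
    # Normalize both sides so a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = m @ q                   # one score per stored vector: O(n * d) work
    top = np.argsort(-scores)[:k]    # full sort of n scores; fine for small n
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus of four 2-dimensional vectors (hypothetical values).
vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
])
results = brute_force_top_k(np.array([1.0, 0.0]), vectors, k=2)
```

Every query touches every row, which is why the cost grows linearly with corpus size.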
ANN indexes and the recall/latency tradeoff
ANN indexes trade a small amount of recall for a large reduction in query latency. Common families include:
- HNSW: graph-based, high recall, memory-resident, fast queries
- IVF variants: inverted-file partitioning, disk-friendly, good at scale
- LSH: hash-based, simpler, works when recall requirements are loose
- Flat / brute-force: exact, no index, fine below ~100k vectors
Most ANN indexes expose a search-time parameter that controls this tradeoff (ef_search for HNSW, nprobe for IVF, or similar). Raising it recovers recall at the cost of latency. In practice you tune this against your acceptable p95 query time and your tolerable miss rate. Most production workloads land at 95 to 99% recall because the last percentage point or so costs a disproportionate amount of compute.
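One way to see the tradeoff is to measure recall against brute-force results while varying the search parameter. The sketch below builds a deliberately simplified IVF-style index in NumPy on random vectors. It is an assumption-laden toy: centroids are sampled from the data rather than learned with k-means (which a real IVF index would typically do), and every name here is illustrative, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 16))      # random stand-ins for document embeddings
queries = rng.normal(size=(50, 16))     # random stand-ins for query embeddings

def sq_dists(x: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance from x to every row of points."""
    return ((points - x) ** 2).sum(axis=1)

def exact_top_k(q: np.ndarray, k: int) -> np.ndarray:
    """Ground truth: scan every vector."""
    return np.argsort(sq_dists(q, data))[:k]

# Build step: sample centroids from the data, then assign each vector
# to the partition (inverted list) of its nearest centroid.
n_lists = 20
centroids = data[rng.choice(len(data), size=n_lists, replace=False)]
assignment = np.array([np.argmin(sq_dists(v, centroids)) for v in data])
lists = [np.flatnonzero(assignment == c) for c in range(n_lists)]

def ivf_top_k(q: np.ndarray, k: int, nprobe: int) -> np.ndarray:
    """Scan only the nprobe partitions whose centroids are closest to q."""
    probe = np.argsort(sq_dists(q, centroids))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    order = np.argsort(sq_dists(q, data[candidates]))[:k]
    return candidates[order]

def recall_at_k(k: int, nprobe: int) -> float:
    """Fraction of the true top-k that the approximate search returns."""
    hits = 0
    for q in queries:
        truth = set(exact_top_k(q, k).tolist())
        found = set(ivf_top_k(q, k, nprobe).tolist())
        hits += len(truth & found)
    return hits / (k * len(queries))
```

Calling recall_at_k(10, nprobe) with increasing nprobe shows recall rising toward 1.0 as more partitions are scanned; with nprobe equal to n_lists the search scans everything and degenerates to brute force.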
Metadata filtering
Pure semantic search is rarely enough. A user querying “account settings” shouldn’t retrieve documents from the wrong tenant, the wrong language, or a deprecated product version. Metadata filters let you restrict the search space before (pre-filter) or after (post-filter) the ANN pass.
Metadata design matters as much as embedding quality. Date ranges, category tags, access-control labels, and version identifiers all belong in your payload schema, not buried in the document text.
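The difference between the two filtering strategies shows up even on a tiny example. In this sketch a brute-force ranking stands in for the ANN pass, and the corpus, payload fields, and function names are all hypothetical.

```python
import numpy as np

# Hypothetical toy corpus: row i of `vectors` carries the payload `payloads[i]`.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
payloads = [
    {"tenant": "acme"},
    {"tenant": "globex"},
    {"tenant": "globex"},
    {"tenant": "acme"},
]

def top_k(query, ids, k):
    """Stand-in for the ANN pass: rank the given row ids by cosine similarity."""
    ids = np.array(list(ids), dtype=int)
    if ids.size == 0:
        return []
    m = vectors[ids]
    scores = (m @ query) / (np.linalg.norm(m, axis=1) * np.linalg.norm(query))
    return [int(i) for i in ids[np.argsort(-scores)[:k]]]

def pre_filter_search(query, predicate, k):
    """Restrict the candidate set first, then rank inside it."""
    allowed = [i for i, p in enumerate(payloads) if predicate(p)]
    return top_k(query, allowed, k)

def post_filter_search(query, predicate, k):
    """Rank the whole corpus first, then drop hits that fail the filter."""
    hits = top_k(query, range(len(vectors)), k)
    return [i for i in hits if predicate(payloads[i])]

query = np.array([1.0, 0.0])
is_acme = lambda p: p["tenant"] == "acme"
pre = pre_filter_search(query, is_acme, k=2)
post = post_filter_search(query, is_acme, k=2)
```

Here the pre-filter returns two results for the tenant, while the post-filter returns only one because the other top-ranked hit belonged to a different tenant and was discarded after ranking. Post-filtering can under-fill a result set in this way; pre-filtering avoids that but requires the store to combine the filter with the index efficiently.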
When you need a dedicated vector database
Not every project does. A small corpus that rarely changes can live in a plain JSON file queried with brute-force cosine scoring. A relational database with a vector extension handles moderate scale without adding operational complexity. The case for a standalone vector store strengthens when:
- you have tens of millions of vectors or more
- you need sub-100 ms p95 latency under concurrent load
- your filtering logic is complex and needs efficient indexing alongside the ANN index
- you want purpose-built features like hybrid search (keyword + semantic combined)
If none of those apply yet, add the dedicated store later. Premature infrastructure is a retrieval problem in its own right.
Operational concerns
Two things surprise teams after their first deployment.
Re-embedding when models change. Your index is coupled to the model that produced it. If you upgrade the embedding model, even to a semantically better one, you must re-embed the entire corpus and rebuild the index. Plan for this. Keep the original source text; never store only the vectors.
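One way to make this coupling explicit is to store the model identifier and the source text next to every vector, so a mismatch is detected rather than silently producing bad results. This is a sketch with hypothetical record fields and model names, not a feature of any particular store.

```python
from dataclasses import dataclass

@dataclass
class Record:
    doc_id: str
    text: str             # original source text, kept so re-embedding is possible
    vector: list[float]
    model: str            # identifier of the embedding model that produced `vector`

def stale_records(records: list[Record], current_model: str) -> list[Record]:
    """Records whose vectors came from a different model and need re-embedding."""
    return [r for r in records if r.model != current_model]

def check_query_model(index_model: str, query_model: str) -> None:
    """Refuse to compare vectors produced by different models."""
    if index_model != query_model:
        raise ValueError(
            f"index was built with {index_model!r} but query uses {query_model!r}"
        )

# Hypothetical model names, for illustration only.
records = [
    Record("a", "Invoice overdue", [0.1, 0.2], model="embed-v1"),
    Record("b", "Quarterly roadmap", [0.3, 0.4], model="embed-v2"),
]
to_redo = stale_records(records, current_model="embed-v2")
```

Because each record keeps its text, everything returned by stale_records can be re-embedded with the new model without going back to the original data source.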
Index freshness. Most ANN indexes are built once and updated in bulk. Major libraries support incremental inserts, but sustained insert and delete churn can degrade index quality over time, so you periodically need a full rebuild. For high-write workloads, segment the index or use a store that manages this internally.
The connection to agent context
For agents, retrieval isn’t decoration; it’s how they get context they weren’t pre-trained on. A well-designed retrieval layer becomes the resource feed that an MCP server exposes to any agent that reads from it. The agent doesn’t know about cosine distances or HNSW graphs; it reads structured, relevant context through a typed resource endpoint.
That’s the architecture worth building toward: retrieval quality improves silently inside the resource layer, and every agent consumer benefits without code changes. More on how retrieval fits the broader context layer in the RAG and context post and in the resource overview.
Good retrieval is invisible infrastructure: the agent sees relevant context, not index mechanics.