What is a vector database?

A vector database is a database designed to store and search high-dimensional vectors, called embeddings, by similarity rather than exact match, enabling semantic search and powering retrieval-augmented generation for AI systems.

Embeddings and similarity search

A vector database stores embeddings: lists of numbers, produced by a machine-learning model, that place items with similar meaning close together in a high-dimensional space. Where a traditional database matches exact values or keywords, a vector database performs similarity search: given a query embedding, it returns the stored vectors closest to it under a similarity or distance measure, such as cosine similarity or Euclidean distance. This enables semantic search, which finds results by meaning rather than exact wording, so a query and a relevant document can match even when they share no keywords.
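As a minimal sketch of similarity search, the following compares a query against every stored vector using cosine similarity. The three-dimensional vectors, the item names, and the `nearest` helper are illustrative assumptions; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def nearest(query, store, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, hand-written for illustration only.
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.05]
top = nearest(query, store, k=2)
```

Here `top` holds the two animal-related items, because their vectors point in nearly the same direction as the query, while "car" points elsewhere and is left out.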

Approximate nearest-neighbour search

Finding the exact nearest vectors means comparing the query against every stored vector, so the cost grows linearly with the size of the collection and becomes too slow at scale. Vector databases instead use approximate nearest-neighbour (ANN) algorithms, which build specialised index structures that narrow the search to a small set of candidates. They return very close matches far faster, trading a small loss of accuracy for a large gain in speed.

This indexing is the core technical capability that lets vector databases search millions or billions of high-dimensional vectors quickly enough for interactive use.
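To make the trade-off concrete, here is a toy sketch of one simple ANN family, random-hyperplane locality-sensitive hashing. It is an illustration only: the `HyperplaneLSH` class, its parameters, and the random test data are assumptions made for this example, and production systems commonly rely on other index types, such as graph-based or cluster-based indexes.

```python
import math
import random
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(vec):
    """Scale a non-zero vector to length 1 so dot product equals cosine similarity."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

class HyperplaneLSH:
    """Toy ANN index: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane through the origin contributes one signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        return tuple(dot(plane, vec) >= 0.0 for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets[self._signature(vec)].append((item_id, vec))

    def candidates(self, vec):
        """Stored items that share the query's bucket."""
        return self.buckets.get(self._signature(vec), [])

    def query(self, vec, k=1):
        # Only the query's own bucket is scanned, so a true neighbour that fell
        # into a different bucket is missed: that is the accuracy/speed trade-off.
        ranked = sorted(self.candidates(vec), key=lambda item: dot(item[1], vec), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]

# Hypothetical data: 200 random unit vectors in 8 dimensions.
rng = random.Random(1)
vectors = {f"doc{i}": unit([rng.gauss(0.0, 1.0) for _ in range(8)]) for i in range(200)}
index = HyperplaneLSH(dim=8, n_planes=4)
for item_id, vec in vectors.items():
    index.add(item_id, vec)

hits = index.query(vectors["doc7"], k=3)
scanned = len(index.candidates(vectors["doc7"]))  # vectors actually compared
```

With four hyperplanes there are at most 16 buckets, so a query is compared against only the vectors in one bucket rather than all 200. Adding hyperplanes shrinks the buckets and speeds up queries further, at the price of missing more true neighbours.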

Role in retrieval-augmented generation

Vector databases are central to retrieval-augmented generation (RAG), a pattern that improves large language models by supplying them with relevant external information. Documents are embedded and stored in the vector database; at query time, the system retrieves the most relevant passages by similarity and adds them to the model's prompt. This grounds the model's output in specific, up-to-date sources, helping to reduce hallucination and letting a model draw on information beyond its training data.
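The retrieve-then-prompt flow described above can be sketched as follows. Everything named here is hypothetical: `embed` is a word-count stand-in for a real embedding model (so it matches on shared words, not meaning), the documents are invented, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, indexed, k=2):
    """Return the k stored passages most similar to the question."""
    q = embed(question)
    ranked = sorted(indexed, key=lambda entry: cosine(q, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question in the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

documents = [
    "Vector databases store embeddings and search them by similarity.",
    "Relational databases match records by exact values.",
    "Approximate nearest-neighbour indexes trade a little accuracy for speed.",
]
indexed = [(doc, embed(doc)) for doc in documents]  # indexing step, done once

question = "How do vector databases search embeddings?"
passages = retrieve(question, indexed, k=1)  # retrieval at query time
prompt = build_prompt(question, passages)    # text that would be sent to the model
```

In a real system, `embed` would call an embedding model and `indexed` would live in a vector database, but the shape of the pipeline (embed, store, retrieve, add to the prompt) stays the same.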

Vector databases in research

For research and scholarly tools, vector databases enable semantic search over large collections — literature, datasets, or notes — retrieving conceptually related material that keyword search would miss. Their results depend on the embedding model used, so the same database can behave differently with different embeddings; this choice should be documented. Because retrieval quality shapes downstream answers in RAG systems, evaluating what is retrieved, not only what is generated, is part of sound methodology.
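One common way to evaluate what is retrieved, separately from what is generated, is recall at k: the share of known-relevant items that appear among the top k results. The sketch below assumes hand-labelled relevance judgements; the query ids, document ids, and `recall_at_k` helper are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Hypothetical judged queries: ranked retrieval output and hand-labelled relevant ids.
runs = {
    "q1": (["d3", "d7", "d1"], {"d3", "d1"}),
    "q2": (["d2", "d9", "d4"], {"d5"}),
}
scores = {q: recall_at_k(ranked, rel, k=2) for q, (ranked, rel) in runs.items()}
mean_recall = sum(scores.values()) / len(scores)
```

Running the same judged queries against two embedding models gives a like-for-like comparison, which is one way to document the effect of the embedding choice.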

Key facts

At a glance

Definition: database storing and searching embeddings by similarity
Stores: high-dimensional vectors (embeddings)
Search: by similarity, not exact match (semantic search)
Speed: uses approximate nearest-neighbour (ANN) algorithms
Key application: retrieval-augmented generation (RAG)
Similarity measures: e.g. cosine similarity

Common questions

FAQ

How is a vector database different from a traditional database?

A traditional database retrieves records by exact matches or keywords. A vector database stores embeddings and retrieves items by similarity, finding the vectors closest to a query. This enables semantic search — matching by meaning rather than exact wording.

What is retrieval-augmented generation?

Retrieval-augmented generation (RAG) improves a large language model by retrieving relevant documents — typically from a vector database — and adding them to the prompt. This grounds the model's output in specific sources, reducing hallucination and extending its knowledge beyond training data.

What are embeddings?

Embeddings are numerical vectors, produced by a machine-learning model, that represent the meaning of text, images, or other data so that similar items lie close together in a high-dimensional space. Vector databases store and search over these embeddings.
