How to Choose a Vector Database for LLM Workloads Today
Choosing the right AI infrastructure for vector search? We break down filtering, hybrid search, and write throughput—the metrics that actually matter for LLM apps.

Picking a vector database for your LLM application feels simple until you actually run one in production. The marketing pages all show impressive approximate nearest-neighbor (ANN) benchmarks, and every option claims sub-millisecond latency at high recall. Then you add a metadata filter, your query latency jumps 40x, and you realize the benchmark you relied on tested a workload nothing like yours.
The decision comes down to three capabilities that vendor pages often bury: how the database handles filtered vector search, whether it supports hybrid keyword-plus-vector queries natively, and how it behaves under write pressure. Get those right for your workload and the rest of the choice is mostly preference and operational familiarity.
We've helped many teams at Laxaar wire up RAG pipelines and semantic search features. The patterns below reflect real production trade-offs, not synthetic benchmarks on static datasets.
What you'll learn
- Why ANN benchmarks mislead most buyers
- Filtered vector search and why it changes everything
- Hybrid search: when BM25 plus vectors beats either alone
- Write throughput and indexing lag under real workloads
- Managed vs self-hosted: the real cost comparison
- Head-to-head: Pinecone, Weaviate, Qdrant, pgvector
- How to match a database to your actual workload
- Frequently Asked Questions
Why ANN benchmarks mislead most buyers
ANN benchmarks measure raw nearest-neighbor recall and queries per second on a static, pre-indexed dataset with no filters applied. That's a useful stress test for the index algorithm itself. It tells you almost nothing about how the database will perform once you layer on the conditions every real LLM app needs.
The most widely cited benchmarks, like ANN-Benchmarks and the ones published by the vendors themselves, test single-tenant flat datasets of millions of pre-embedded vectors. Your RAG system will likely query across user- or tenant-scoped subsets with date ranges, category filters, or permission checks applied simultaneously. The index that wins the raw recall race often falls apart when a filter forces a post-scan across results, turning an O(log n) approximate search into something much closer to brute-force.
The honest conclusion is this: treat ANN benchmarks as a minimum bar, not a selection criterion. If a database can't win a raw recall test, scratch it. But if several clear the bar, the differentiation is elsewhere.
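One way to go beyond published numbers is to compute recall yourself against an exact ground truth, with your own filter applied. The sketch below is a minimal, standard-library illustration of that idea; the function names and the `allowed` parameter are ours, not part of any benchmark suite, and brute-force search is only practical here on a sample of your data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k, allowed=None):
    """Brute-force ground truth. `allowed` optionally restricts the candidate
    ids, which is how a metadata filter is modelled in this sketch."""
    ids = range(len(vectors)) if allowed is None else allowed
    ranked = sorted(ids, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

def recall_at_k(returned_ids, truth_ids):
    """Fraction of the true top-k that the system under test returned."""
    if not truth_ids:
        return 1.0
    return len(set(returned_ids) & set(truth_ids)) / len(truth_ids)

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
truth = exact_top_k([1.0, 0.0], vectors, k=2, allowed=[1, 2])
# `returned` would come from the database you are evaluating.
returned = [2]
print(recall_at_k(returned, truth))  # 0.5
```

Run the same comparison with and without `allowed` on a representative sample; the gap between the two recall figures is the number the vendor benchmark does not show.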
Filtered vector search and why it changes everything
Filtered vector search is the ability to apply structured metadata conditions at query time while still returning the top-k nearest neighbors from within that filtered subset. It sounds straightforward. The implementation is where databases diverge sharply.
There are two main approaches:
Pre-filter: apply the metadata filter first to produce a candidate set, then run ANN search within that set. Works well when filters are selective (small candidate sets). Degrades when the filter returns millions of records and ANN has to scan nearly the whole index anyway.
Post-filter: run ANN search first to retrieve the top-k candidates globally, then discard those that don't match the filter. Fast for broad filters, but if your filter is selective it discards most results and you end up with fewer than k items.
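The difference between the two approaches is easy to see on a toy dataset. The sketch below is illustrative only: exact search stands in for the ANN step, and the item layout and helper names are assumptions made for the example, not any database's API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, items, k):
    """Exact search standing in for the ANN step; items are (id, vector, meta)."""
    return sorted(items, key=lambda it: dot(query, it[1]), reverse=True)[:k]

def pre_filter_search(query, items, k, predicate):
    """Filter first, then search only within the matching candidates."""
    candidates = [it for it in items if predicate(it[2])]
    return [it[0] for it in top_k(query, candidates, k)]

def post_filter_search(query, items, k, predicate):
    """Search globally first, then drop hits that fail the filter."""
    hits = top_k(query, items, k)
    return [it[0] for it in hits if predicate(it[2])]

items = [
    (0, [1.0, 0.0], {"tenant": "a"}),
    (1, [0.9, 0.1], {"tenant": "a"}),
    (2, [0.8, 0.2], {"tenant": "b"}),
    (3, [0.1, 0.9], {"tenant": "b"}),
]
query = [1.0, 0.0]
is_tenant_b = lambda meta: meta["tenant"] == "b"

print(pre_filter_search(query, items, 2, is_tenant_b))   # [2, 3]
print(post_filter_search(query, items, 2, is_tenant_b))  # []
```

Here the two globally nearest vectors both belong to tenant "a", so post-filtering for tenant "b" returns nothing even though two valid matches exist.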
The best databases support payload-indexed filtering that integrates the filter directly into the graph traversal, avoiding both problems above. Qdrant's implementation of filtered HNSW is the most discussed example of this. Weaviate's filtered search also integrates at the index level, using an inverted index over object properties to build an allow-list that the vector search respects.
If your application has tenant isolation, date-range narrowing, or category scoping baked into every query, test filtered recall specifically. A 20ms unfiltered query becoming a 900ms filtered query is not a corner case — it's the default behavior on several popular options when filters are not index-integrated.
Hybrid search: when BM25 plus vectors beats either alone
Hybrid search combines dense vector similarity (semantic) with sparse keyword scoring (usually BM25 or TF-IDF), then merges the two result lists with a fusion step such as reciprocal rank fusion or a weighted score blend. It reliably outperforms either method alone for most document retrieval tasks.
The reason is straightforward. Dense vectors are good at conceptual similarity but miss exact keyword matches that are highly relevant. A user searching for "ISO 27001 audit checklist" might get semantically related documents about general compliance frameworks when what they need is the document that literally says "ISO 27001." BM25 anchors recall on exact terms; vectors fill in synonyms and paraphrases.
Not every vector database ships native hybrid search. Some delegate to an external search layer like Elasticsearch or OpenSearch, requiring you to maintain two systems and write your own fusion logic. Others, like Weaviate and Qdrant, handle BM25 scoring natively alongside vector search, which keeps query logic in one place.
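```python
# Weaviate hybrid search example (v4 Python client)
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()  # assumes a locally running instance
collection = client.collections.get("Document")  # hypothetical collection name

results = collection.query.hybrid(
    query="ISO 27001 audit checklist",
    alpha=0.5,  # 0 = pure BM25, 1 = pure vector
    limit=10,
    return_metadata=wvc.query.MetadataQuery(score=True),
)
client.close()
```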
The alpha parameter controls weighting between keyword and semantic scoring. A value around 0.5 to 0.7 is a reasonable starting point; tune it against your eval set. The real trade-off is operational: native hybrid search means less infrastructure, but if your team already runs Elasticsearch well, a two-system approach may be more maintainable despite the complexity.
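To build intuition for what a weighting parameter like alpha does, here is a minimal sketch of a weighted score blend over min-max normalized scores. It is one plausible fusion scheme written for illustration, not a reproduction of any database's internal implementation; the document ids and scores are made up.

```python
def min_max(scores):
    """Scale a {doc_id: score} map into [0, 1]; a constant map becomes all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(keyword_scores, vector_scores, alpha=0.5):
    """Rank doc ids by a blend: alpha=0 is keyword only, alpha=1 is vector only."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    docs = set(kw) | set(vec)
    fused = {d: (1 - alpha) * kw.get(d, 0.0) + alpha * vec.get(d, 0.0) for d in docs}
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Made-up scores for illustration.
bm25 = {"iso-27001-checklist": 12.0, "soc2-guide": 3.0}
dense = {"compliance-overview": 0.82, "iso-27001-checklist": 0.78, "soc2-guide": 0.60}

print(fuse(bm25, dense, alpha=0.5))
# ['iso-27001-checklist', 'compliance-overview', 'soc2-guide']
print(fuse(bm25, dense, alpha=1.0))
# ['compliance-overview', 'iso-27001-checklist', 'soc2-guide']
```

With pure vector weighting the general compliance document ranks first; giving the keyword score half the weight pulls the exact-match document to the top.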
Write throughput and indexing lag under real workloads
Most teams benchmark read performance and forget writes entirely. For RAG systems that ingest live documents, product updates, or customer data, write throughput and the lag between ingestion and queryability matter a lot.
HNSW, the dominant index type across most vector databases, is expensive to update. Inserting a vector means finding its neighbors, which requires traversing the graph. Under high write load, some databases queue inserts and rebuild index segments asynchronously, meaning freshly ingested vectors aren't immediately queryable. That gap can range from seconds to minutes depending on the database and the load.
Flat indexes (brute-force) have no indexing lag because there's no graph to update, but query latency scales linearly with dataset size. They're only viable for small datasets (under roughly 100k vectors) or when you can batch queries heavily.
For write-heavy pipelines, check specifically:
- Whether the database supports an "upsert" operation or requires delete-then-insert
- Whether newly inserted vectors are immediately searchable or batched into the next index segment
- What happens to query latency when a reindex is triggered mid-operation
Qdrant and Milvus both have configurable segment policies that let you tune the write/query latency trade-off. Pinecone's serverless offering hides most of this but also removes your ability to tune it — a real constraint if you need tight freshness guarantees.
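Freshness is straightforward to measure before you commit. The sketch below times how long a newly written vector takes to appear in search results. `upsert` and `search` are hypothetical callables you would wrap around whichever client you are testing, and `DelayedStore` is a toy stand-in so the example runs on its own.

```python
import time

def measure_visibility_lag(upsert, search, doc_id, vector, timeout_s=30.0, poll_s=0.05):
    """Seconds from `upsert` returning until `doc_id` shows up in search results.

    `upsert(doc_id, vector)` and `search(vector)` are hypothetical wrappers
    around the client under test; `search` returns an iterable of ids.
    Returns None if the vector is still not visible after `timeout_s`.
    """
    upsert(doc_id, vector)
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if doc_id in search(vector):
            return time.monotonic() - start
        time.sleep(poll_s)
    return None

class DelayedStore:
    """Toy stand-in whose writes become searchable only after `delay_s`."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.visible_at = {}

    def upsert(self, doc_id, vector):
        self.visible_at[doc_id] = time.monotonic() + self.delay_s

    def search(self, vector):
        now = time.monotonic()
        return [doc for doc, ready in self.visible_at.items() if now >= ready]

store = DelayedStore(delay_s=0.2)
lag = measure_visibility_lag(store.upsert, store.search, "doc-1", [0.1, 0.2])
print(f"visible after {lag:.2f}s")
```

Run it while a bulk ingest is in progress as well as on an idle system, since the lag under write pressure is the figure that matters.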
Managed vs self-hosted: the real cost comparison
The managed-vs-self-hosted question isn't primarily about cost at small scale. At low query volumes, managed options like Pinecone Serverless or Weaviate Cloud Services are often cheaper than provisioning and maintaining infrastructure. The cost equation flips somewhere around 50-100 million vectors with moderate query load.
| Factor | Managed (e.g. Pinecone) | Self-hosted (e.g. Qdrant on k8s) |
|---|---|---|
| Ops overhead | Near zero | Significant (backups, upgrades, HA) |
| Cost at scale | High (per-vector pricing) | Lower (compute only) |
| Tuning control | Limited | Full |
| Freshness SLA | Provider-controlled | You control |
| Egress costs | Can be significant | Minimal within cloud |
| Migration path | Vendor-dependent | Portable |
The less-discussed cost of managed databases is egress. If your application calls the vector database hundreds of millions of times a month, egress fees from queries and metadata can dwarf the base storage cost. Read the pricing page specifically for query egress before committing.
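A back-of-the-envelope estimate helps when reading a pricing page. The sketch below multiplies query volume by response size; every input is a placeholder chosen for illustration, not a real vendor price, so substitute the figures from your own provider and payloads.

```python
def monthly_egress_cost(queries_per_month, top_k, bytes_per_result, price_per_gb):
    """Rough monthly egress cost, in the currency of `price_per_gb`."""
    gigabytes = queries_per_month * top_k * bytes_per_result / 1e9
    return gigabytes * price_per_gb

# Placeholder inputs, not real vendor prices.
cost = monthly_egress_cost(
    queries_per_month=200_000_000,
    top_k=10,
    bytes_per_result=2_000,  # id, score, and a chunk of metadata (assumed size)
    price_per_gb=0.09,       # hypothetical rate; check your provider's pricing page
)
print(f"{cost:.2f}")  # 360.00
```

The estimate scales linearly with `top_k` and with metadata size, so returning only the fields you need is the cheapest lever.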
Self-hosting on Kubernetes with Qdrant or Milvus gives full control but demands operational maturity. We've seen teams at Laxaar save meaningful money at scale, and also seen teams underestimate the maintenance burden and regret leaving the managed path. There's no universally right answer — it depends on your team's operational experience and how much the cost difference justifies the overhead.
Head-to-head: Pinecone, Weaviate, Qdrant, pgvector
Here's where the main options actually differentiate, stripped of marketing language:
Pinecone is the easiest managed option to get started with. The serverless tier removes index configuration entirely, which is genuinely useful for prototyping. The downsides are real: no native hybrid search without an external BM25 layer, limited filter-integration transparency, and pricing that becomes painful past moderate scale. It's a good fit for teams that need to ship fast and don't want to operate infrastructure.
Weaviate is the strongest option for hybrid search out of the box. BM25 and vector search share a single query plane, and the schema system enforces data structure in a way that helps larger teams avoid payload drift. The operational complexity is higher than Pinecone. Self-hosting Weaviate on Kubernetes is manageable but not trivial.
Qdrant is our preferred recommendation for most production RAG systems. Filtered HNSW is its main technical differentiator, and it genuinely works — filtered queries don't degrade nearly as badly as on databases using naive post-filtering. The Rust implementation is efficient, memory usage is configurable through quantization, and the REST and gRPC APIs are well-designed. The managed cloud offering exists but is less mature than Pinecone's.
pgvector is a PostgreSQL extension that adds vector similarity search. It's the right choice when your data already lives in Postgres and your dataset stays under roughly 1-5 million vectors. Beyond that, query performance degrades without aggressive indexing tuning. The major advantage is zero new infrastructure and full SQL join capability, which lets you combine vector search with relational queries in a single statement. For small-to-medium RAG systems, don't dismiss it.
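The single-statement join is worth seeing concretely. The sketch below builds a parameterized query that filters on a joined table and orders by vector distance. The `documents` and `tenants` tables and their columns are a hypothetical schema; `<=>` is pgvector's cosine-distance operator. The query is only constructed here, not executed, so the example runs without a database.

```python
def to_pgvector_literal(vector):
    """Format a Python sequence as pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"

# Hypothetical schema:
#   documents(id, tenant_id, title, embedding vector(3))
#   tenants(id, plan)
SEARCH_SQL = """
SELECT d.id, d.title, d.embedding <=> %(query)s::vector AS distance
FROM documents AS d
JOIN tenants AS t ON t.id = d.tenant_id
WHERE t.plan = %(plan)s
ORDER BY d.embedding <=> %(query)s::vector
LIMIT %(k)s
"""

params = {
    "query": to_pgvector_literal([0.1, 0.2, 0.3]),
    "plan": "enterprise",
    "k": 5,
}
print(params["query"])  # [0.1,0.2,0.3]
# With a driver such as psycopg, this would run as:
#   cursor.execute(SEARCH_SQL, params)
```

In a dedicated vector store, the `tenants.plan` condition would have to be denormalized into each vector's metadata; here it stays in the relational table where it already lives.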
How to match a database to your actual workload
Before picking a database, answer these four questions:
- Do your queries always include a filter? If yes, test filtered recall specifically, not raw recall. Qdrant and Weaviate handle this best.
- Do your users search with keywords as well as concepts? If yes, you need native hybrid search or a deliberate two-system architecture. Weaviate has the most mature native implementation.
- How frequently does your document corpus change? High write frequency with freshness requirements pushes you toward databases with configurable segment policies or one that supports immediate visibility for new vectors.
- What's your team's operational capacity? If you don't have someone who can manage a stateful Kubernetes deployment, start with a managed service. You can migrate later once you understand your actual query patterns.
One take we'll stand behind: starting with pgvector in Postgres is a reasonable move for most new RAG projects. It removes a database from your stack entirely, and you can migrate to a dedicated vector store once you've validated the product and know your real query patterns. Optimizing AI infrastructure before you've confirmed product-market fit is a common mistake.
If you're building a more demanding system — multi-tenant, high write throughput, or requiring tight freshness — explore our AI infrastructure and AI development work to see how we approach these decisions with clients.
Frequently Asked Questions
What's the difference between a vector database and a regular database with a vector extension?
A dedicated vector database is optimized end-to-end for storing, indexing, and querying high-dimensional embeddings. Extensions like pgvector add vector search capability to a relational database, which is convenient but involves trade-offs in query performance at larger scales. For datasets under a few million vectors, the difference is often negligible. Past that, specialized index algorithms and storage formats in dedicated databases start to matter.
How many vectors do we typically need before performance becomes a concern?
With a flat (brute-force) index, you'll notice latency creeping up past roughly 100k vectors unless you batch queries heavily. With HNSW indexes, most databases handle tens of millions of vectors well, but filtered queries and concurrent write pressure expose limits earlier. A practical benchmark: test your actual query pattern, with your actual filters, against a representative sample before choosing.
Can we switch vector databases later without rewriting our application?
Yes, if you plan for it. If you standardize on a single embedding model and expose vector search behind an abstraction layer in your application code, swapping the database underneath is a backend migration. Re-embedding the corpus is the larger effort, but it is only required if the embedding model changes. The bigger risk is payload schema differences between databases, so plan your metadata structure carefully upfront.
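An abstraction layer for this can be small. The sketch below shows one possible shape: a narrow interface the application depends on, plus an in-memory implementation that doubles as a test fixture. The `VectorStore` protocol, its method names, and the `retrieve` helper are hypothetical, not taken from any client library.

```python
from typing import Optional, Protocol, Sequence

class VectorStore(Protocol):
    """Hypothetical interface the application codes against."""

    def upsert(self, doc_id: str, vector: Sequence[float], metadata: dict) -> None: ...

    def search(self, vector: Sequence[float], k: int,
               filters: Optional[dict] = None) -> list: ...

class InMemoryStore:
    """Reference implementation for tests; a real adapter would wrap a client."""

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, metadata):
        self._rows[doc_id] = (list(vector), dict(metadata))

    def search(self, vector, k, filters=None):
        def matches(meta):
            return all(meta.get(key) == value for key, value in (filters or {}).items())

        def score(row):
            return sum(a * b for a, b in zip(vector, row[0]))

        hits = [(doc_id, row) for doc_id, row in self._rows.items() if matches(row[1])]
        hits.sort(key=lambda hit: score(hit[1]), reverse=True)
        return [doc_id for doc_id, _ in hits[:k]]

def retrieve(store: VectorStore, query_vector, tenant):
    """Application code depends only on the interface, not on a vendor client."""
    return store.search(query_vector, k=3, filters={"tenant": tenant})

store = InMemoryStore()
store.upsert("a", [1.0, 0.0], {"tenant": "acme"})
store.upsert("b", [0.0, 1.0], {"tenant": "acme"})
store.upsert("c", [1.0, 0.0], {"tenant": "other"})
print(retrieve(store, [1.0, 0.0], "acme"))  # ['a', 'b']
```

Keeping filters as a plain dictionary at this boundary is a deliberate simplification: each adapter translates it into its own filter syntax, which is where the payload schema differences surface.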
Is hybrid search worth the extra complexity for most RAG use cases?
For general document retrieval, yes. A well-tuned hybrid search consistently outperforms pure vector search on standard retrieval benchmarks, often by a significant margin, particularly on queries with proper nouns, codes, or specific terminology. The complexity cost is low when the database handles BM25 natively. It's only a genuine burden when you need to run a separate search system alongside the vector store.
What embedding model should we pair with our vector database?
The database choice and the embedding model are largely independent decisions. Use a model appropriate to your content type and language requirements. For English text, OpenAI's text-embedding-3-small and Cohere's embed-v3 are widely used. For multilingual or domain-specific needs, evaluate open-source models on MTEB scores for your task type. One practical constraint: check the embedding dimensionality your model outputs against what the database can index efficiently, since very high-dimensional vectors (over 3072) increase memory and storage costs.
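The dimensionality cost is simple arithmetic. The sketch below estimates storage for the raw vectors alone, assuming 4-byte float32 values; it deliberately ignores index structures and metadata, which add to the total. The corpus size of 10 million vectors is an arbitrary example.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Storage for the raw vectors alone; float32 uses 4 bytes per value."""
    return num_vectors * dimensions * bytes_per_value

for dims in (1536, 3072):
    gib = raw_vector_bytes(10_000_000, dims) / 2**30
    print(f"{dims} dims: {gib:.1f} GiB")
# 1536 dims: 57.2 GiB
# 3072 dims: 114.4 GiB
```

Doubling the dimensionality doubles the raw footprint, which is why quantization or a lower-dimensional model is worth evaluating before the corpus grows.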
Choosing the right vector database is an AI infrastructure decision that compounds over time — a poor choice early becomes a painful migration once your corpus grows. The Laxaar team has built and operated RAG systems across a range of industries and scales. If you're at the architecture stage and want an honest second opinion on your stack, tell us about your project or explore our custom software development services.


