
RAG evaluation: the tests we run before shipping any LLM feature

A production-grade evaluation harness for retrieval-augmented generation. Golden datasets, LLM-as-judge, retrieval metrics, and regression gates.

Anthra AI Team · Engineering Team · 5 min read
Table of contents
  1. What we're evaluating
  2. Step 1: Build a golden dataset
  3. Step 2: Retrieval metrics
  4. Recall@K
  5. Precision@K
  6. MRR / nDCG
  7. Step 3: Generation metrics
  8. Step 4: Failure taxonomy
  9. Step 5: Regression gates
  10. Step 6: Production monitoring
  11. Practical thresholds to start with
  12. Common mistakes
  13. Closing

Most RAG systems ship without evaluation. The team hacks together a prompt, tests a few queries by hand, ships, and waits for bug reports. When the inevitable "it's hallucinating" complaints arrive, they can't systematically fix anything because they have no baseline.

This is the evaluation harness we set up on day one of any RAG engagement. It's not exotic — it's discipline. Skip it at your peril.

What we're evaluating

A RAG system has two components, and you need to evaluate both independently:

  1. Retrieval — given a query, did we fetch the right documents?
  2. Generation — given correct documents, did the LLM produce a correct, grounded answer?

If you only measure the end-to-end output, you can't tell which part is broken. When accuracy drops, is it because retrieval is fetching garbage, or because the model is ignoring good context? You need to know.

The 80/20 of RAG quality

In our experience, 70% of RAG quality problems are retrieval problems. If you only have time for one thing, evaluate retrieval.

Step 1: Build a golden dataset

Before any evaluation, you need ground truth. This is the part teams skip — and it's the most important part.

A golden dataset is a collection of (query, expected_documents, expected_answer) tuples, curated by humans who understand the domain.

For our recent legal tech engagement, we built 200 such tuples across four query types:

  • Factual lookups ("What is the reporting threshold for section 13G?")
  • Comparative questions ("How does rule X differ from rule Y?")
  • Multi-hop reasoning ("If A changed and B references A, what's the effect?")
  • Adversarial / out-of-scope ("Who won the Super Bowl?")

The size matters less than the diversity. 100 good examples beat 1000 mediocre ones. Aim for coverage of your real user queries, including edge cases.

💡How to bootstrap a golden dataset

If you don't have production traffic yet: have 2-3 domain experts spend a day each generating 50-100 queries with expected answers. If you have traffic: sample 200 production queries, have humans answer and document them, curate.
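One way to make the tuple structure concrete is a small schema. This is a minimal sketch, assuming a flat document-ID scheme; the field names and the `GoldenExample` class are illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One human-curated (query, expected_documents, expected_answer) tuple.

    Hypothetical schema for illustration; adapt fields to your corpus.
    """
    query: str
    expected_doc_ids: list[str]  # documents retrieval should surface
    expected_answer: str         # expert-written reference answer
    query_type: str              # "factual" | "comparative" | "multi_hop" | "adversarial"


example = GoldenExample(
    query="What is the reporting threshold for section 13G?",
    expected_doc_ids=["sec-13g-overview", "sec-13g-thresholds"],
    expected_answer="<expert-written answer goes here>",
    query_type="factual",
)
```

Storing these as dataclasses (or JSON lines) keeps the golden set diffable and reviewable in version control, which matters once experts start curating it.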

Step 2: Retrieval metrics

For each query in your golden set, your retrieval system returns a ranked list of documents. Measure it with:

Recall@K

Of the documents that should have been retrieved (the ground truth set), what fraction appear in the top K?

Track at least Recall@5 and Recall@10. Low recall means generation never had a chance.
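The metric itself is a few lines. A minimal sketch, where `retrieved_ids` is the ranked list your retriever returns and `relevant_ids` is the ground-truth set from the golden example:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of ground-truth documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Two of the three relevant docs appear in the top 5.
score = recall_at_k(["a", "b", "c", "d", "e"], ["a", "e", "f"], k=5)
```

Run it over every golden example and report the mean per query type, not just the global average; adversarial and multi-hop queries often hide behind a healthy aggregate.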

Precision@K

How many retrieved documents are actually relevant?

Precision matters when your context window is expensive or noisy. Low precision increases hallucination risk.
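The mirror-image of recall, sketched minimally. One design choice to flag: the denominator here is the number of documents actually returned (at most k), so a retriever that returns fewer than k documents is not penalized for the shortfall.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)
```

If your prompt budget is fixed at k slots, dividing by k instead is the stricter and arguably more honest variant.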

MRR / nDCG

Ranking quality matters, not just inclusion. If relevant documents are always near rank 8-10, answer quality still degrades.
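MRR is the simpler of the two to hand-roll; a minimal sketch is below (nDCG is omitted for brevity, and in practice both are one call away in libraries like scikit-learn).

```python
def mean_reciprocal_rank(ranked_lists: list[list[str]],
                         relevant_sets: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant document per query.

    Queries where no relevant document appears contribute 0.
    """
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

An MRR near 1.0 means the first relevant document is usually at rank 1; an MRR around 0.2 means it typically sits near rank 5, which is exactly the "relevant but buried" failure mode described above.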

Step 3: Generation metrics

Once retrieval is acceptable, evaluate generation with clear criteria:

  • faithfulness: does the answer stay within provided context?
  • correctness: is the answer factually correct against ground truth?
  • completeness: does it answer the full question, not just part?
  • citation quality: are source references specific and verifiable?
  • format compliance: does output follow required schema/template?

Use deterministic checks where possible, and LLM-as-judge only as a supplement.
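Two of the five criteria are fully deterministic: citation quality and format compliance. A minimal sketch, assuming a hypothetical `[doc-id]` citation convention and a JSON output schema; both conventions are illustrative:

```python
import json
import re


def citations_valid(answer: str, retrieved_ids: set[str]) -> bool:
    """Every [doc-id] citation must point at a document that was retrieved.

    Assumes the hypothetical convention that answers cite sources as [doc-id].
    """
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    return bool(cited) and cited.issubset(retrieved_ids)


def schema_valid(raw_output: str, required_keys=("answer", "citations")) -> bool:
    """Format compliance: output must parse as JSON with the required keys."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)
```

Checks like these run in microseconds and never flake, which is why they should gate releases directly, while LLM-as-judge scores for faithfulness and completeness are better treated as trend signals.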

Step 4: Failure taxonomy

Classify every failure into explicit buckets:

  1. retrieval miss
  2. retrieval noise / wrong ranking
  3. generation fabrication
  4. citation mismatch
  5. prompt routing/tool error

This turns "RAG quality is down" into actionable engineering work.

Step 5: Regression gates

Before each release, compare against baseline:

  • no critical-category regression
  • retrieval recall above floor (for example, Recall@10 >= 0.85)
  • faithfulness and correctness above agreed thresholds
  • latency and cost within budget envelopes

If a release fails gates, block rollout or limit to a canary cohort.
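The gate check itself can be a small pure function in CI. A minimal sketch, assuming flat metric dictionaries; the metric names, the 0.02 regression tolerance, and the 10% latency headroom are all illustrative:

```python
def passes_gates(current: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Compare a candidate release's eval metrics against the baseline.

    Returns (passed, reasons_for_failure). All thresholds are illustrative.
    """
    failures = []
    if current["recall_at_10"] < 0.85:               # absolute floor
        failures.append("recall floor")
    for metric in ("faithfulness", "correctness"):   # no regression beyond tolerance
        if current[metric] < baseline[metric] - 0.02:
            failures.append(f"{metric} regression")
    if current["p95_latency_ms"] > 1.1 * baseline["p95_latency_ms"]:
        failures.append("latency budget")
    return (not failures, failures)
```

Returning the list of reasons, not just a boolean, matters: a blocked release should come with an explanation the owning team can act on immediately.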

Step 6: Production monitoring

Offline evals are necessary but insufficient. In production, monitor:

  • answer quality feedback rate
  • citation click-through/verification behavior
  • hallucination reports by scenario
  • retrieval miss indicators for zero-result or low-confidence queries
  • latency and cost distribution by route/model

Feed production failures back into the golden dataset weekly.

Practical thresholds to start with

Initial thresholds should be realistic, then tightened over time:

  • Recall@10 target: 0.80-0.90 depending on domain difficulty
  • faithfulness: > 0.90 for regulated domains
  • citation-valid answers: > 0.85
  • high-severity query pass rate: > 0.95

The exact values matter less than consistent measurement and trend discipline.

Common mistakes

  • evaluating only end-to-end and not splitting retrieval vs generation
  • using tiny eval sets with no edge-case coverage
  • shipping prompt/model changes without baseline comparison
  • no ownership for updating eval datasets
  • treating eval as a one-time setup instead of a release process

Closing

RAG quality is not a prompt trick. It is an engineering system. Teams that ship with explicit evaluation gates build trust faster, reduce regressions, and iterate with confidence.
