
RAG evaluation: the tests we run before shipping any LLM feature

A production-grade evaluation harness for retrieval-augmented generation. Golden datasets, LLM-as-judge, retrieval metrics, and regression gates.

Anthra AI Team · Engineering Team · 5 min read
Table of contents
  1. What we're evaluating
  2. Step 1: Build a golden dataset
  3. Step 2: Retrieval metrics
  4. Recall@K
  5. Precision@K
  6. MRR / nDCG
  7. Step 3: Generation metrics
  8. Step 4: Failure taxonomy
  9. Step 5: Regression gates
  10. Step 6: Production monitoring
  11. Practical thresholds to start with
  12. Common mistakes
  13. Closing

Most RAG systems ship without evaluation. The team hacks together a prompt, tests a few queries by hand, ships, and waits for bug reports. When the inevitable "it's hallucinating" complaints arrive, they can't systematically fix anything because they have no baseline.

This is the evaluation harness we set up on day one of any RAG engagement. It's not exotic — it's discipline. Skip it at your peril.

What we're evaluating

A RAG system has two components, and you need to evaluate both independently:

  1. Retrieval — given a query, did we fetch the right documents?
  2. Generation — given correct documents, did the LLM produce a correct, grounded answer?

If you only measure the end-to-end output, you can't tell which part is broken. When accuracy drops, is it because retrieval is fetching garbage, or because the model is ignoring good context? You need to know.

The 80/20 of RAG quality

In our experience, 70% of RAG quality problems are retrieval problems. If you only have time for one thing, evaluate retrieval.

Step 1: Build a golden dataset

Before any evaluation, you need ground truth. This is the part teams skip — and it's the most important part.

A golden dataset is a collection of (query, expected_documents, expected_answer) tuples, curated by humans who understand the domain.

For our recent legal tech engagement, we built 200 such tuples across four query types:

  • Factual lookups ("What is the reporting threshold for section 13G?")
  • Comparative questions ("How does rule X differ from rule Y?")
  • Multi-hop reasoning ("If A changed and B references A, what's the effect?")
  • Adversarial / out-of-scope ("Who won the Super Bowl?")

The size matters less than the diversity. 100 good examples beat 1000 mediocre ones. Aim for coverage of your real user queries, including edge cases.

💡How to bootstrap a golden dataset

If you don't have production traffic yet: have 2-3 domain experts spend a day each generating 50-100 queries with expected answers. If you have traffic: sample 200 production queries, have humans answer and document them, curate.
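One way to make the tuple structure concrete is a small schema. This is a minimal sketch, assuming a flat document-ID scheme; the field names and the `GoldenExample` class are illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One human-curated (query, expected_documents, expected_answer) tuple.

    Hypothetical schema for illustration; adapt fields to your corpus.
    """
    query: str
    expected_doc_ids: list[str]  # documents retrieval should surface
    expected_answer: str         # expert-written reference answer
    query_type: str              # "factual" | "comparative" | "multi_hop" | "adversarial"


example = GoldenExample(
    query="What is the reporting threshold for section 13G?",
    expected_doc_ids=["sec-13g-overview", "sec-13g-thresholds"],
    expected_answer="<expert-written answer goes here>",
    query_type="factual",
)
```

Storing these as dataclasses (or JSON lines) keeps the golden set diffable and reviewable in version control, which matters once experts start curating it.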

Step 2: Retrieval metrics

For each query in your golden set, your retrieval system returns a ranked list of documents. Measure it with:

Recall@K

Of the documents that should have been retrieved (the ground truth set), what fraction appear in the top K?

Track at least Recall@5 and Recall@10. Low recall means generation never had a chance.
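The metric itself is a few lines. A minimal sketch, where `retrieved_ids` is the ranked list your retriever returns and `relevant_ids` is the ground-truth set from the golden example:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of ground-truth documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Two of the three relevant docs appear in the top 5.
score = recall_at_k(["a", "b", "c", "d", "e"], ["a", "e", "f"], k=5)
```

Run it over every golden example and report the mean per query type, not just the global average; adversarial and multi-hop queries often hide behind a healthy aggregate.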

Precision@K

How many retrieved documents are actually relevant?

Precision matters when your context window is expensive or noisy. Low precision increases hallucination risk.
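The mirror-image of recall, sketched minimally. One design choice to flag: the denominator here is the number of documents actually returned (at most k), so a retriever that returns fewer than k documents is not penalized for the shortfall.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)
```

If your prompt budget is fixed at k slots, dividing by k instead is the stricter and arguably more honest variant.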

MRR / nDCG

Ranking quality matters, not just inclusion. If relevant documents are always near rank 8-10, answer quality still degrades.
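MRR is the simpler of the two to hand-roll; a minimal sketch is below (nDCG is omitted for brevity, and in practice both are one call away in libraries like scikit-learn).

```python
def mean_reciprocal_rank(ranked_lists: list[list[str]],
                         relevant_sets: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant document per query.

    Queries where no relevant document appears contribute 0.
    """
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

An MRR near 1.0 means the first relevant document is usually at rank 1; an MRR around 0.2 means it typically sits near rank 5, which is exactly the "relevant but buried" failure mode described above.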

Step 3: Generation metrics

Once retrieval is acceptable, evaluate generation with clear criteria:

  • faithfulness: does the answer stay within provided context?
  • correctness: is the answer factually correct against ground truth?
  • completeness: does it answer the full question, not just part?
  • citation quality: are source references specific and verifiable?
  • format compliance: does output follow required schema/template?

Use deterministic checks where possible, and LLM-as-judge only as a supplement.
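Two of the five criteria are fully deterministic: citation quality and format compliance. A minimal sketch, assuming a hypothetical `[doc-id]` citation convention and a JSON output schema; both conventions are illustrative:

```python
import json
import re


def citations_valid(answer: str, retrieved_ids: set[str]) -> bool:
    """Every [doc-id] citation must point at a document that was retrieved.

    Assumes the hypothetical convention that answers cite sources as [doc-id].
    """
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    return bool(cited) and cited.issubset(retrieved_ids)


def schema_valid(raw_output: str, required_keys=("answer", "citations")) -> bool:
    """Format compliance: output must parse as JSON with the required keys."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)
```

Checks like these run in microseconds and never flake, which is why they should gate releases directly, while LLM-as-judge scores for faithfulness and completeness are better treated as trend signals.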

Step 4: Failure taxonomy

Classify every failure into explicit buckets:

  1. retrieval miss
  2. retrieval noise / wrong ranking
  3. generation fabrication
  4. citation mismatch
  5. prompt routing/tool error

This turns "RAG quality is down" into actionable engineering work.

Step 5: Regression gates

Before each release, compare against baseline:

  • no critical-category regression
  • retrieval recall above floor (for example, Recall@10 >= 0.85)
  • faithfulness and correctness above agreed thresholds
  • latency and cost within budget envelopes

If a release fails gates, block rollout or limit to a canary cohort.
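The gate check itself can be a small pure function in CI. A minimal sketch, assuming flat metric dictionaries; the metric names, the 0.02 regression tolerance, and the 10% latency headroom are all illustrative:

```python
def passes_gates(current: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Compare a candidate release's eval metrics against the baseline.

    Returns (passed, reasons_for_failure). All thresholds are illustrative.
    """
    failures = []
    if current["recall_at_10"] < 0.85:               # absolute floor
        failures.append("recall floor")
    for metric in ("faithfulness", "correctness"):   # no regression beyond tolerance
        if current[metric] < baseline[metric] - 0.02:
            failures.append(f"{metric} regression")
    if current["p95_latency_ms"] > 1.1 * baseline["p95_latency_ms"]:
        failures.append("latency budget")
    return (not failures, failures)
```

Returning the list of reasons, not just a boolean, matters: a blocked release should come with an explanation the owning team can act on immediately.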

Step 6: Production monitoring

Offline evals are necessary but insufficient. In production, monitor:

  • answer quality feedback rate
  • citation click-through/verification behavior
  • hallucination reports by scenario
  • retrieval miss indicators for zero-result or low-confidence queries
  • latency and cost distribution by route/model

Feed production failures back into the golden dataset weekly.

Practical thresholds to start with

Initial thresholds should be realistic, then tightened over time:

  • Recall@10 target: 0.80-0.90 depending on domain difficulty
  • faithfulness: > 0.90 for regulated domains
  • citation-valid answers: > 0.85
  • high-severity query pass rate: > 0.95

The exact values matter less than consistent measurement and trend discipline.

Common mistakes

  • evaluating only end-to-end and not splitting retrieval vs generation
  • using tiny eval sets with no edge-case coverage
  • shipping prompt/model changes without baseline comparison
  • no ownership for updating eval datasets
  • treating eval as a one-time setup instead of a release process

Closing

RAG quality is not a prompt trick. It is an engineering system. Teams that ship with explicit evaluation gates build trust faster, reduce regressions, and iterate with confidence.
