Most RAG systems ship without evaluation. The team hacks together a prompt, tests a few queries by hand, ships, and waits for bug reports. When the inevitable "it's hallucinating" complaints arrive, they can't systematically fix anything because they have no baseline.
This is the evaluation harness we set up on day one of any RAG engagement. It's not exotic — it's discipline. Skip it at your peril.
What we're evaluating
A RAG system has two components, and you need to evaluate both independently:
- Retrieval — given a query, did we fetch the right documents?
- Generation — given correct documents, did the LLM produce a correct, grounded answer?
If you only measure the end-to-end output, you can't tell which part is broken. When accuracy drops, is it because retrieval is fetching garbage, or because the model is ignoring good context? You need to know.
In our experience, 70% of RAG quality problems are retrieval problems. If you only have time for one thing, evaluate retrieval.
Step 1: Build a golden dataset
Before any evaluation, you need ground truth. This is the part teams skip — and it's the most important part.
A golden dataset is a collection of (query, expected_documents, expected_answer) tuples, curated by humans who understand the domain.
For our recent legal tech engagement, we built 200 such tuples across four query types:
- Factual lookups ("What is the reporting threshold for section 13G?")
- Comparative questions ("How does rule X differ from rule Y?")
- Multi-hop reasoning ("If A changed and B references A, what's the effect?")
- Adversarial / out-of-scope ("Who won the Super Bowl?")
The size matters less than the diversity. 100 good examples beat 1000 mediocre ones. Aim for coverage of your real user queries, including edge cases.
If you don't have production traffic yet: have 2-3 domain experts spend a day each generating 50-100 queries with expected answers. If you do have traffic: sample 200 production queries, have domain experts write and document the expected answers, then curate.
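Concretely, each golden example can live in a small structured record. Here is a minimal sketch in Python; the field names, document IDs, and severity labels are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One (query, expected_documents, expected_answer) tuple from the golden set."""
    query: str
    expected_doc_ids: list[str]   # ground-truth documents retrieval should surface
    expected_answer: str          # human-written reference answer
    query_type: str               # "factual" | "comparative" | "multi_hop" | "adversarial"
    severity: str = "normal"      # feeds the high-severity pass-rate gate later

example = GoldenExample(
    query="What is the reporting threshold for section 13G?",
    expected_doc_ids=["sec-13g-overview", "sec-13g-thresholds"],   # hypothetical IDs
    expected_answer="<reference answer written by a domain expert>",
    query_type="factual",
    severity="high",
)
```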
Step 2: Retrieval metrics
For each query in your golden set, your retrieval system returns a ranked list of documents. Measure it with:
Recall@K
Of the documents that should have been retrieved (the ground truth set), what fraction appears in the top K?
Track at least Recall@5 and Recall@10. Low recall means generation never had a chance.
Precision@K
Of the top K retrieved documents, what fraction is actually relevant?
Precision matters when your context window is expensive or noisy. Low precision increases hallucination risk.
MRR / nDCG
Ranking quality matters, not just inclusion. MRR tracks how high the first relevant document ranks; nDCG rewards placing the most relevant documents near the top. If relevant documents consistently land around rank 8-10, answer quality still degrades.
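These metrics are a few lines of code once the golden set exists. A minimal sketch, assuming `retrieved` is the ranked list of document IDs your retriever returned and `relevant` is the ground-truth set for that query:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of ground-truth documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Average each metric over the golden set to get the numbers you track per release.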
Step 3: Generation metrics
Once retrieval is acceptable, evaluate generation with clear criteria:
- faithfulness: does the answer stay within provided context?
- correctness: is the answer factually correct against ground truth?
- completeness: does it answer the full question, not just part?
- citation quality: are source references specific and verifiable?
- format compliance: does output follow required schema/template?
Use deterministic checks where possible, and LLM-as-judge only as a supplement.
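Two of these criteria lend themselves to deterministic checks. A sketch, assuming your prompts require JSON output and citation markers of the form `[doc:<id>]` (both are assumptions on our part, not a standard):

```python
import json
import re

REQUIRED_FIELDS = {"answer", "citations"}

def follows_output_schema(raw_output: str) -> bool:
    """Format compliance: output must parse as JSON and carry the required fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed)

def citations_are_grounded(answer_text: str, retrieved_ids: set[str]) -> bool:
    """Citation quality: every [doc:<id>] marker must point at a retrieved document."""
    cited = set(re.findall(r"\[doc:([\w-]+)\]", answer_text))
    return bool(cited) and cited <= retrieved_ids
```

Faithfulness and correctness usually still need human review or an LLM judge; keep those scores separate from the deterministic ones so you know how much to trust each number.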
Step 4: Failure taxonomy
Classify every failure into explicit buckets:
- retrieval miss
- retrieval noise / wrong ranking
- generation fabrication
- citation mismatch
- prompt routing/tool error
This turns "RAG quality is down" into actionable engineering work.
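One way to keep the buckets explicit is to make them an enum and require a category on every failed eval case; the names below simply mirror the list above:

```python
from enum import Enum

class FailureCategory(str, Enum):
    RETRIEVAL_MISS = "retrieval_miss"
    RETRIEVAL_NOISE = "retrieval_noise_or_wrong_ranking"
    GENERATION_FABRICATION = "generation_fabrication"
    CITATION_MISMATCH = "citation_mismatch"
    ROUTING_OR_TOOL_ERROR = "prompt_routing_or_tool_error"
```

Counting failures per category tells you where the next engineering week should go.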
Step 5: Regression gates
Before each release, compare against baseline:
- no critical-category regression
- retrieval recall above floor (for example, Recall@10 >= 0.85)
- faithfulness and correctness above agreed thresholds
- latency and cost within budget envelopes
If a release fails gates, block rollout or limit to a canary cohort.
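A gate check can be as simple as a boolean over the aggregated metrics. This sketch uses example metric names and the thresholds mentioned above; tune the numbers to your domain:

```python
def passes_release_gates(current: dict, baseline: dict) -> bool:
    """Compare a candidate's eval metrics against the current baseline build."""
    checks = [
        current["recall_at_10"] >= 0.85,                                  # absolute recall floor
        current["faithfulness"] >= baseline["faithfulness"] - 0.01,       # no meaningful regression
        current["correctness"] >= baseline["correctness"] - 0.01,
        current["critical_pass_rate"] >= baseline["critical_pass_rate"],  # no critical-category regression
        current["p95_latency_ms"] <= 1.1 * baseline["p95_latency_ms"],    # latency budget
        current["cost_per_query_usd"] <= 1.1 * baseline["cost_per_query_usd"],  # cost budget
    ]
    return all(checks)
```

Run it in CI so the decision is automatic rather than a judgment call made under deadline pressure.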
Step 6: Production monitoring
Offline evals are necessary but insufficient. In production, monitor:
- answer quality feedback rate
- citation click-through/verification behavior
- hallucination reports by scenario
- retrieval miss indicators for zero-result or low-confidence queries
- latency and cost distribution by route/model
Feed production failures back into the golden dataset weekly.
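That weekly loop works best when flagging a production failure is one function call. A minimal sketch; the file path and fields are placeholders for whatever review queue your team uses:

```python
import json
from datetime import date

def flag_for_golden_set(query: str, failure_category: str, notes: str,
                        path: str = "golden_candidates.jsonl") -> None:
    """Append a flagged production failure to a review queue; a domain expert
    adds the expected documents and answer before it enters the golden set."""
    record = {
        "query": query,
        "failure_category": failure_category,
        "notes": notes,
        "flagged_on": date.today().isoformat(),
        "status": "needs_expert_answer",
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```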
Practical thresholds to start with
Initial thresholds should be realistic, then tightened over time:
- Recall@10 target: 0.80-0.90 depending on domain difficulty
- faithfulness: > 0.90 for regulated domains
- citation-valid answers: > 0.85
- high-severity query pass rate: > 0.95
The exact values matter less than consistent measurement and trend discipline.
Common mistakes
- evaluating only end-to-end and not splitting retrieval vs generation
- using tiny eval sets with no edge-case coverage
- shipping prompt/model changes without baseline comparison
- no ownership for updating eval datasets
- treating eval as a one-time setup instead of a release process
Closing
RAG quality is not a prompt trick. It is an engineering system. Teams that ship with explicit evaluation gates build trust faster, reduce regressions, and iterate with confidence.
Related resources
- Capabilities: AI-Native Products and Data Platform
- Case study: LegalTech RAG system
- Deep dive: LLM eval harness walkthrough
Anthra AI Team
Engineering Team
Collective posts from the engineers at Anthra AI. We write about what we build.