
The LLM eval harness we wish we'd built sooner

A walkthrough of the minimal evaluation pipeline we run on every LLM project — code, structure, and the lessons that made us build it.

Anthra AI Team · Engineering Team · 3 min read

If we could give LLM teams one gift, it would be a working eval harness on day one. More than any model choice, prompt library, or retrieval tweak, the thing that determines whether an LLM project succeeds is whether the team can measure quality regressions fast.

Here's a walkthrough of the minimal harness we set up on LLM engagements. It's not fancy — under 500 lines of Python. That's the point. Simple enough to actually maintain, thorough enough to catch real regressions.

Why not just use [tool]?

We've used Ragas, Promptfoo, DeepEval, LangSmith, Humanloop, and more. They're good. So why a hand-rolled harness?

  • Off-the-shelf tools optimize for generic use cases. Your use case is specific.
  • They make simple things easy and custom things hard. Your metrics will be custom.
  • They lock you into their hosting / billing / quotas. Your CI shouldn't break because a vendor had an outage.
  • A harness you write yourself is one your team actually understands and can extend.

The pattern that usually works: start with a custom harness, then adopt library features selectively (Ragas for faithfulness, Promptfoo for CI wrapping) where they earn their place.

The structure

Keep the harness boring and explicit. A practical layout:

evals/
  datasets/
    golden.jsonl
    regression.jsonl
  prompts/
    system.txt
  evaluators/
    retrieval.py
    generation.py
    policy.py
  runners/
    run_local.py
    run_ci.py
  reports/
    latest.json

Each dataset row should include:

  • id
  • scenario
  • input
  • expected_answer (or expected properties)
  • expected_sources (for RAG)
  • severity (low/medium/high business impact)
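
Concretely, a case can be appended to golden.jsonl like this; every field value below is invented for illustration:

import json

case = {
    "id": "golden-0042",
    "scenario": "refund policy lookup",
    "input": "Can I return a custom-engraved order after 30 days?",
    "expected_answer": "No. Custom orders are final sale once engraving starts.",
    "expected_sources": ["policies/returns.md#custom-orders"],
    "severity": "high",
}

# One JSON object per line keeps the dataset diff-friendly in code review.
with open("evals/datasets/golden.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case) + "\n")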

Metrics that matter

Do not track only a single aggregate score. Track per-failure-class metrics:

  • groundedness / citation quality
  • correctness for critical facts
  • schema-valid output rate
  • refusal behavior on unsupported requests
  • latency and cost per request

A single "quality score" hides regressions that matter to users.

Minimal runner pattern

Your runner should:

  1. load eval cases
  2. call the full production pipeline (or close equivalent)
  3. score each case with deterministic + model-based checks
  4. write machine-readable report artifacts
  5. fail CI when thresholds are breached

Key principle: run the same harness in local and CI contexts with only environment-level config differences.
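
A sketch of that runner, trimmed to the five steps; answer_question and score_case stand in for your own pipeline entry point and evaluators, and the threshold is illustrative:

import json
import sys
from pathlib import Path

from pipeline import answer_question           # placeholder: your production entry point
from evaluators.generation import score_case   # placeholder: returns {"passed": bool, ...}

ACCURACY_FLOOR = 0.95  # illustrative gate for high-severity cases

def load_cases(path: str) -> list[dict]:
    # 1. load eval cases from JSONL
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def main() -> int:
    results = []
    for case in load_cases("evals/datasets/golden.jsonl"):
        output = answer_question(case["input"])               # 2. call the real pipeline
        results.append({**case, **score_case(case, output)})  # 3. deterministic + model checks
    high = [r for r in results if r["severity"] == "high"]
    accuracy = sum(r["passed"] for r in high) / max(len(high), 1)
    report = {"n_cases": len(results), "high_severity_accuracy": accuracy, "results": results}
    Path("evals/reports/latest.json").write_text(json.dumps(report, indent=2))  # 4. artifact
    return 1 if accuracy < ACCURACY_FLOOR else 0              # 5. non-zero exit fails CI

if __name__ == "__main__":
    sys.exit(main())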

Gating strategy in CI

Recommended gates:

  • zero tolerance for critical policy violations
  • max allowed drop in high-severity accuracy
  • max allowed increase in per-request cost
  • max latency regression at p95

Start with conservative thresholds, then tighten as confidence grows.
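
One way to encode those gates is a small comparison script; the report field names, thresholds, and the baseline.json it reads are all assumptions for this sketch:

import json
import sys
from pathlib import Path

GATES = {  # illustrative thresholds: start loose, tighten as confidence grows
    "max_critical_policy_violations": 0,
    "max_high_severity_accuracy_drop": 0.02,
    "max_cost_increase_usd": 0.01,
    "max_p95_latency_increase_ms": 500,
}

def check(baseline: dict, latest: dict) -> list[str]:
    """Return the names of any gates the latest report breaches."""
    failures = []
    if latest["critical_policy_violations"] > GATES["max_critical_policy_violations"]:
        failures.append("critical policy violation")
    accuracy_drop = baseline["high_severity_accuracy"] - latest["high_severity_accuracy"]
    if accuracy_drop > GATES["max_high_severity_accuracy_drop"]:
        failures.append("high-severity accuracy regression")
    cost_increase = latest["cost_per_request_usd"] - baseline["cost_per_request_usd"]
    if cost_increase > GATES["max_cost_increase_usd"]:
        failures.append("per-request cost regression")
    latency_increase = latest["p95_latency_ms"] - baseline["p95_latency_ms"]
    if latency_increase > GATES["max_p95_latency_increase_ms"]:
        failures.append("p95 latency regression")
    return failures

if __name__ == "__main__":
    baseline = json.loads(Path("evals/reports/baseline.json").read_text())
    latest = json.loads(Path("evals/reports/latest.json").read_text())
    problems = check(baseline, latest)
    for p in problems:
        print(f"GATE FAILED: {p}")
    sys.exit(1 if problems else 0)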

Human review loop

Automation catches drift; human review catches blind spots.

Weekly review should include:

  • top 10 new failures by impact
  • error taxonomy updates
  • prompt/routing changes and rationale
  • candidate additions to regression set

Every major production incident should create at least one permanent regression test.

Common anti-patterns

  • evaluating only with synthetic prompts
  • changing prompts and model simultaneously (no attribution)
  • relying on one LLM-judge score with no deterministic checks
  • storing no historical reports, so trend analysis is impossible
  • allowing "temporary" bypasses in CI that become permanent

Rollout playbook

  1. Create 50-100 high-signal eval cases.
  2. Add local runner and report artifact generation.
  3. Wire CI in warning mode for one week (see the sketch after this list).
  4. Move to fail-on-regression for high-severity metrics.
  5. Expand dataset as product surface grows.
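
For step 3, warning mode can be a thin wrapper that downgrades a failing exit code while the team builds confidence; EVAL_WARN_ONLY is a flag name invented for this sketch:

import os
import subprocess
import sys

# Run the normal CI eval runner, but only warn while EVAL_WARN_ONLY=1 is set.
result = subprocess.run([sys.executable, "evals/runners/run_ci.py"])
if result.returncode != 0 and os.environ.get("EVAL_WARN_ONLY") == "1":
    print("Eval regressions detected; EVAL_WARN_ONLY=1, so the build will not fail yet.")
    sys.exit(0)
sys.exit(result.returncode)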

The goal is not perfect scoring. The goal is catching obvious quality regressions before users do.

Closing

A simple, owned eval harness is one of the highest-ROI investments in LLM delivery. It turns model iteration from guesswork into engineering.
