
The LLM eval harness we wish we'd built sooner

A walkthrough of the minimal evaluation pipeline we run on every LLM project — code, structure, and the lessons that made us build it.

Anthra AI Team · Engineering Team · 3 min read

If we could give LLM teams one gift, it would be a working eval harness on day one. More than any model choice, prompt library, or retrieval tweak, the thing that determines whether an LLM project succeeds is whether the team can measure quality regressions fast.

Here's a walkthrough of the minimal harness we set up on LLM engagements. It's not fancy — under 500 lines of Python. That's the point. Simple enough to actually maintain, thorough enough to catch real regressions.

Why not just use [tool]?

We've used Ragas, Promptfoo, DeepEval, LangSmith, Humanloop, and more. They're good. So why a hand-rolled harness?

  • Off-the-shelf tools optimize for generic use cases. Your use case is specific.
  • They make simple things easy and custom things hard. Your metrics will be custom.
  • They lock you into their hosting / billing / quotas. Your CI shouldn't break because a vendor had an outage.
  • A harness you write yourself is one your team actually understands and can extend.

The pattern that usually works: start with a custom harness, then adopt library features selectively (Ragas for faithfulness, Promptfoo for CI wrapping) where they earn their place.

The structure

Keep the harness boring and explicit. A practical layout:

evals/
  datasets/
    golden.jsonl
    regression.jsonl
  prompts/
    system.txt
  evaluators/
    retrieval.py
    generation.py
    policy.py
  runners/
    run_local.py
    run_ci.py
  reports/
    latest.json

Each dataset row should include:

  • id
  • scenario
  • input
  • expected_answer (or expected properties)
  • expected_sources (for RAG)
  • severity (low/medium/high business impact)
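
Concretely, a case can be appended to golden.jsonl like this; every field value below is invented for illustration:

import json

case = {
    "id": "golden-0042",
    "scenario": "refund policy lookup",
    "input": "Can I return a custom-engraved order after 30 days?",
    "expected_answer": "No. Custom orders are final sale once engraving starts.",
    "expected_sources": ["policies/returns.md#custom-orders"],
    "severity": "high",
}

# One JSON object per line keeps the dataset diff-friendly in code review.
with open("evals/datasets/golden.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case) + "\n")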

Metrics that matter

Do not track only a single aggregate score. Track per-failure-class metrics:

  • groundedness / citation quality
  • correctness for critical facts
  • schema-valid output rate
  • refusal behavior on unsupported requests
  • latency and cost per request

A single "quality score" hides regressions that matter to users.

Minimal runner pattern

Your runner should:

  1. load eval cases
  2. call the full production pipeline (or close equivalent)
  3. score each case with deterministic + model-based checks
  4. write machine-readable report artifacts
  5. fail CI when thresholds are breached

Key principle: run the same harness in local and CI contexts with only environment-level config differences.
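
A sketch of that runner, trimmed to the five steps; answer_question and score_case stand in for your own pipeline entry point and evaluators, and the threshold is illustrative:

import json
import sys
from pathlib import Path

from pipeline import answer_question           # placeholder: your production entry point
from evaluators.generation import score_case   # placeholder: returns {"passed": bool, ...}

ACCURACY_FLOOR = 0.95  # illustrative gate for high-severity cases

def load_cases(path: str) -> list[dict]:
    # 1. load eval cases from JSONL
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def main() -> int:
    results = []
    for case in load_cases("evals/datasets/golden.jsonl"):
        output = answer_question(case["input"])               # 2. call the real pipeline
        results.append({**case, **score_case(case, output)})  # 3. deterministic + model checks
    high = [r for r in results if r["severity"] == "high"]
    accuracy = sum(r["passed"] for r in high) / max(len(high), 1)
    report = {"n_cases": len(results), "high_severity_accuracy": accuracy, "results": results}
    Path("evals/reports/latest.json").write_text(json.dumps(report, indent=2))  # 4. artifact
    return 1 if accuracy < ACCURACY_FLOOR else 0              # 5. non-zero exit fails CI

if __name__ == "__main__":
    sys.exit(main())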

Gating strategy in CI

Recommended gates:

  • zero tolerance for critical policy violations
  • max allowed drop in high-severity accuracy
  • max allowed increase in per-request cost
  • max latency regression at p95

Start with conservative thresholds, then tighten as confidence grows.
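
One way to encode those gates is a small comparison script; the report field names, thresholds, and the baseline.json it reads are all assumptions for this sketch:

import json
import sys
from pathlib import Path

GATES = {  # illustrative thresholds: start loose, tighten as confidence grows
    "max_critical_policy_violations": 0,
    "max_high_severity_accuracy_drop": 0.02,
    "max_cost_increase_usd": 0.01,
    "max_p95_latency_increase_ms": 500,
}

def check(baseline: dict, latest: dict) -> list[str]:
    """Return the names of any gates the latest report breaches."""
    failures = []
    if latest["critical_policy_violations"] > GATES["max_critical_policy_violations"]:
        failures.append("critical policy violation")
    accuracy_drop = baseline["high_severity_accuracy"] - latest["high_severity_accuracy"]
    if accuracy_drop > GATES["max_high_severity_accuracy_drop"]:
        failures.append("high-severity accuracy regression")
    cost_increase = latest["cost_per_request_usd"] - baseline["cost_per_request_usd"]
    if cost_increase > GATES["max_cost_increase_usd"]:
        failures.append("per-request cost regression")
    latency_increase = latest["p95_latency_ms"] - baseline["p95_latency_ms"]
    if latency_increase > GATES["max_p95_latency_increase_ms"]:
        failures.append("p95 latency regression")
    return failures

if __name__ == "__main__":
    baseline = json.loads(Path("evals/reports/baseline.json").read_text())
    latest = json.loads(Path("evals/reports/latest.json").read_text())
    problems = check(baseline, latest)
    for p in problems:
        print(f"GATE FAILED: {p}")
    sys.exit(1 if problems else 0)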

Human review loop

Automation catches drift; human review catches blind spots.

Weekly review should include:

  • top 10 new failures by impact
  • error taxonomy updates
  • prompt/routing changes and rationale
  • candidate additions to regression set

Every major production incident should create at least one permanent regression test.

Common anti-patterns

  • evaluating only with synthetic prompts
  • changing prompts and model simultaneously (no attribution)
  • relying on one LLM-judge score with no deterministic checks
  • storing no historical reports, so trend analysis is impossible
  • allowing "temporary" bypasses in CI that become permanent

Rollout playbook

  1. Create 50-100 high-signal eval cases.
  2. Add local runner and report artifact generation.
  3. Wire CI in warning mode for one week (see the sketch after this list).
  4. Move to fail-on-regression for high-severity metrics.
  5. Expand dataset as product surface grows.
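
For step 3, warning mode can be a thin wrapper that downgrades a failing exit code while the team builds confidence; EVAL_WARN_ONLY is a flag name invented for this sketch:

import os
import subprocess
import sys

# Run the normal CI eval runner, but only warn while EVAL_WARN_ONLY=1 is set.
result = subprocess.run([sys.executable, "evals/runners/run_ci.py"])
if result.returncode != 0 and os.environ.get("EVAL_WARN_ONLY") == "1":
    print("Eval regressions detected; EVAL_WARN_ONLY=1, so the build will not fail yet.")
    sys.exit(0)
sys.exit(result.returncode)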

The goal is not perfect scoring. The goal is catching obvious quality regressions before users do.

Closing

A simple, owned eval harness is one of the highest-ROI investments in LLM delivery. It turns model iteration from guesswork into engineering.
