LLM Evaluation Systems That Catch Regressions

A model provider ships a new version. Your prompt that worked perfectly last week starts producing subtly wrong outputs, and you find out three days later from a user complaint. A well-meaning engineer tweaks the system prompt to fix one issue and quietly degrades five others. Without an evaluation system, you have no way to see any of this coming.

LLM evaluation is the practice of systematically measuring whether your application produces correct, safe, and consistent outputs across a representative set of inputs. Think of it as unit and integration tests for probabilistic systems: not a perfect guarantee, but the closest thing to one.

Most teams build evaluation as an afterthought, after something breaks badly in production. At Laxaar, we wire up a basic eval harness before the first production deployment and expand it as the application matures. The earlier you start, the cheaper each regression is to fix.

What you'll learn

The eval taxonomy: what you're actually measuring
Building a golden dataset
Deterministic evaluators: the underrated foundation
LLM-as-judge: when and how
Regression detection across model and prompt versions
CI integration: making evals automatic
Production monitoring vs. offline evaluation
Frequently Asked Questions

The eval taxonomy: what you're actually measuring

LLM evaluations fall into three categories, and conflating them leads to confusion about what your eval suite actually tells you.

Correctness evals measure whether the model output matches a known-good answer. There's a ground truth, so you can objectively score against it. They work well for extraction tasks, classification, code generation, and anything with a deterministic right answer.

Quality evals measure subjective dimensions — helpfulness, coherence, tone, completeness — where no single answer is correct. These require human judgment or a proxy for it (often another LLM). They're noisier but necessary for open-ended generation tasks.

Safety evals test whether the model correctly handles inputs it shouldn't act on: jailbreaks, off-topic requests, and sensitive content. These are adversarial by design and require a separate dataset of problem inputs.

Eval type	Ground truth	Typical scorer	Signal reliability
Correctness	Known expected output	Exact match, regex, schema validation	High
Quality	Human preference	LLM-as-judge, human raters	Medium
Safety	Adversarial inputs	Binary pass/fail classifier	High for refusals

Start with correctness evals. They're cheap, fast, and high-signal. Add quality evals when correctness coverage isn't sufficient for your use case.
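
To make the safety row of the table concrete, here is a minimal sketch of a binary pass/fail scorer. The refusal phrases and the should_refuse flag are illustrative assumptions, not part of any library. Phrase matching is a crude proxy for a real classifier, so treat it as a starting point.

```python
import re

# Hypothetical refusal phrases; tune these to how your application actually refuses.
REFUSAL_PATTERNS = [
    r"\bcan(?:'t|not) help with\b",
    r"\bunable to assist\b",
    r"\bnot able to (?:help|assist)\b",
]

def is_refusal(output: str) -> bool:
    """True if the output contains any of the refusal phrases above."""
    return any(re.search(p, output, re.IGNORECASE) for p in REFUSAL_PATTERNS)

def safety_score(output: str, should_refuse: bool) -> float:
    """Binary: 1.0 if the model refused exactly when it was supposed to, else 0.0."""
    return 1.0 if is_refusal(output) == should_refuse else 0.0
```

Scoring benign inputs with should_refuse=False matters as much as scoring the adversarial ones: it catches a model that starts refusing legitimate requests.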

Building a golden dataset

A golden dataset is a curated set of (input, expected output) pairs that represents the full distribution of your application's inputs, including edge cases. It's the foundation of every eval you'll run.

Good golden datasets have three properties:

Coverage. They include typical inputs, edge cases, and the kinds of inputs that have caused failures before.
Correctness. Expected outputs are verified by a human, not generated by the same model you're evaluating.
Stability. The dataset doesn't change frequently; additions are versioned.

How many examples? For most applications, 100–300 examples cover the important cases. Diminishing returns set in quickly after that for correctness evals. Quality and safety evals often need more because the failure modes are more diverse.
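
One way to reason about dataset size is to look at how noisy a pass rate is at a given number of cases. The sketch below uses the standard binomial standard-error formula and assumes binary, independent scores. The 90% pass rate is an illustrative number, not a measurement.

```python
import math

def pass_rate_standard_error(pass_rate: float, n_cases: int) -> float:
    """Standard error of a pass rate measured on n independent binary-scored cases."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)

for n in (50, 100, 300):
    half_width = 1.96 * pass_rate_standard_error(0.90, n)
    print(f"n={n}: 90% pass rate, approx. 95% interval of +/- {half_width:.3f}")
```

At a 90% pass rate the interval is roughly ±0.08 with 50 cases, ±0.06 with 100, and ±0.03 with 300. A small suite is fine for catching obvious breakage but can't resolve small score differences.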

import json
from dataclasses import dataclass, asdict, field
from pathlib import Path
from datetime import date

@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str  # or expected_fields for extraction tasks
    category: str  # e.g., "typical", "edge_case", "adversarial"
    notes: str = ""
    # default_factory runs per instance; a plain default would freeze the import-time date
    added_date: str = field(default_factory=lambda: date.today().isoformat())

class GoldenDataset:
    def __init__(self, path: str):
        self.path = Path(path)
        self.cases: list[EvalCase] = []
        if self.path.exists():
            self._load()

    def _load(self):
        with open(self.path) as f:
            raw = json.load(f)
        self.cases = [EvalCase(**c) for c in raw["cases"]]

    def add_case(self, case: EvalCase):
        # Check for duplicate IDs
        existing_ids = {c.id for c in self.cases}
        if case.id in existing_ids:
            raise ValueError(f"Case ID {case.id!r} already exists.")
        self.cases.append(case)

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with open(self.path, "w") as f:
            json.dump({"cases": [asdict(c) for c in self.cases]}, f, indent=2)

    def filter_by_category(self, category: str) -> list[EvalCase]:
        return [c for c in self.cases if c.category == category]

Seed the dataset using your application logs. Find the inputs your application actually receives, sample them across the distribution, and have a human verify the expected outputs. Don't use the model to generate expected outputs for its own eval dataset. You'll bake in the model's biases and miss the cases where it's systematically wrong.
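
A minimal sketch of the sampling step, assuming each log record is a dict with "input" and "category" keys (a hypothetical shape; adapt it to your logging format). It draws a fixed number of records per category so rare categories aren't drowned out by common ones. The expected outputs are still filled in by a human afterwards.

```python
import random
from collections import defaultdict

def stratified_sample(log_records: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    """Pick up to per_category records from each category for human labeling."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category: dict[str, list[dict]] = defaultdict(list)
    for record in log_records:
        by_category[record["category"]].append(record)

    sample = []
    for category in sorted(by_category):
        records = by_category[category]
        # Small categories contribute everything they have
        sample.extend(rng.sample(records, min(per_category, len(records))))
    return sample
```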

Deterministic evaluators: the underrated foundation

Before reaching for LLM-as-judge, ask whether a deterministic evaluator can do the job. Deterministic evaluators are fast, cheap, perfectly consistent, and trivial to integrate into CI. Teams underuse them because they seem too simple.

The evaluators that cover more ground than you'd expect:

import json
import re
from typing import Any

def exact_match(actual: str, expected: str) -> float:
    """Binary: 1.0 if strings match after normalization, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(actual) == normalize(expected) else 0.0

def json_field_match(actual: str, expected_fields: dict[str, Any]) -> float:
    """Score JSON output against expected field values."""
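    try:
        parsed = json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object with fields

    scores = []
    for field, expected_value in expected_fields.items():
        actual_value = parsed.get(field)
        if isinstance(expected_value, str):
            # Case-insensitive comparison; a missing (None) value never matches
            matches = (
                actual_value is not None
                and str(actual_value).strip().lower() == expected_value.strip().lower()
            )
            scores.append(1.0 if matches else 0.0)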
        elif expected_value is None:
            scores.append(1.0 if actual_value is None else 0.0)
        else:
            scores.append(1.0 if actual_value == expected_value else 0.0)

    return sum(scores) / len(scores) if scores else 0.0

def contains_required_fields(actual: str, required_fields: list[str]) -> float:
    """Check that JSON output contains all required non-null fields."""
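    try:
        parsed = json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    # Guard against non-object JSON and an empty field list (division by zero)
    if not isinstance(parsed, dict) or not required_fields:
        return 0.0

    present = sum(1 for f in required_fields if parsed.get(f) is not None)
    return present / len(required_fields)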

def regex_match(actual: str, pattern: str) -> float:
    """Binary: 1.0 if output matches pattern."""
    return 1.0 if re.search(pattern, actual, re.IGNORECASE) else 0.0

def is_valid_schema(actual: str, schema: dict) -> float:
    """Validate JSON output against a JSON Schema."""
    import jsonschema
    try:
        parsed = json.loads(actual)
        jsonschema.validate(parsed, schema)
        return 1.0
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0.0

For extraction tasks, json_field_match is often sufficient. For classification, exact_match after normalization works. For code generation, you can run the code and check that it executes without error and produces the right output. None of these require an LLM.
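
The code-generation check can be sketched as follows, assuming the generated code is a standalone Python script whose correct behavior is fully captured by its stdout. The function name and the timeout default are illustrative. A subprocess is not a sandbox, so run untrusted model output inside a container or similar isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def code_execution_match(generated_code: str, expected_stdout: str, timeout_s: float = 5.0) -> float:
    """Binary: 1.0 if the script exits cleanly and prints the expected output, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # infinite loops and hangs count as failures

    if proc.returncode != 0:
        return 0.0  # uncaught exception or non-zero exit
    return 1.0 if proc.stdout.strip() == expected_stdout.strip() else 0.0
```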

LLM-as-judge: when and how

LLM-as-judge is an evaluation approach where a second LLM scores the output of your application. It's the right tool for quality dimensions that are hard to specify with rules: tone, helpfulness, factual consistency with a reference document, absence of hallucination.

The risks are real. LLM judges are biased toward their own outputs, prefer longer responses, and can be inconsistent on borderline cases. These problems are manageable with the right setup.

import json

from anthropic import Anthropic

client = Anthropic()

JUDGE_SYSTEM = """
You are an impartial evaluator assessing the quality of an AI assistant's response.

You will be given:
- TASK: what the assistant was asked to do
- CONTEXT: any source documents or data the assistant had access to
- RESPONSE: the assistant's actual response

Score the response on the following dimensions (each 1–5):
1. Accuracy: Does the response correctly answer the task? Are facts consistent with the context?
2. Completeness: Does the response address all parts of the task?
3. Groundedness: Does the response stay within the information in the context, or does it add unsupported claims?

Return your evaluation as JSON with keys: accuracy, completeness, groundedness, reasoning.
The reasoning field should be 2–3 sentences explaining your scores.
Return JSON only — no preamble.
"""

def llm_judge(task: str, context: str, response: str) -> dict:
    user_message = f"""TASK: {task}

CONTEXT:
{context}

RESPONSE:
{response}"""

    result = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": user_message}],
    )

    return json.loads(result.content[0].text)
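
Judges don't always follow the "JSON only" instruction, so it can help to validate the raw text before trusting the scores. The helper below is a sketch under that assumption. The name parse_judge_output is hypothetical, and the key names and 1–5 range are taken from the judge prompt above.

```python
import json

SCORE_KEYS = ("accuracy", "completeness", "groundedness")
FENCE = "`" * 3  # Markdown code fence, which some models wrap around JSON

def parse_judge_output(text: str) -> dict:
    """Parse judge output, raising ValueError if it is malformed or out of range."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge did not return valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Judge output must be a JSON object.")

    for key in SCORE_KEYS:
        value = parsed.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"Score {key!r} is missing or outside 1-5: {value!r}")
    return parsed
```

A malformed judgment is better recorded as an eval error than silently scored as zero, because a zero would look like a regression in your application rather than in the judge.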

A few practices that make LLM-as-judge more reliable:

Use a different model than the one you're evaluating. Judges tend to be lenient on their own outputs.
Break evaluation into specific, narrow criteria rather than asking for an overall score. Narrow criteria produce more consistent scores.
Include reference outputs when you have them. "Is this response better, worse, or equivalent to this reference?" is more consistent than an absolute score.
Run the same eval case 3 times and average. LLM judges have real variance. Averaging tames it.
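
The averaging practice can be sketched as a thin wrapper around any judge function. Here the judge is passed in as a zero-argument callable (for example, a lambda that calls llm_judge with fixed arguments), and the wrapper name and return shape are assumptions for illustration.

```python
import statistics
from typing import Callable

def averaged_judge(
    judge: Callable[[], dict],
    dimensions: tuple[str, ...] = ("accuracy", "completeness", "groundedness"),
    n_runs: int = 3,
) -> dict:
    """Call the judge n_runs times and report the mean and spread per dimension."""
    runs = [judge() for _ in range(n_runs)]
    summary = {}
    for dim in dimensions:
        scores = [run[dim] for run in runs]
        summary[dim] = {
            "mean": statistics.mean(scores),
            # A large spread flags a borderline case worth a human look
            "spread": max(scores) - min(scores),
        }
    return summary
```

Usage would look like averaged_judge(lambda: llm_judge(task, context, response)). Keeping the spread alongside the mean tells you which cases the judge itself is unsure about.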

Regression detection across model and prompt versions

The core eval workflow is comparison: run the eval suite on version A, make a change, run the suite on version B, and compare. A regression is when B's score is meaningfully lower than A's on any important dimension.

"Meaningfully lower" needs a threshold. Don't set it at zero. Natural variance in model outputs means a tiny score difference on a small dataset isn't significant. A reasonable default on a 0–1 score scale: flag a regression if the aggregate score drops by more than 0.03 (3 percentage points) or if any single category drops by more than 0.10.

from dataclasses import dataclass
import statistics

@dataclass
class EvalResult:
    case_id: str
    score: float
    category: str

def compare_versions(
    baseline_results: list[EvalResult],
    candidate_results: list[EvalResult],
    regression_threshold: float = 0.03,
    category_threshold: float = 0.10,
) -> dict:
    baseline_by_id = {r.case_id: r for r in baseline_results}
    candidate_by_id = {r.case_id: r for r in candidate_results}

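    shared_ids = set(baseline_by_id) & set(candidate_by_id)
    if not shared_ids:
        # statistics.mean raises on empty input, so fail with a clearer message
        raise ValueError("Baseline and candidate runs share no case IDs.")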

    baseline_scores = [baseline_by_id[i].score for i in shared_ids]
    candidate_scores = [candidate_by_id[i].score for i in shared_ids]

    baseline_mean = statistics.mean(baseline_scores)
    candidate_mean = statistics.mean(candidate_scores)
    overall_delta = candidate_mean - baseline_mean

    # Per-category analysis
    categories = {r.category for r in baseline_results}
    category_deltas = {}
    for cat in categories:
        cat_baseline = [baseline_by_id[i].score for i in shared_ids if baseline_by_id[i].category == cat]
        cat_candidate = [candidate_by_id[i].score for i in shared_ids if candidate_by_id[i].category == cat]
        if cat_baseline and cat_candidate:
            category_deltas[cat] = statistics.mean(cat_candidate) - statistics.mean(cat_baseline)

    regressions = {
        "overall": overall_delta < -regression_threshold,
        "categories": {cat: delta < -category_threshold for cat, delta in category_deltas.items()},
    }

    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "overall_delta": overall_delta,
        "category_deltas": category_deltas,
        # Flag when the aggregate OR any single category crosses its threshold
        "regressions_detected": regressions["overall"] or any(regressions["categories"].values()),
        "detail": regressions,
    }
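
Fixed thresholds don't account for dataset size. One complementary option, not part of the workflow above, is a paired bootstrap over per-case score differences. It estimates how much of an observed delta could be noise. The function name, resample count, and 95% interval below are assumptions chosen for illustration.

```python
import random
import statistics

def paired_bootstrap_delta(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean_delta, lower, upper) for candidate - baseline, with an approx. 95% interval.

    Both lists must be aligned so that index i refers to the same eval case.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("Score lists must be non-empty and the same length.")
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)

    # Resample cases with replacement and record the mean delta each time
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return statistics.mean(deltas), lower, upper
```

If the upper bound is below zero, the drop is unlikely to be noise. If the interval straddles zero, the difference may not be real, whatever the point estimate says.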

Version tagging matters here. Tag every eval run with the model version, prompt version, and retrieval pipeline version. When a regression appears, you need to know which dimension changed. Teams that lump these together can't isolate the root cause.
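
A minimal sketch of run tagging, assuming versions are tracked as plain strings. The RunMetadata name, the dataset_version field, and the hash-based prompt fingerprint are illustrative additions rather than a prescribed scheme.

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunMetadata:
    model_version: str
    prompt_version: str
    retrieval_version: str
    dataset_version: str  # assumption: the golden dataset is versioned too

def prompt_fingerprint(prompt_text: str) -> str:
    """Short content hash, usable as a prompt version when prompts aren't tagged by hand."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

def changed_dimensions(baseline: RunMetadata, candidate: RunMetadata) -> list[str]:
    """List which version fields differ between two eval runs."""
    before, after = asdict(baseline), asdict(candidate)
    return [name for name in before if before[name] != after[name]]
```

If changed_dimensions returns more than one field, the comparison can't attribute a score change to a single cause, which is a good reason to rerun with one change at a time.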

CI integration: making evals automatic

An eval suite that runs manually gets run inconsistently. Wire it into your CI pipeline so every pull request that changes a prompt, system configuration, or retrieval pipeline automatically runs the eval suite.

The practical setup with GitHub Actions:

# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/retrieval/**'
      - 'src/agents/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m evals.run \
            --dataset evals/datasets/golden.json \
            --output evals/results/pr-${{ github.event.number }}.json \
            --baseline evals/results/baseline.json \
            --fail-on-regression

      - name: Comment results on PR
        # Run even when the eval step fails, so regressions still get a comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('evals/results/pr-${{ github.event.number }}.json'));
            const body = `## LLM Eval Results\n\n` +
              `Baseline: ${results.baseline_mean.toFixed(3)} | Candidate: ${results.candidate_mean.toFixed(3)} | Delta: ${results.overall_delta > 0 ? '+' : ''}${results.overall_delta.toFixed(3)}\n\n` +
              (results.regressions_detected ? '**Regression detected. Review before merging.**' : 'No regression detected.');
            github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body });

Keep the CI eval suite fast. If it takes 20 minutes, engineers will start skipping it. Use a subset of your golden dataset (50–100 cases) for the PR gate and run the full suite nightly. Cache model responses where possible; deterministic evaluators don't need a live model call.
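
Both ideas can be sketched in a few lines: a stable subset chosen by hashing case IDs, so the PR gate runs the same cases every time, and an in-memory response cache keyed on everything that affects the output. The names in_ci_subset and ResponseCache, and the 25% fraction, are hypothetical. A real setup would likely persist the cache between CI runs.

```python
import hashlib
import json
from typing import Callable

def in_ci_subset(case_id: str, fraction: float = 0.25) -> bool:
    """Deterministically assign a case to the PR-gate subset based on its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # uniform in [0, 2**32)
    return bucket < fraction * 2**32

def cache_key(model_version: str, prompt: str, case_input: str) -> str:
    """Hash every input that can change the model's response."""
    payload = json.dumps([model_version, prompt, case_input])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_call(
        self,
        model_version: str,
        prompt: str,
        case_input: str,
        call_model: Callable[[str, str], str],
    ) -> str:
        """Return the cached response, calling the model only on a cache miss."""
        key = cache_key(model_version, prompt, case_input)
        if key not in self._store:
            self._store[key] = call_model(prompt, case_input)
        return self._store[key]
```

Because the key includes the prompt and model version, a PR that changes either one misses the cache and triggers fresh calls, while a PR that only touches an evaluator reuses stored responses.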

Production monitoring vs. offline evaluation

Offline evaluation (running your eval suite against a fixed dataset) tells you how the system performs on inputs you anticipated. Production monitoring tells you how it performs on inputs you didn't.

Both are necessary. The feedback loop goes both ways: production failures should be added to the golden dataset so you don't regress on the same failure twice.

The production monitoring setup Laxaar typically builds:

Structured logging. Every LLM call logs the input, output, model version, latency, and token count.
Sampled judge scoring. Not every call, but 1–5% sampled by category, routed to a lightweight judge.
Anomaly detection on aggregate metrics. If the mean judge score drops 10% week-over-week, that's a signal worth investigating.
A failure review process. Weekly review of low-scoring outputs, with a decision on whether each should become a new eval case.
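
The sampling and anomaly-detection steps above can be sketched as follows. The per-category rates and function names are illustrative assumptions. The 10% relative drop mirrors the week-over-week figure in the list.

```python
import random
import statistics

# Hypothetical per-category sample rates within the 1-5% range
SAMPLE_RATES = {"typical": 0.01, "edge_case": 0.05}
DEFAULT_RATE = 0.02

def should_sample(category: str, rng: random.Random) -> bool:
    """Decide whether to route this call to the judge, based on its category's rate."""
    return rng.random() < SAMPLE_RATES.get(category, DEFAULT_RATE)

def mean_score_dropped(
    previous_week: list[float],
    current_week: list[float],
    max_relative_drop: float = 0.10,
) -> bool:
    """True if the mean judge score fell by more than max_relative_drop week-over-week."""
    if not previous_week or not current_week:
        return False  # not enough data to compare
    previous_mean = statistics.mean(previous_week)
    current_mean = statistics.mean(current_week)
    if previous_mean <= 0:
        return False
    return (previous_mean - current_mean) / previous_mean > max_relative_drop
```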

For context on the infrastructure that supports this monitoring at scale, see our AI infrastructure for production LLM apps guide. And if you're building the retrieval layer that feeds your eval inputs, AI data pipelines covers the storage and indexing patterns we use at Laxaar.

Frequently Asked Questions

How many eval cases do I actually need to get started?

Fifty is enough to start. The goal of the first eval suite isn't exhaustive coverage. It's catching obvious regressions quickly. Pick 30 typical cases, 10 edge cases, and 10 cases that have previously caused failures. Run it, baseline it, and add to it over time. A 50-case suite that runs on every PR is worth more than a 500-case suite that nobody runs.

Can I use the same model to generate my golden dataset expected outputs?

Not reliably. The model will produce outputs consistent with its current behavior, including its current failure modes. When the model improves (or changes), its outputs will diverge from the dataset in ways that look like regressions even when the new outputs are actually better. Human-verified expected outputs are more stable and give you a fixed reference point to measure against.

How do I evaluate open-ended generation where there's no single right answer?

Use LLM-as-judge with specific, narrow criteria (see the section above). For content quality, evaluate dimensions like "does the response stay on topic," "does it contain factual errors," and "does it follow the requested format" rather than asking for an overall score. Pair this with a human spot-check process: review 10–20 outputs manually each week to calibrate your judge against human judgment.

What's a reasonable eval budget for a production team?

Plan for 1–5% of your inference cost. A team running $10k/month in inference should budget $100–500/month for eval runs. Most of that goes toward LLM-as-judge calls. Deterministic evaluators are essentially free. If cost is a constraint, prioritize deterministic evals and use LLM-as-judge only for the cases where deterministic scoring isn't possible.

Should evals block deployment or just warn?

Block on clear regressions (overall score drops more than your threshold, or a safety eval fails). Warn on ambiguous cases (small deltas, single-category drops that might be noise). The goal is to make the CI check useful without making it too sensitive to noise. A gate that trips on every minor fluctuation trains engineers to ignore it.
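
As a sketch, the block-versus-warn policy can be written as a small decision function. The thresholds reuse the defaults from the regression section. The function name and the three return values are assumptions for illustration.

```python
def gate_decision(
    overall_delta: float,
    category_deltas: dict[str, float],
    safety_failures: int,
    overall_threshold: float = 0.03,
    category_threshold: float = 0.10,
) -> str:
    """Return "block", "warn", or "pass" for a candidate change."""
    # Clear regressions: any safety failure, or an aggregate drop past the threshold
    if safety_failures > 0 or overall_delta < -overall_threshold:
        return "block"
    # Ambiguous: a small aggregate dip, or a single-category drop that may be noise
    category_drop = any(d < -category_threshold for d in category_deltas.values())
    if overall_delta < 0 or category_drop:
        return "warn"
    return "pass"
```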

How do I handle model provider updates that change output behavior?

Pin your model versions explicitly (e.g., gpt-4o-2024-08-06, not gpt-4o). Run your eval suite against the new version before updating the pin in production. Treat a model version bump like any other dependency update: test it, compare results, and ship it as a deliberate change rather than letting the provider update it under you. Most providers support version pinning for at least several months after a new version ships.

Evaluation is the discipline that separates LLM prototypes from production systems. If you want to build AI applications that you can ship with confidence, get in touch with Laxaar. Evaluation infrastructure is a first-class part of every system we build.