Evaluating AI Agents: Frameworks and Metrics

Most agent projects feel like they're working right up until they're not. The demos pass. The hand-picked test cases pass. Then a real user runs the agent on a task it almost handles, and it loops, hallucinates a tool call, or stops with a polite apology three steps short of done. Agent evaluation is the discipline that catches those failures before they reach production, and it's harder than evaluating a classifier or a RAG pipeline.

Agent evaluation means measuring whether an agent completes its assigned tasks correctly, reliably, and within acceptable resource limits. The "correctly" part is what makes it difficult: agents take multi-step paths, and intermediate steps can be wrong even when the final answer looks right. A run where a tool call breaks and the agent happens to recover shouldn't automatically pass your evaluation.

At Laxaar we've shipped agents for document workflows, customer support routing, and research automation. The patterns below come from what we've learned running these systems against real workloads, not synthetic benchmarks.

What you'll learn

Why agent evaluation is different from model evaluation
Core metrics every agent system needs
Evaluation frameworks compared
Building a task trace evaluator
LLM-as-judge for trajectory scoring
Running evaluation in CI
Frequently Asked Questions

Why agent evaluation is different from model evaluation

Model evaluation is point-in-time: you pass in a prompt, you get a response, you score it. Agent evaluation is sequential: the agent takes an action, the environment responds, the agent takes another action based on that response, and errors compound across steps.

This creates three problems that don't exist in standard LLM evaluation:

State dependency. Step 5's correctness depends on steps 1–4. You can't evaluate steps independently without losing the context that makes them meaningful.

Trajectory variation. Two agents can reach the same correct answer via completely different tool-call sequences. Your evaluation needs to handle this without penalizing valid alternative paths.

Non-determinism. Temperature, tool call ordering, and model version updates all introduce run-to-run variation. A single eval run tells you almost nothing; you need statistical aggregation over multiple runs.

The implication: you need both outcome metrics (did the task complete successfully?) and trajectory metrics (was the path taken reasonable?). Optimizing only for outcome lets agents find brittle shortcuts. Optimizing only for trajectory creates agents that follow the right process but produce wrong results.
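Because a single run says so little, one way to put the statistical aggregation into practice is to repeat each task and report the pass rate across runs. The following is a minimal sketch, not a prescribed harness: `aggregate_runs` and `flaky_agent` are hypothetical names, and `flaky_agent` stands in for a real agent invocation that returns True on success.

```python
import random
from typing import Callable


def aggregate_runs(run_task: Callable[[], bool], n_runs: int = 10) -> dict:
    """Run the same task n_runs times and summarize the outcomes."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    outcomes = [bool(run_task()) for _ in range(n_runs)]
    return {
        "n_runs": n_runs,
        "pass_rate": sum(outcomes) / n_runs,
        "passed_any": any(outcomes),   # succeeded at least once
        "passed_all": all(outcomes),   # succeeded on every run
    }


# Hypothetical stand-in for a real agent run: succeeds about 80% of the time.
_rng = random.Random(0)


def flaky_agent() -> bool:
    return _rng.random() < 0.8


summary = aggregate_runs(flaky_agent, n_runs=20)
print(summary)
```

Reporting `passed_any` and `passed_all` side by side is a design choice: a task that passes sometimes but not always is a reliability problem, not a capability problem, and the two call for different fixes.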

Core metrics every agent system needs

These are the metrics Laxaar tracks in every agent deployment:

Task success rate. The fraction of tasks the agent completes correctly end-to-end. Define "correct" before you start. For a research agent, is it enough to return an answer, or does it need cited sources? Ambiguity in the definition makes this metric useless.

Tool call accuracy. What fraction of tool calls are valid (correct function name, correct parameters, parameters that pass schema validation)? A high task success rate with low tool call accuracy usually means the agent is recovering from its own mistakes, which is expensive and fragile.

Step efficiency. Average steps to complete a task vs. the minimum required. An agent that takes 12 steps to do a 5-step task isn't broken, but it's expensive and slow. Track this separately from success rate.

Latency percentiles. p50, p90, p99 wall-clock time per task completion. Agent latency is spiky; a tool call that times out adds a long tail. p99 matters more than average for user-facing systems.

Cost per task. Token cost plus tool API costs. This tends to creep up unnoticed until the bill forces a look.

Error type distribution. Categorize failures: hallucinated tool calls, tool execution errors, context window exhaustion, infinite loops, wrong final answer. The distribution tells you where to invest engineering effort.

Metric	What it catches	Review cadence
Task success rate	End-to-end failures	Every deploy
Tool call accuracy	Malformed tool use	Every deploy
Step efficiency	Unnecessary loops	Weekly
p99 latency	Timeout-driven spikes	Daily
Cost per task	Token/API budget drift	Daily
Error type distribution	Systemic failure modes	Weekly
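The latency, cost, and error-distribution rows above can be computed from plain per-run records. This is a sketch under assumptions: the record fields (`duration_ms`, `cost_usd`, `error_type`) and the error labels are illustrative, and the percentile uses the simple nearest-rank method rather than any particular monitoring tool's definition.

```python
import math
from collections import Counter


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_runs(runs: list[dict]) -> dict:
    """Summarize latency percentiles, average cost, and failure categories."""
    latencies = [r["duration_ms"] for r in runs]
    errors = Counter(r["error_type"] for r in runs if r["error_type"])
    return {
        "p50_ms": percentile(latencies, 50),
        "p90_ms": percentile(latencies, 90),
        "p99_ms": percentile(latencies, 99),
        "avg_cost_usd": sum(r["cost_usd"] for r in runs) / len(runs),
        "error_types": dict(errors.most_common()),
    }


# Illustrative data: nine ordinary runs plus one run that hit a tool timeout.
runs = [
    {"duration_ms": 100.0 * (i + 1), "cost_usd": 0.02, "error_type": None}
    for i in range(9)
]
runs.append({"duration_ms": 30_000.0, "cost_usd": 0.02, "error_type": "tool_execution_error"})
runs[3]["error_type"] = "tool_execution_error"
runs[7]["error_type"] = "infinite_loop"

report = summarize_runs(runs)
print(report)
```

In this made-up data the single timeout leaves p50 and p90 untouched and lands entirely in p99, which is the spiky-tail behavior the latency metric is meant to expose.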

Evaluation frameworks compared

Three categories of tooling exist for agent evaluation, and they serve different needs.

LangSmith (LangChain's observability and eval platform) is the most integrated option if you're already using LangChain or LangGraph. It captures full traces automatically (every LLM call, every tool call, every state transition) and lets you define evaluators that run against those traces. The UI is good for exploring failure cases. The limitation is that it's tightly coupled to the LangChain ecosystem; grafting it onto a non-LangChain agent requires manual trace logging.

Braintrust is framework-agnostic and built specifically for LLM evaluation. You define datasets, run experiments against them, and get statistical comparison across runs and model versions. It's particularly strong for A/B-style evaluation (comparing two agent versions on the same task set). The scoring API is flexible enough to handle custom trajectory evaluators.

Custom harnesses are the right answer when your tasks are domain-specific enough that generic evaluation tools don't capture correctness. A customer support agent needs domain-specific success criteria; a generic "did it follow instructions" evaluator will miss the cases that matter. Building a custom harness takes more upfront work but produces more reliable signal.

Framework	Best for	Ecosystem fit	Trajectory support
LangSmith	LangChain/LangGraph agents	LangChain	Yes, via trace inspection
Braintrust	A/B comparison, multi-version eval	Framework-agnostic	Yes, custom scorers
Custom harness	Domain-specific correctness	Any	Full control
AgentEvals (OpenAI)	OpenAI Agents SDK deployments	OpenAI	Limited

Our opinion: start with LangSmith or Braintrust to get traces and basic metrics fast, then build custom evaluators on top for the domain-specific cases. Don't try to build a full evaluation harness before you know what failure modes you're actually seeing.

Building a task trace evaluator

A task trace is the complete record of an agent run: the input, every tool call and its result, intermediate reasoning (if the model exposes it), and the final output. Evaluation happens against traces, not live agent runs. You capture traces in production, then replay them through evaluators offline.

# Minimal task trace structure and evaluator (Python 3.11+)
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ToolCall:
    name: str
    args: dict
    result: str | None
    error: str | None = None
    duration_ms: float = 0.0

@dataclass
class TaskTrace:
    task_id: str
    input: str
    expected_output: str | None  # None for open-ended tasks
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
    outcome: Literal["success", "failure", "partial"] = "failure"
    total_tokens: int = 0
    duration_ms: float = 0.0

def evaluate_tool_accuracy(trace: TaskTrace) -> float:
    """Returns the fraction of tool calls that completed without error."""
    if not trace.tool_calls:
        return 1.0
    successful = sum(1 for tc in trace.tool_calls if tc.error is None)
    return successful / len(trace.tool_calls)

def evaluate_step_efficiency(trace: TaskTrace, min_steps: int) -> float:
    """Returns 1.0 at or below min_steps, then decays linearly to 0.0 at 2x min_steps."""
    if min_steps < 1:
        raise ValueError("min_steps must be at least 1")
    actual = len(trace.tool_calls)
    if actual <= min_steps:
        return 1.0
    # Score decays as steps exceed the minimum; floor at 0.0
    return max(0.0, 1.0 - (actual - min_steps) / min_steps)

def run_evaluation_suite(traces: list[TaskTrace], min_steps_per_task: dict[str, int]) -> dict:
    if not traces:
        raise ValueError("traces must be non-empty")
    success_rate = sum(1 for t in traces if t.outcome == "success") / len(traces)
    tool_accuracy = sum(evaluate_tool_accuracy(t) for t in traces) / len(traces)
    avg_efficiency = sum(
        evaluate_step_efficiency(t, min_steps_per_task.get(t.task_id, 1))
        for t in traces
    ) / len(traces)
    
    return {
        "task_success_rate": round(success_rate, 3),
        "tool_call_accuracy": round(tool_accuracy, 3),
        "avg_step_efficiency": round(avg_efficiency, 3),
        "total_traces": len(traces),
    }

The expected_output field is optional because many agent tasks don't have a single correct answer. For those, you need LLM-as-judge scoring on the final output, covered in the next section.
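Note that evaluate_tool_accuracy above counts a call as good whenever it ran without error. The stricter definition from the metrics section (known function name, correct parameters, values that pass schema validation) can be checked separately. Here is a minimal sketch with no external schema library; `TOOL_SCHEMAS` and the two tools in it are hypothetical, and a real system would more likely validate against its actual JSON Schema definitions.

```python
# Hypothetical tool registry: tool name -> required parameter names and types.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "search_docs": {"query": str, "top_k": int},
    "send_reply": {"ticket_id": str, "body": str},
}


def is_valid_call(name: str, args: dict) -> bool:
    """True if the tool exists and args match its parameter names and types."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False  # hallucinated tool name
    if set(args) != set(schema):
        return False  # missing or unexpected parameters
    return all(isinstance(args[key], expected) for key, expected in schema.items())


def tool_call_validity(calls: list[tuple[str, dict]]) -> float:
    """Fraction of (name, args) calls that are well-formed."""
    if not calls:
        return 1.0
    return sum(is_valid_call(name, args) for name, args in calls) / len(calls)


calls = [
    ("search_docs", {"query": "refund policy", "top_k": 5}),  # valid
    ("lookup_customer", {"id": "42"}),                        # unknown tool
    ("search_docs", {"query": "refund policy", "top_k": "5"}),  # wrong type
]
validity = tool_call_validity(calls)
print(round(validity, 3))
```

Tracking this number next to the error-based one helps separate calls that were malformed from calls that were well-formed but failed at execution time.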

LLM-as-judge for trajectory scoring

For tasks where correctness isn't binary, you need a model to score the agent's output and path. LLM-as-judge means using a separate model call (usually a stronger or differently-prompted model) to evaluate the agent's work.

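# LLM-as-judge evaluator using Anthropic's Claude
# Requires the anthropic package and ANTHROPIC_API_KEY in the environment;
# TaskTrace is the dataclass defined in the previous section.
import anthropic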

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are evaluating an AI agent's task completion.

Task: {task}

Agent's final output:
{output}

Agent's tool calls (in order):
{tool_calls}

Score the agent on two dimensions:
1. Output quality (0-10): Does the final output correctly and completely address the task?
2. Trajectory quality (0-10): Was the sequence of tool calls logical and efficient?

Respond in this exact format:
OUTPUT_SCORE: <number>
TRAJECTORY_SCORE: <number>
REASONING: <one sentence explaining each score>"""

def judge_trace(trace: TaskTrace) -> dict:
    tool_summary = "\n".join(
        f"  {i+1}. {tc.name}({tc.args}) -> {'ERROR: ' + tc.error if tc.error else 'OK'}"
        for i, tc in enumerate(trace.tool_calls)
    )
    
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                task=trace.input,
                output=trace.final_output,
                tool_calls=tool_summary,
            )
        }]
    )
    
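    text = response.content[0].text
    # Collect "KEY: value" lines from the judge's reply
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    def parse_score(raw: str | None) -> float | None:
        """Parse '8' or '8/10'; return None when the judge gave no usable number."""
        try:
            return float((raw or "").split("/")[0].strip())
        except ValueError:
            return None

    # None (rather than 0) keeps parse failures distinct from genuinely low scores
    return {
        "output_score": parse_score(fields.get("OUTPUT_SCORE")),
        "trajectory_score": parse_score(fields.get("TRAJECTORY_SCORE")),
        "reasoning": fields.get("REASONING", ""),
    }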

LLM-as-judge has well-documented failure modes: it tends to prefer longer, more elaborate outputs over concise correct ones, and it tends to reward outputs that sound confident even when they're wrong. Use multiple judge calls with different phrasings and average the scores. For high-stakes evaluation, mix automated scoring with periodic human review on a random sample.
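Averaging across several judge phrasings can be sketched as follows. The three judges here are hypothetical stubs that return fixed scores so the example runs offline; in practice each one would wrap a model call with a differently worded prompt. Reporting the spread alongside the mean is our addition, as a cheap signal for when the judges disagree enough to warrant human review.

```python
from statistics import mean
from typing import Callable

# A judge takes (task, output) and returns a dict of scores.
Judge = Callable[[str, str], dict]


def average_judgments(task: str, output: str, judges: list[Judge]) -> dict:
    """Average each score across judges, skipping judges that returned None for it."""
    results = [judge(task, output) for judge in judges]
    summary: dict = {}
    for key in ("output_score", "trajectory_score"):
        scores = [r[key] for r in results if r.get(key) is not None]
        summary[key] = mean(scores) if scores else None
        summary[f"{key}_spread"] = (max(scores) - min(scores)) if scores else None
    return summary


# Hypothetical stub judges standing in for differently prompted model calls.
def strict_judge(task: str, output: str) -> dict:
    return {"output_score": 6.0, "trajectory_score": 7.0}


def lenient_judge(task: str, output: str) -> dict:
    return {"output_score": 8.0, "trajectory_score": 9.0}


def terse_judge(task: str, output: str) -> dict:
    return {"output_score": 7.0, "trajectory_score": None}  # unparseable score


summary = average_judgments(
    "Summarize the refund policy",
    "Refunds are available within 30 days.",
    [strict_judge, lenient_judge, terse_judge],
)
print(summary)
```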

Running evaluation in CI

Evaluation in CI means your test suite includes agent eval runs that gate deployment. This sounds expensive (agent tasks aren't free), so the practical approach is a tiered test set.

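# GitHub Actions workflow for agent evaluation
# --eval-budget, --min-success-rate, and --compare-baseline are custom options
# that your conftest.py must register; --timeout comes from the pytest-timeout plugin.
name: Agent Evaluation

on:
  pull_request:
    paths:
      - 'src/agents/**'
      - 'src/tools/**'

jobs:
  eval-smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies (assumes a requirements.txt; adjust to your project)
        run: pip install -r requirements.txt
      - name: Run smoke eval (fast, cheap)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m pytest tests/eval/smoke/ \
            --eval-budget=20 \
            --min-success-rate=0.85 \
            --timeout=120

  eval-regression:
    runs-on: ubuntu-latest
    if: github.event.pull_request.base.ref == 'main'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies (assumes a requirements.txt; adjust to your project)
        run: pip install -r requirements.txt
      - name: Run regression eval (full task set)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m pytest tests/eval/regression/ \
            --eval-budget=100 \
            --min-success-rate=0.90 \
            --compare-baseline=main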

The smoke suite covers 15–25 representative tasks and runs fast enough that developers don't skip it. The regression suite runs against the full task set on pull requests that target main and compares against a stored baseline. When the success rate drops by more than a threshold, the job fails and, if it is a required status check, blocks the merge.
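The --eval-budget, --min-success-rate, and --compare-baseline flags in the workflow are not built into pytest, so the test suite has to register them. A sketch of what that conftest.py might contain, using pytest's standard pytest_addoption hook. The defaults, the reading of "budget" as a maximum task count, and the `success_rate_gate` helper are all assumptions for illustration.

```python
# conftest.py (sketch): registers the custom eval flags used by the CI workflow.


def pytest_addoption(parser):
    """Standard pytest hook for adding command-line options."""
    parser.addoption(
        "--eval-budget", type=int, default=20,
        help="Assumed meaning: maximum number of eval tasks to run.",
    )
    parser.addoption(
        "--min-success-rate", type=float, default=0.85,
        help="Fail the eval if the task success rate falls below this value.",
    )
    parser.addoption(
        "--compare-baseline", type=str, default=None,
        help="Name of the stored baseline to compare against, if any.",
    )


def success_rate_gate(outcomes: list[bool], min_success_rate: float) -> tuple[bool, float]:
    """Hypothetical helper: return (passed, observed_rate) for a list of task outcomes."""
    rate = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return rate >= min_success_rate, rate
```

A test or fixture would then read the threshold with `request.config.getoption("--min-success-rate")` and assert on the result of `success_rate_gate`.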

Store baseline results as artifacts in your CI system, not hardcoded numbers. You want to detect regressions relative to what the agent was doing before the change, not relative to an arbitrary target you set months ago.
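A baseline comparison can be as small as the sketch below. It assumes the baseline is a JSON file holding the same metric names that run_evaluation_suite returns; the file layout, the `max_drop` default, and the function names are illustrative rather than a fixed convention.

```python
import json
from pathlib import Path


def load_baseline(path: Path) -> dict:
    """Read baseline metrics previously saved as a JSON artifact."""
    return json.loads(path.read_text())


def check_regression(current: dict, baseline: dict, max_drop: float = 0.03) -> list[str]:
    """Return one message per regressed metric; an empty list means the gate passes."""
    failures = []
    for metric in ("task_success_rate", "tool_call_accuracy"):
        if metric not in current or metric not in baseline:
            continue  # nothing to compare for this metric
        drop = baseline[metric] - current[metric]
        if drop > max_drop:
            failures.append(
                f"{metric} dropped by {drop:.3f} "
                f"(baseline {baseline[metric]:.3f}, current {current[metric]:.3f})"
            )
    return failures


# Illustrative numbers only.
baseline = {"task_success_rate": 0.91, "tool_call_accuracy": 0.97}
current = {"task_success_rate": 0.86, "tool_call_accuracy": 0.96}

failures = check_regression(current, baseline)
for message in failures:
    print(message)
```

In CI, a non-empty `failures` list would translate into a failing exit code, and a passing run on main would overwrite the stored baseline artifact.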

The full picture of how agent evaluation fits into a production deployment is covered in our guide to building production AI agents. For the architectural side of setting up reliable agent systems, see our AI agents expertise page.

Frequently Asked Questions

How many test cases do I need for a reliable agent evaluation?

Treat 30–50 tasks per agent type as a practical floor, and use 100+ for high-stakes agents. Even at 50 tasks, the 95% confidence interval around an 85% success rate is roughly ±10 percentage points, so smaller changes are indistinguishable from noise unless you add tasks or repeat runs. Fewer than 20 tasks produces noisy results; a single flaky task swings your numbers by 5 points or more, which makes regression detection unreliable.
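To see how much noise a given task count leaves, it helps to put a confidence interval around the measured success rate. This is a minimal sketch using the Wilson score interval, a standard formula for binomial proportions. It is not tied to any eval framework, and the task counts are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 gives about 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same 85% observed success rate, different task counts.
for successes, n in [(17, 20), (85, 100)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: {low:.3f} to {high:.3f}")
```

With 20 tasks, an observed 85% is compatible with a true rate anywhere from roughly 64% to 95%; with 100 tasks the range narrows to roughly 77% to 91%.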

Should I use the same model for judging that I use for the agent?

Generally, no. If your agent uses GPT-4o, use Claude or a fine-tuned judge model to evaluate it. Same-model judgment introduces bias: the judge will tend to rate the output highly if it matches patterns the base model finds plausible, which isn't the same as being correct. Cross-model judging tends to be better calibrated.

How do I evaluate an agent that has side effects — like sending emails or writing to a database?

Use sandboxed tool implementations in evaluation. Replace real tool calls with mock versions that record what would have happened without executing it. The agent's behavior is identical from its perspective; you evaluate the action sequence without the side effects. After evaluating action correctness, you can optionally run a smaller set of end-to-end tests in a staging environment with real tools.
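One way to build such a sandbox is a small recorder that hands the agent tool functions with the real names but canned results. This is a sketch: `RecordingSandbox`, the `send_email` tool, and its canned response are hypothetical, and a real agent framework would need the mocks registered through its own tool interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RecordingSandbox:
    """Stands in for real tools: records every call and returns canned results."""
    canned_results: dict[str, Any]
    calls: list[tuple[str, dict]] = field(default_factory=list)

    def tool(self, name: str) -> Callable[..., Any]:
        """Build a mock tool that logs its keyword arguments instead of acting."""
        def _call(**kwargs: Any) -> Any:
            self.calls.append((name, kwargs))
            return self.canned_results.get(name)
        return _call


sandbox = RecordingSandbox({"send_email": {"status": "queued"}})
send_email = sandbox.tool("send_email")

# The agent would call this exactly as it calls the real tool.
result = send_email(to="user@example.com", subject="Refund approved")

print(result)
print(sandbox.calls)
```

After the run, `sandbox.calls` holds the full action sequence, which is what the evaluator scores; no email was sent.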

What's a good task success rate target for a production agent?

It depends on task complexity and what failure means. Simple, well-scoped tasks (look up a record and return a field) should hit 97%+. Multi-step research tasks with ambiguous goals might reasonably target 75–85%. The key question is: what does a failure cost, and how often does the agent run? A 10% failure rate on an agent that runs 10,000 times a day is 1,000 failed tasks, probably unacceptable. The same rate on a tool used 20 times a day may be fine.

Can I reuse my RAG evaluation setup for agent evaluation?

Partially. Retrieval quality metrics (recall, precision, NDCG) apply to the agent's retrieval tool calls. But that's only one part of agent evaluation. You still need task-level success metrics, tool sequence evaluation, and cost tracking that RAG evaluation frameworks don't provide. Think of RAG eval as a component-level test within your broader agent evaluation suite.
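As a component-level check, the retrieval metrics can be applied directly to the results of a retrieval tool call. Here is a minimal sketch of precision and recall at k; the document IDs and the relevance labels are made up for illustration.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved document IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative retrieval tool result and hand-labeled relevant documents.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d4"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```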

Want Laxaar to set up an evaluation pipeline for your agent system before it ships? Reach out. We can instrument your traces, build your test suite, and set the success-rate thresholds that match your production risk tolerance.