Building Production AI Agents: A Field Guide

The gap between a demo agent and a production agent is wider than most teams expect. A prototype that impresses in a controlled environment will fail in ways you didn't anticipate once real users run it on inputs you never considered. Production AI agents face tool timeouts, ambiguous inputs, partial failures, context window exhaustion, and model behavior that shifts with API updates. None of these show up in demos.

Production AI agents are autonomous systems that handle real tasks with real consequences, reliably, across a range of inputs, without human supervision on every run. The "reliably" qualifier is the hard part. A prototype is allowed to fail gracefully on edge cases. A production agent needs to fail gracefully, log what happened, and ideally recover, all without losing user trust.

At Laxaar we've taken several agent systems from demo to production. The patterns in this guide come from that process, including the failures we didn't anticipate until they happened.

Define reliability before you build

Before writing any agent code, define what "working correctly" means for your specific system. This sounds obvious. Most teams skip it anyway, and it costs them later.

For a document processing agent, reliability might mean: completes 95% of documents without human intervention, never silently drops a field, escalates to a human when confidence is below a threshold. For a customer support agent it might mean: resolves 70% of tickets autonomously, never sends a response that's factually wrong about the product, always escalates billing disputes.

Write these criteria down. They become your evaluation targets, your alerting thresholds, and your deployment gates. Without them, "production ready" is whatever you say it is on a given day, which means it means nothing.

One sharp opinion worth stating: most teams set reliability targets too high for the first production deployment. An agent that resolves 60% of tickets autonomously and escalates the rest cleanly is genuinely useful. Waiting until you can hit 90% before shipping anything means waiting a long time, and you won't learn what the real failure modes are until you've run against real traffic anyway.

Tool design for production agents

Tools are the main surface area where production agents fail. A tool that works fine in development will time out under load, return unexpected data shapes when the upstream API changes, or throw exceptions the agent doesn't know how to handle.

# Production-grade tool wrapper with timeout, retry, and structured errors
import asyncio
import functools
from typing import TypedDict
import httpx

class ToolResult(TypedDict):
    success: bool
    data: dict | str | None
    error: str | None
    retry_suggested: bool

def production_tool(timeout_seconds: float = 10.0, max_retries: int = 2):
    """Decorator that wraps any tool function with timeout, retry, and error normalization."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs) -> ToolResult:
            last_error = None
            for attempt in range(max_retries + 1):
                try:
                    result = await asyncio.wait_for(
                        fn(*args, **kwargs),
                        timeout=timeout_seconds,
                    )
                    return ToolResult(success=True, data=result, error=None, retry_suggested=False)
                except asyncio.TimeoutError:
                    last_error = f"Tool timed out after {timeout_seconds}s (attempt {attempt + 1})"
                    if attempt < max_retries:
                        await asyncio.sleep(2 ** attempt)  # exponential backoff
                except httpx.HTTPStatusError as e:
                    last_error = f"HTTP {e.response.status_code}: {e.response.text[:200]}"
                    if e.response.status_code < 500:
                        # 4xx errors won't improve on retry
                        break
                    if attempt < max_retries:
                        await asyncio.sleep(2 ** attempt)
                except Exception as e:
                    last_error = f"Unexpected error: {type(e).__name__}: {str(e)}"
                    break  # Don't retry unknown errors

            return ToolResult(
                success=False,
                data=None,
                error=last_error,
                retry_suggested=False,
            )
        return wrapper
    return decorator

@production_tool(timeout_seconds=8.0, max_retries=2)
async def fetch_customer_record(customer_id: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://api.internal/customers/{customer_id}",
            headers={"Authorization": f"Bearer {get_api_key()}"},
        )
        response.raise_for_status()
        return response.json()

The retry_suggested field lets the agent decide whether to retry vs. report failure to the user. That decision belongs to the agent's reasoning layer, not the tool. Tools should return structured errors that the agent can reason about, not raw Python exceptions.

Schema validation is the other critical piece. Define your tool inputs and outputs with Pydantic models and validate both directions. When an upstream API changes its response shape, you want an immediate validation error rather than the agent silently reasoning about malformed data for three more steps.

Error recovery patterns

Production agents need explicit error recovery strategies, not just "the model will figure it out." Here are the three patterns Laxaar uses most:

Graceful degradation. The agent continues with reduced functionality when a non-critical tool fails. If the enrichment API is down, proceed with the data you have and flag the gap in the output. Requires the agent's system prompt to explicitly describe which tools are optional vs. required.

Retry with reformulation. When a tool call fails due to a bad argument (not a network error), the agent reformulates the call based on the error message. This requires structured error returns; a raw stack trace doesn't give the model enough to reason about.

Escalation. When the agent can't make progress after a defined number of retries, it stops, explains what it tried and why it failed, and hands off to a human or a fallback system. This is often the right answer and teams under-design it.

# Agent loop with explicit error recovery in LangGraph (v0.2+)
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_errors: list[dict]
    consecutive_failures: int
    should_escalate: bool

MAX_CONSECUTIVE_FAILURES = 3

def should_continue(state: AgentState) -> str:
    if state["should_escalate"]:
        return "escalate"
    if state["consecutive_failures"] >= MAX_CONSECUTIVE_FAILURES:
        return "escalate"
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return END

def track_tool_errors(state: AgentState, tool_results: list) -> dict:
    new_failures = sum(1 for r in tool_results if not r.get("success", True))
    consecutive = state["consecutive_failures"] + new_failures if new_failures > 0 else 0
    return {
        "consecutive_failures": consecutive,
        "tool_errors": state["tool_errors"] + [r for r in tool_results if not r.get("success")],
    }

The consecutive_failures counter is important. Random isolated tool failures are noise. Three consecutive failures on the same task usually signal that the agent is stuck in a loop, a state it won't escape without intervention.

Context and memory management at scale

Context window management is the reliability issue that surprises teams most. An agent that works perfectly on short tasks will fail unpredictably on long ones when the context fills. You need a strategy before this happens, not after.

The pattern we reach for: maintain a rolling context with a "task journal," a compressed summary of completed steps prepended to the context, with only recent tool calls kept in full detail. When the context approaches a token budget threshold, trigger summarization automatically.

# Context budget management with automatic summarization
from anthropic import Anthropic

client = Anthropic()

class ContextManager:
    def __init__(self, max_tokens: int = 80_000, summarize_at: float = 0.75):
        self.max_tokens = max_tokens
        self.summarize_threshold = int(max_tokens * summarize_at)
        self.journal: list[str] = []
        self.recent_messages: list[dict] = []

    def estimate_tokens(self, messages: list[dict]) -> int:
        # Rough estimate: 4 chars per token
        return sum(len(str(m)) // 4 for m in messages)

    def maybe_summarize(self, system_prompt: str) -> str:
        """Returns updated system prompt with journal entries if context is full."""
        current_tokens = self.estimate_tokens(self.recent_messages)
        if current_tokens < self.summarize_threshold:
            return system_prompt

        # Ask the model to summarize completed steps
        summary_response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=512,
            system="Summarize the completed steps and key findings from this agent session in 3-5 bullet points. Be specific about what was done and what was found.",
            messages=self.recent_messages[-10:],  # Last 10 messages for summary
        )
        summary = summary_response.content[0].text
        self.journal.append(summary)

        # Keep only the last 5 messages in full
        self.recent_messages = self.recent_messages[-5:]

        journal_text = "\n\n".join(f"[Session journal entry {i+1}]\n{j}" for i, j in enumerate(self.journal))
        return f"{system_prompt}\n\n## Progress so far\n{journal_text}"

For persistent memory across sessions, pair this with a vector store for episodic retrieval. That pattern is covered in depth in our agent memory systems guide.

Observability and tracing

You can't debug what you can't see. Production agents need structured traces that capture every LLM call, every tool call, latency, token counts, and errors. Logging "the agent ran" is not observability.

# Structured trace logging with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
import time

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.tracer")

class InstrumentedAgent:
    def run_task(self, task_id: str, task_input: str) -> dict:
        with tracer.start_as_current_span("agent.task") as span:
            span.set_attribute("task.id", task_id)
            span.set_attribute("task.input_length", len(task_input))
            start = time.monotonic()
            
            try:
                result = self._execute(task_input)
                span.set_attribute("task.outcome", "success")
                span.set_attribute("task.steps", result.get("steps", 0))
                return result
            except Exception as e:
                span.set_attribute("task.outcome", "failure")
                span.set_attribute("task.error", str(e))
                span.record_exception(e)
                raise
            finally:
                span.set_attribute("task.duration_ms", (time.monotonic() - start) * 1000)

    def _call_tool(self, tool_name: str, args: dict) -> dict:
        with tracer.start_as_current_span(f"tool.{tool_name}") as span:
            span.set_attribute("tool.name", tool_name)
            span.set_attribute("tool.args_keys", list(args.keys()))
            start = time.monotonic()
            
            result = self._execute_tool(tool_name, args)
            
            span.set_attribute("tool.success", result.get("success", False))
            span.set_attribute("tool.duration_ms", (time.monotonic() - start) * 1000)
            return result

Export these traces to whatever observability stack you already use: Datadog, Grafana, Honeycomb. The structured spans make it possible to query "show me all tasks where tool X failed" or "what's the p99 latency on step 3 of this workflow." That kind of query is what turns a production incident from a 3-hour investigation into a 10-minute one.

Set up alerts on task success rate dropping below threshold, p99 latency exceeding budget, and cost-per-task crossing a daily spend limit. These three alerts catch most production problems early.

Deployment and model version management

Model API updates are a deployment risk that doesn't exist in traditional software. When OpenAI or Anthropic updates a model, behavior can shift in ways that don't show up in the changelog. An agent that was hitting 88% task success on gpt-4o-2024-11-20 might hit 82% on the next version with no code changes.

The practices that help:

Pin model versions. Use claude-opus-4-5 or gpt-4o-2024-11-20, not gpt-4o-latest. Update versions deliberately after running your eval suite, not automatically. Yes, this means you occasionally miss improvements. The stability is worth it.

Canary deploys for model updates. When updating a pinned model version, route 10% of traffic to the new version and compare success rates, latency, and cost before promoting. This is standard deployment practice for code; apply it to model updates too.

Separate config from code. System prompts, tool schemas, and model parameters belong in version-controlled config files, not hardcoded in the agent class. This lets you update prompt wording without a code deploy, and it makes diffs readable.

# agent-config.yaml — version controlled alongside code
model:
  provider: anthropic
  name: claude-opus-4-5
  temperature: 0.0  # determinism over creativity for task agents
  max_tokens: 4096

tools:
  timeout_seconds: 10
  max_retries: 2
  
reliability:
  max_consecutive_failures: 3
  context_budget_tokens: 80000
  summarize_at_fraction: 0.75

evaluation:
  min_success_rate: 0.90
  baseline_task_set: eval/tasks/regression-v3.json

For the full picture of how evaluation integrates with production deployment, see our guide to evaluating AI agents. The AI agents expertise page covers how Laxaar structures end-to-end agent delivery engagements.

Frequently Asked Questions

When should I use an agent vs. a simpler LLM pipeline?

Use an agent when the task requires decision-making about which steps to take and in what order, and that decision depends on intermediate results. If the steps are fixed and predictable, a deterministic pipeline with LLM calls at specific nodes is more reliable, cheaper, and easier to test. Agents are the right tool for genuinely dynamic workflows, not for giving structure to a fixed sequence.

How do I handle an agent that loops indefinitely?

Set a hard step limit and enforce it in your agent loop, not as a suggestion in the system prompt. If the agent hasn't completed the task in N steps, halt and return the partial state with an explanation. Also implement a "same tool call twice" detector: if the agent calls the same tool with the same arguments twice in a row, it's almost certainly stuck. Halt and escalate rather than letting it continue.

Should the agent handle its own errors or should I handle them in the orchestration layer?

Both layers need error handling, but for different reasons. Tool-level errors (timeouts, bad responses) should be caught in the tool wrapper and returned as structured error objects. The agent reasons about whether to retry or escalate. Orchestration-level errors (the agent itself crashed, context window exceeded) should be caught by the runner and handled with escalation logic the agent can't execute for itself. Don't make the agent responsible for recovering from its own execution failures.

What's the right way to update prompts in production?

Treat prompt updates as code changes. Version-control your prompts, write a summary of what changed and why, and run your eval suite before deploying. Changes that look minor often aren't. Adding a single sentence to a system prompt can shift behavior on edge cases you didn't test. A/B test significant prompt changes by routing a fraction of traffic to the new prompt and comparing metrics before full rollout.

How do I manage costs when an agent runs thousands of times per day?

Set per-task token budgets enforced in code, not just hoped for in the prompt. Implement caching for repeated tool calls (a customer lookup called three times in one session should only hit the API once). Use cheaper models for classification and routing decisions within the agent flow, reserving the expensive model for steps that need it. Track cost-per-task daily and alert when it drifts. Cost increases are usually the first signal that an agent has started doing extra work.

Ready to take your agent from prototype to production? Talk to Laxaar. We can assess your system's production readiness and identify the failure modes before your users do.