Agent Planning: ReAct vs Plan-and-Execute Tradeoffs

Agent planning is the decision about how your agent decides what to do next. Get it wrong and you ship a system that works impressively in a five-step demo, then spirals into token-burning loops when a real user hands it a fifteen-step task. The failure isn't in the model — it's in the planning architecture.

Two patterns dominate production agentic AI systems today: ReAct (Reason + Act, alternating between thought and tool use) and plan-and-execute (generate a full plan upfront, then execute each step). Both work. Both fail. And the failure modes are completely different, which is why choosing between them matters far more than most teams realize before they've hit a production incident.

At Laxaar we've deployed both patterns across research automation, document processing, and workflow orchestration systems. Our blunt assessment: ReAct is easier to start with and harder to operate at scale; plan-and-execute takes more upfront design and is significantly cheaper and more debuggable once a task exceeds five or six steps.

What ReAct agents actually do

ReAct is an agent planning pattern where the model interleaves reasoning ("Thought:") with tool use ("Action:") and observation of results ("Observation:") in a tight loop. Each step is one LLM call. The model sees the full history of prior thoughts, actions, and observations, then decides what to do next.

# Minimal ReAct loop with LangGraph (v0.2+)
from langgraph.graph import StateGraph, MessagesState
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

tools = [search_web, read_document, write_file]
llm_with_tools = llm.bind_tools(tools)

def agent(state: MessagesState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", tools_condition)
graph.add_edge("tools", "agent")

app = graph.compile()

The appeal is real. ReAct agents adapt dynamically. If a tool call returns unexpected data, the next thought can revise the approach. If the user's request turns out to be ambiguous mid-task, the agent can ask a clarifying question. There's no rigid plan to maintain or update.

The cognitive overhead stays on the model, not on you. That's why ReAct demos well: hand it a task, watch it reason its way through, it feels intelligent.

What plan-and-execute agents actually do

Plan-and-execute splits the job into two phases. First, a planner LLM call generates a complete, ordered list of steps. Then an executor handles each step, often with a smaller or cheaper model. The plan is the contract; execution fills it in.

# Plan-and-execute with LangGraph
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel
from typing import List

class Plan(BaseModel):
    steps: List[str]

planner_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a precise planner. Break the user task into ordered, concrete steps. Each step must be independently executable. Output only the step list."),
    ("user", "{task}"),
])

planner = planner_prompt | llm.with_structured_output(Plan)

def plan_step(state: dict) -> dict:
    plan = planner.invoke({"task": state["task"]})
    return {"plan": plan.steps, "current_step": 0}

def execute_step(state: dict) -> dict:
    step = state["plan"][state["current_step"]]
    # Use a cheaper executor model per step
    result = executor_llm.invoke(f"Complete this step: {step}\nContext: {state['context']}")
    return {
        "results": state["results"] + [result.content],
        "current_step": state["current_step"] + 1,
        "context": state["context"] + f"\nStep {state['current_step'] + 1} result: {result.content}",
    }

The executor doesn't need to carry the full reasoning history of the task. It only needs the current step and the accumulated context from prior steps. That separation is where the cost savings come from.

Where each pattern breaks down

ReAct's failure mode is the reasoning loop. Without an external forcing function, the model can cycle through thought-action-observation sequences that make local progress but lose the global goal. We've seen ReAct agents convince themselves after seven steps that the task requires a completely different approach — and start over, burning tokens on a new path that hits the same dead end.

ReAct also degrades as context grows. By step 12 of a 20-step task, the model is reasoning over a token-heavy history that includes every prior thought and tool output. The quality of reasoning drops, not because the model is worse, but because signal-to-noise in the context window decreases.

Plan-and-execute fails differently. The planner can generate a plan that's wrong from step one — and the executor dutifully executes each wrong step. A plan that assumes data exists in a certain format, or that a required API is available, or that an intermediate result will have a specific shape, will produce a coherent sequence of confident failures.

Plan-and-execute also handles uncertainty poorly. If you don't know exactly what you'll find until you look, a rigid upfront plan can become irrelevant by step three. This is the pattern's real limitation and why teams reflexively reach for ReAct: most interesting tasks have some discovery component.

Cost and latency comparison

This is where the numbers decide things for most production systems.

Dimension	ReAct	Plan-and-Execute
LLM calls per task	One per step (full context each time)	1 planner + N smaller executor calls
Context per call	Grows with task length	Stays lean per executor step
Token cost (10-step task)	High — full history per step	Lower — planner once, executor steps stay scoped
Latency	Sequential, full model each step	Planner blocks, executor steps can parallelize
Model flexibility	Single model handles all steps	Planner can use a capable model; executor uses a cheap one

A 10-step ReAct task running on Claude Sonnet might cost 3-5x more than the equivalent plan-and-execute run, because each ReAct step re-processes the full accumulated context. At low volume that's noise. At 10,000 tasks per day, it's a line item your CFO will notice.

Plan-and-execute also opens up step-level parallelism. If steps 3 and 4 don't depend on each other, an executor can run them simultaneously. ReAct can't do this at all — each step waits on the prior observation.

The honest trade-off: plan-and-execute pays a planning latency tax upfront. If the planner call takes 3 seconds and the task only has two steps, ReAct is faster. The crossover point in our experience is around four to five steps.

Debuggability and observability

ReAct agents are notoriously hard to debug. When a ReAct run fails at step 9, you have to trace through eight interleaved thought-action-observation blocks to understand how the agent reached a bad state. The reasoning is human-readable in isolation but tangled in practice.

Plan-and-execute gives you a concrete artifact to inspect: the plan. When something fails, you check whether the plan was sensible before execution started. If the plan was reasonable and execution failed, you isolate the bad executor step. The debugging surface is smaller because the concern separation is cleaner.

For the AI agent observability tooling we set up on Laxaar projects, plan-and-execute traces are significantly easier to parse. Each executor step maps to a discrete span with a clear input (the step description plus context) and output (the step result). Langfuse, Arize Phoenix, and Helicone all surface this structure cleanly when you instrument at the step level.

ReAct traces are a single long span with nested tool calls. You can instrument it, but extracting "where did the agent go wrong" from a 40-turn trace requires purpose-built tooling or a lot of manual scrolling.

How to choose for your use case

The decision rule we apply at Laxaar is based on task variance and step count.

Use ReAct when:

The task has fewer than five steps.
You can't enumerate the steps until you see intermediate results.
The domain is exploratory: research, open-ended investigation, or tasks where the user's intent clarifies through interaction.
Failure recovery needs to happen dynamically within the run.

Use plan-and-execute when:

The task has six or more well-defined steps.
The task shape is predictable enough for a planner to generate a valid sequence upfront.
Cost and token efficiency matter (high volume, cost-sensitive workloads).
You need parallel step execution.
Reproducibility and auditability are requirements (the plan is a record of intent).

The single most common mistake is choosing ReAct by default because it's simpler to implement. It is simpler at first. The complexity catches up in production when your 15-step agent starts billing $0.40 per run instead of the expected $0.08.

For teams building agentic AI systems, the question to ask is: "Can a competent person write a checklist for this task before starting?" If yes, plan-and-execute is worth the design investment. If the task genuinely requires adaptive reasoning at each step before you know what the next step is, ReAct earns its cost.

Hybrid approaches worth knowing

The binary framing is useful for decision-making but real production systems often land somewhere between the two.

Replanning: Start with plan-and-execute, but give the executor the ability to trigger a replanning call if it encounters something the original plan didn't account for. This keeps the cost efficiency of plan-and-execute while recovering ReAct's adaptability for the unexpected cases.

Hierarchical planning: A high-level planner breaks the task into phases; a sub-planner for each phase generates step-level detail. The LLM at each level reasons over a smaller, more focused context. This scales to tasks that are too complex for a single planning call without falling back to ReAct's step-by-step improvisation.

Bounded ReAct: Run ReAct but with a hard step budget. At step N (say, 8), force a summary of progress and a replanning decision: continue with the current approach or reset with a plan. This prevents the runaway reasoning loops that make ReAct expensive at scale.

# Bounded ReAct with forced checkpoint
MAX_STEPS = 8

def should_replan(state: MessagesState) -> str:
    step_count = sum(1 for m in state["messages"] if m.type == "tool")
    if step_count >= MAX_STEPS:
        return "replan"
    return tools_condition(state)  # normal ReAct routing

graph.add_conditional_edges("agent", should_replan, {
    "replan": "planner",
    "tools": "tools",
    "__end__": "__end__",
})

These hybrids are worth knowing about, but we'd caution against reaching for them before you've validated that a simpler pattern doesn't work. Build the simplest thing first, measure its failure modes, then graduate to a hybrid.

The custom software development lens applies here too: the best agent architecture is the one that solves the actual problem, not the one that looks most sophisticated in a diagram.

Frequently Asked Questions

Is ReAct always more expensive than plan-and-execute?

Not always. For short tasks (three steps or fewer), ReAct can be cheaper because you skip the planner call overhead. The economics shift once task length grows, because ReAct re-processes the full accumulated context on every step. For tasks of six or more steps, plan-and-execute is almost always more cost-efficient.

Can plan-and-execute handle tasks where you don't know all the steps upfront?

With replanning, yes. The planner generates a best-effort plan based on available information; if execution reveals that the plan is no longer valid, a replanning call generates a revised sequence. This pattern captures most of plan-and-execute's cost benefits while handling the uncertainty that makes ReAct attractive.

How do I instrument ReAct agents for observability?

Treat each thought-action-observation triplet as a logical step and create a parent span for the full run with child spans per triplet. Log the model's reasoning content (the "thought") alongside the tool call and result. LLM observability tools like Langfuse support this natively via their trace and span APIs. The goal is to be able to replay a failing run from the trace alone, without having to re-run it.

Does the choice of planning pattern affect which LLM I should use?

Yes. Plan-and-execute lets you use different models for planning and execution. A capable model (Claude Sonnet or GPT-4o) handles planning, where nuanced reasoning matters. A smaller, faster, cheaper model handles execution, where you're filling in a discrete step with bounded context. ReAct uses one model for everything, which means you're either over-spending on executor steps or under-powering the reasoning steps.

What's the biggest mistake teams make when implementing these patterns?

For ReAct: shipping without a step budget or context trimming, then watching costs and latency balloon in production. For plan-and-execute: building the planner without structured output (using JSON schema or a Pydantic model), so the executor gets plans it can't reliably parse. Always use structured output for plan generation. A plan that's valid JSON but semantically wrong is at least debuggable; a plan that's malformed text fails unpredictably.

Building an agent system and not sure which planning pattern fits your task? Talk to the Laxaar team — we can review your use case and help you avoid the architecture choices that look fine in a demo but break under production load.