Agent Planning Techniques for Reliable Execution

An agent without a planning strategy is just an LLM with tools. It picks the next action based on what looks right in the moment, and for tasks longer than 5–6 steps, "looks right in the moment" reliably produces drift, loops, and incomplete work. We've debugged enough production failures at Laxaar to say this with confidence: planning technique is as important as model choice.

Agent planning techniques are the methods an agent uses to decompose a goal into executable steps, decide the order of operations, and recover when a step doesn't go as expected. The right technique depends on task complexity, how reversible the actions are, and how much latency you can absorb.

This article covers four techniques you'll actually use in production (chain-of-thought decomposition, tree-of-thought search, task graph planning, and Monte Carlo Tree Search) with honest notes on cost and where each breaks.

What you'll learn

Why planning matters for agent reliability
Chain-of-thought decomposition: the baseline
Tree-of-thought: exploring multiple paths
Task graph planning: explicit dependencies
Monte Carlo Tree Search for agent decisions
Planning technique comparison
Frequently Asked Questions

Why planning matters for agent reliability

Planning is the gap between knowing what the goal is and knowing what to do next. For a 2-step task, an LLM bridges this gap intuitively. For a 20-step task with branching conditions, a model without an explicit plan loses track of where it is relative to the overall goal, repeats steps it has already completed, or takes actions that contradict earlier decisions.

The failure mode isn't usually a catastrophic crash. It's subtle drift: the agent completes something, but not quite the right thing, or completes 18 of 20 steps and stalls on step 19 because it's forgotten the context from step 3. These failures are expensive to debug because the agent's behavior looks plausible at each individual step.

Three signals tell you a planning technique is needed: tasks with more than 8 steps, tasks where one action affects what subsequent actions should be, and tasks where incorrect intermediate steps are costly or irreversible.

Chain-of-thought decomposition: the baseline

Chain-of-thought (CoT) decomposition prompts the model to produce an explicit step list before taking any action. It's the simplest planning technique and the right starting point for most agent systems.

# Chain-of-thought planning with structured output (LangChain v0.2+)
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class ExecutionPlan(BaseModel):
    goal: str = Field(description="Restatement of the user's goal")
    steps: list[str] = Field(description="Ordered list of concrete steps to accomplish the goal")
    success_criteria: str = Field(description="How to know the task is complete")
    risks: list[str] = Field(description="Potential failure points and how to handle them")

llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(ExecutionPlan)

planner_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a precise task planner. Given a goal, produce a concrete step-by-step plan.
Each step must be specific enough that an agent can execute it without ambiguity.
Identify risks that could cause steps to fail and note how to handle them."""),
    ("human", "Goal: {goal}\n\nAvailable tools: {tool_descriptions}")
])

planner = planner_prompt | structured_llm

plan = planner.invoke({
    "goal": "Research competitors in the CRM space and produce a comparison report",
    "tool_descriptions": "web_search, read_url, write_file, create_table"
})

print(f"Steps: {len(plan.steps)}")
for i, step in enumerate(plan.steps, 1):
    print(f"  {i}. {step}")

The structured output format matters. A plan that returns a flat list of strings gives the executor no information about dependencies or failure handling. Adding success_criteria and risks fields forces the model to think through the plan more carefully and gives your execution layer something to check against.

CoT decomposition has one notable weakness: it doesn't explore alternatives. The model generates one plan, and if that plan is wrong or hits an unexpected blocker, you need replanning logic. For most tasks, this is fine. Add a replanning node that fires when a step fails and generates a revised plan from the current state.
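A minimal sketch of that replanning loop follows. The names `execute_with_replanning`, `run_step`, and `replan` are hypothetical: in a real system `run_step` would call a tool and `replan` would wrap an LLM call that returns the revised remaining steps. The cap on replans is an added assumption to stop a persistently failing plan from looping.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    step: str
    output: str


def build_replan_context(goal: str, completed: list[StepResult],
                         failed_step: str, reason: str) -> str:
    """Summarize the current state as text for the replanner prompt."""
    done = "\n".join(f"- {r.step} -> {r.output}" for r in completed) or "(none)"
    return (
        f"Goal: {goal}\n"
        f"Completed steps:\n{done}\n"
        f"Failed step: {failed_step}\n"
        f"Failure reason: {reason}\n"
        "Return only the remaining steps."
    )


def execute_with_replanning(
    goal: str,
    steps: list[str],
    run_step: Callable[[str], str],        # hypothetical: executes one step
    replan: Callable[[str], list[str]],    # hypothetical: context -> remaining steps
    max_replans: int = 2,                  # assumption: give up after a few revisions
) -> list[StepResult]:
    completed: list[StepResult] = []
    remaining = list(steps)
    replans = 0
    while remaining:
        step = remaining.pop(0)
        try:
            output = run_step(step)
        except Exception as exc:
            if replans >= max_replans:
                raise  # Surface the failure instead of replanning forever
            replans += 1
            context = build_replan_context(goal, completed, step, str(exc))
            remaining = list(replan(context))  # Replace the rest of the plan
            continue
        completed.append(StepResult(step, output))
    return completed
```

Completed steps are never re-run: the replanner only replaces what is left, which keeps the revised plan anchored to work already done.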

Tree-of-thought: exploring multiple paths

Tree-of-thought (ToT) extends chain-of-thought by generating multiple candidate plans (or multiple candidate next steps), evaluating each, and selecting the best one. It's more expensive (with 3–5 candidates plus an evaluation call, you're making 4–6 LLM calls where CoT makes one), but it produces better plans for tasks where the first idea is often not the best one.

# Tree-of-thought: generate, evaluate, select
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
import asyncio

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)  # Higher temp for diversity
evaluator_llm = ChatOpenAI(model="gpt-4o", temperature=0)

async def generate_candidate_plans(goal: str, tools: str, n: int = 3) -> list[str]:
    """Generate n diverse candidate plans for the same goal."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Generate a concrete step-by-step execution plan. Be specific about tool usage."),
        ("human", "Goal: {goal}\nTools: {tools}\nApproach #{approach_num}: think of a distinct way to accomplish this.")
    ])
    
    tasks = [
        (prompt | llm).ainvoke({"goal": goal, "tools": tools, "approach_num": i})
        for i in range(1, n + 1)
    ]
    responses = await asyncio.gather(*tasks)
    return [r.content for r in responses]

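from pydantic import BaseModel, Field

class PlanChoice(BaseModel):
    best_index: int = Field(description="0-based index of the best plan")
    reasoning: str = Field(description="Why this plan scored highest")

async def evaluate_plans(goal: str, plans: list[str]) -> str:
    """Score each plan and return the best one."""
    evaluator_prompt = ChatPromptTemplate.from_messages([
        ("system", """Evaluate these candidate plans for completing a goal.
Score each on: completeness (1-5), efficiency (1-5), risk handling (1-5).
Return the index (0-based) of the best plan and explain why."""),
        ("human", "Goal: {goal}\n\nCandidates:\n{candidates}")
    ])

    candidates_text = "\n\n---\n\n".join(
        f"Plan {i}:\n{plan}" for i, plan in enumerate(plans)
    )

    # Structured output avoids parsing the index out of free text, where the
    # first "Plan N" mentioned is not necessarily the one the evaluator chose
    evaluator = evaluator_prompt | evaluator_llm.with_structured_output(PlanChoice)
    choice = await evaluator.ainvoke({
        "goal": goal,
        "candidates": candidates_text,
    })

    # Fall back to the first plan if the model returns an out-of-range index
    best_idx = choice.best_index if 0 <= choice.best_index < len(plans) else 0
    return plans[best_idx]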

async def tree_of_thought_plan(goal: str, tools: str) -> str:
    candidates = await generate_candidate_plans(goal, tools, n=3)
    best_plan = await evaluate_plans(goal, candidates)
    return best_plan

ToT is worth the cost when mistakes are expensive. If a wrong plan triggers irreversible API calls (sending emails, posting to external systems, modifying production data), spending 4–6x on planning to get a better plan upfront is cheap insurance. Don't use it for quick retrieval tasks where any reasonable plan works fine.

The evaluation step is where most implementations go wrong. Evaluating plans in the abstract (without knowing what the tools actually return) is hard. Better evaluators check structural properties: does the plan handle the most common failure case? Does it use the cheapest tool first? Is there a verification step before any write operations?
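One of those structural checks, a verification step before any write, can be sketched without an LLM. This is a keyword heuristic under loud assumptions: the `WRITE_TOOLS` set and `VERIFY_HINTS` tuple are hypothetical and would need to match your own tool names and planner vocabulary.

```python
# Hypothetical names: adjust both collections to your own tools and planner wording
WRITE_TOOLS = {"write_file", "send_email", "post_message"}
VERIFY_HINTS = ("verify", "check", "validate", "confirm")


def structural_issues(steps: list[str], write_tools=WRITE_TOOLS) -> list[str]:
    """Flag write steps that appear before any verification step."""
    issues = []
    verified = False
    for i, step in enumerate(steps, 1):
        text = step.lower()
        if any(hint in text for hint in VERIFY_HINTS):
            verified = True
        if not verified and any(tool.lower() in text for tool in write_tools):
            issues.append(f"step {i} calls a write tool before any verification step")
    return issues
```

A plan with an empty issue list is not necessarily good, but a non-empty list is a cheap signal to penalize or reject a candidate before the LLM evaluator weighs in.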

Task graph planning: explicit dependencies

Task graph planning represents the task as a directed acyclic graph (DAG) where nodes are subtasks and edges are dependencies. Tasks with no incoming edges can run immediately; tasks with incoming edges wait for their dependencies to complete. This enables real parallelism.

# Task graph planning with LangGraph (v0.2+)
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class TaskGraphState(TypedDict):
    goal: str
    task_graph: dict  # {task_id: {"description": str, "deps": list[str], "status": str, "result": str}}
    completed_tasks: Annotated[list, operator.add]

def build_task_graph(goal: str, llm) -> dict:
    """Ask the LLM to produce a dependency graph for the goal."""
    from langchain_core.prompts import ChatPromptTemplate
    from pydantic import BaseModel
    
    class Task(BaseModel):
        id: str
        description: str
        dependencies: list[str]  # IDs of tasks that must complete first
    
    class TaskGraph(BaseModel):
        tasks: list[Task]
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", """Decompose the goal into parallel tasks with dependencies.
Tasks with no dependencies can run in parallel.
Only add a dependency when a task truly needs output from another task."""),
        ("human", "Goal: {goal}")
    ])
    
    graph = (prompt | llm.with_structured_output(TaskGraph)).invoke({"goal": goal})
    
    return {
        task.id: {
            "description": task.description,
            "deps": task.dependencies,
            "status": "pending",
            "result": None,
        }
        for task in graph.tasks
    }

def get_ready_tasks(task_graph: dict) -> list[str]:
    """Return IDs of pending tasks whose dependencies are all complete."""
    return [
        task_id
        for task_id, task in task_graph.items()
        if task["status"] == "pending"
        # An unknown dependency ID (the LLM may invent one) keeps the task
        # blocked instead of raising KeyError
        and all(
            dep in task_graph and task_graph[dep]["status"] == "complete"
            for dep in task["deps"]
        )
    ]

Task graph planning is the architecture that enables genuine parallelism: tasks with no shared dependencies execute concurrently, so wall-clock time is bounded by the longest dependency chain rather than the sum of all tasks. A research task that decomposes into 4 independent subtasks of similar duration (search for X, search for Y, look up Z, retrieve document W) completes in roughly 1/4 the time of a sequential plan, provided the tools and rate limits allow all four to run at once.

The engineering overhead is real. You need a runtime that can execute DAG nodes concurrently, propagate results between dependent nodes, and handle partial failures (one node fails; what do the dependent nodes do?). LangGraph handles this well with its parallel branch support. The Laxaar team uses task graph planning specifically for research and data-gathering workflows where task independence is high and latency matters.
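To make the partial-failure question concrete, here is a small sketch of a DAG runner built on the standard library rather than LangGraph. It assumes the task-graph dict shape used above (`deps`, `status`, `result`); `run_task` is a hypothetical async callable, and the policy of marking downstream tasks as "skipped" is one option among several (retrying or replanning are others).

```python
import asyncio
from graphlib import TopologicalSorter


async def run_task_graph(task_graph: dict, run_task) -> dict:
    """Run ready tasks concurrently; skip anything downstream of a failure.

    run_task is a hypothetical async callable: (task_id, task) -> result.
    """
    known = set(task_graph)
    unknown = {d for t in task_graph.values() for d in t["deps"]} - known
    if unknown:
        raise ValueError(f"unknown dependency IDs: {sorted(unknown)}")

    sorter = TopologicalSorter({tid: t["deps"] for tid, t in task_graph.items()})
    sorter.prepare()  # Raises graphlib.CycleError if the graph is not a DAG

    while sorter.is_active():
        ready = list(sorter.get_ready())
        runnable = []
        for tid in ready:
            deps = task_graph[tid]["deps"]
            if all(task_graph[d]["status"] == "complete" for d in deps):
                runnable.append(tid)
            else:
                # A dependency failed or was skipped, so this task cannot run
                task_graph[tid]["status"] = "skipped"

        outcomes = []
        if runnable:
            outcomes = await asyncio.gather(
                *(run_task(tid, task_graph[tid]) for tid in runnable),
                return_exceptions=True,  # One failure must not cancel siblings
            )
        for tid, outcome in zip(runnable, outcomes):
            if isinstance(outcome, Exception):
                task_graph[tid].update(status="failed", result=str(outcome))
            else:
                task_graph[tid].update(status="complete", result=outcome)

        sorter.done(*ready)  # Unblock tasks that depended on this wave
    return task_graph
```

This version runs in waves, so each wave waits for its slowest task; a production runtime would usually start a task as soon as its own dependencies finish.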

Monte Carlo Tree Search for agent decisions

Monte Carlo Tree Search (MCTS) adapts the game-tree search algorithm for agent decision-making. Instead of committing to one plan, the agent simulates multiple possible action sequences, scores the outcomes, and selects the action with the best expected outcome. It's the most compute-intensive technique here, and the most powerful for tasks where the action space is large and outcomes are hard to evaluate a priori.

# Simplified MCTS for agent step selection
import math
import random
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    state: dict           # Current agent state
    action_taken: str     # What action produced this state
    parent: 'MCTSNode | None' = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0
    
    @property
    def ucb1_score(self) -> float:
        """Upper Confidence Bound — balances exploitation vs exploration."""
        if self.visits == 0:
            return float('inf')
        exploitation = self.total_score / self.visits
        exploration = math.sqrt(2 * math.log(self.parent.visits) / self.visits)
        return exploitation + exploration

def mcts_select_action(
    current_state: dict,
    candidate_actions: list[str],
    simulator,        # Function: (state, action) -> (new_state, score)
    n_simulations: int = 50,
) -> str:
    root = MCTSNode(state=current_state, action_taken="root")
    
    for _ in range(n_simulations):
        # Selection: descend by UCB1 only through fully expanded nodes, so
        # every candidate action at a node is tried before searching deeper
        node = root
        while True:
            tried = {c.action_taken for c in node.children}
            untried = [a for a in candidate_actions if a not in tried]
            if untried or not node.children:
                break
            node = max(node.children, key=lambda n: n.ucb1_score)
        if not untried:
            break  # No candidate actions to evaluate

        # Expansion: try an untried action at the selected node
        action = random.choice(untried)
        new_state, score = simulator(node.state, action)
        child = MCTSNode(state=new_state, action_taken=action, parent=node)
        node.children.append(child)

        # Backpropagation: credit the new child and every ancestor up to the root
        node = child
        while node is not None:
            node.visits += 1
            node.total_score += score
            node = node.parent
    
    # Return the action with the most visits (most explored = most promising)
    best_child = max(root.children, key=lambda n: n.visits)
    return best_child.action_taken

MCTS requires a simulator: a fast function that predicts the outcome of an action without actually executing it. For agent systems, this is usually a lightweight LLM call that scores a hypothetical action sequence. The quality of your simulator determines the quality of MCTS decisions.
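A sketch of such a simulator, matching the `(state, action) -> (new_state, score)` shape, might look like this. `make_llm_simulator` and `score_fn` are hypothetical names; `score_fn` stands in for the lightweight LLM call and is assumed to return a score on a 0 to 10 scale. The score is normalized to [0, 1] because the UCB1 exploration term with the constant 2 assumes rewards in that range.

```python
from typing import Callable


def make_llm_simulator(
    score_fn: Callable[[str, list[str]], float],  # hypothetical: (goal, actions) -> 0..10
    goal: str,
) -> Callable[[dict, str], tuple[dict, float]]:
    """Wrap a scoring function as a simulator: (state, action) -> (new_state, score)."""

    def simulator(state: dict, action: str) -> tuple[dict, float]:
        # Copy the history so sibling branches never share mutable state
        history = list(state.get("history", [])) + [action]
        new_state = {**state, "history": history}
        raw = score_fn(goal, history)
        # Clamp, then normalize to [0, 1]
        score = min(max(raw, 0.0), 10.0) / 10.0
        return new_state, score

    return simulator
```

Because the simulator never executes a real tool, its predictions are only as good as the scorer, so it is worth comparing its scores against real outcomes on a sample of past runs.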

This is not a technique for every system. MCTS makes sense when: the action space is large (many possible next steps), the cost of a wrong action is high, and you have a reasonable scoring function. We've seen it applied effectively in code generation agents (where the action is which edit to make to a file) and in planning agents for complex multi-constraint scheduling problems.

Planning technique comparison

Technique	LLM calls per plan	Parallelism	Best for	Main cost
Chain-of-thought	1	None	Most tasks, baseline	Single-plan rigidity
Tree-of-thought	4–6	Plan generation only	High-stakes, irreversible tasks	4-6x LLM cost
Task graph	1 + execution	Full parallel execution	Research, data gathering	DAG runtime complexity
MCTS	50–200	Simulation phase	Large action spaces, constrained planning	Very high LLM cost

The honest ranking for most production systems: start with chain-of-thought decomposition. Add replanning on failure. When you need parallelism, shift to task graph. Reach for tree-of-thought only when plan quality is the limiting factor and you have the cost budget. MCTS is a specialist tool.

One pattern we've found consistently effective at Laxaar: combine chain-of-thought planning with a lightweight verification step before execution. The planner produces a step list; a second LLM call checks the plan against the available tools and flags any steps that can't be executed. This catches obvious planning errors before they become runtime failures. It costs one extra LLM call and prevents a lot of debugging.
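Part of that check can run before the verifier call is even made. The sketch below is a deterministic pre-filter, not the LLM verification described above: it assumes the planner emits a tool name with each step (the hypothetical `PlannedStep.tool` field) and only catches steps that name a tool the agent doesn't have.

```python
from dataclasses import dataclass


@dataclass
class PlannedStep:
    description: str
    tool: str  # Hypothetical field: the tool the planner intends to call


def find_unexecutable_steps(
    steps: list[PlannedStep], available_tools: list[str]
) -> list[tuple[int, str]]:
    """Return (step_number, tool_name) for steps that name an unavailable tool."""
    available = set(available_tools)
    return [
        (i, step.tool)
        for i, step in enumerate(steps, 1)
        if step.tool not in available
    ]
```

Anything this filter flags can go straight back to the planner; the LLM verifier is then left with the harder question of whether an available tool can actually do what the step asks.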

For more on how these planning techniques fit into different agent control structures, see "AI agent architectures compared". If you're also thinking about how agents retain information across planning cycles, "Agent memory systems explained" covers the persistence layer in depth.

Our AI automation expertise covers how Laxaar applies these planning patterns in production automation systems.

Frequently Asked Questions

Which planning technique should I start with for a new agent?

Chain-of-thought decomposition with structured output. It's cheap, debuggable, and handles the majority of production use cases. Build a replanning node that fires when a step fails; that covers the main weakness of single-plan approaches. Upgrade to a more complex technique only when you've measured a specific failure mode that simpler planning doesn't handle.

How do I make replanning work without the agent losing context?

Pass the original goal, the completed steps (with their results), the failed step, and the failure reason into the replanning prompt. The replanner produces a revised list of remaining steps only; it doesn't re-plan the whole task from scratch. This keeps the agent grounded in what's already been accomplished and focuses replanning on the actual problem.

Does tree-of-thought actually produce better plans than chain-of-thought?

For complex tasks with multiple viable approaches, yes, measurably so in benchmarks. For well-specified tasks with an obvious best path, the difference is small and usually not worth the 4-6x cost increase. The practical test: run both on 20 examples of your actual task, compare outcomes, and decide based on the measured delta.

How do I build an MCTS simulator for an LLM agent?

The simplest approach is a prompt that asks the model to score an action sequence from 0-10 given the current state and goal. More sophisticated simulators execute actions against a sandboxed environment and measure how close the resulting state is to the goal state. The sandboxed execution approach is more accurate but much more expensive to build. Start with the LLM-based scorer and validate against real outcomes before investing in a full simulator.

Can planning techniques be applied to streaming agent responses?

Yes, with care. The plan is generated upfront (non-streaming), and execution of each step can stream tool outputs to the user. The trick is that replanning, if a step fails, interrupts the stream. Design your UI to handle plan-revision events gracefully rather than treating the stream as a simple linear sequence.
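One way to model this is as a stream of typed events rather than raw text. The sketch below is an illustration only: the event names (`plan`, `step_started`, `output`, `plan_revised`, `step_done`, `failed`) and the `run_step` and `replan` callables are hypothetical, and `run_step` is assumed to yield output chunks for a step.

```python
from typing import Callable, Iterable, Iterator


def stream_execution(
    steps: list[str],
    run_step: Callable[[str], Iterable[str]],   # hypothetical: yields output chunks
    replan: Callable[[str, str], list[str]],    # hypothetical: (failed_step, reason) -> steps
    max_revisions: int = 2,                     # assumption: cap on plan revisions
) -> Iterator[dict]:
    """Yield typed events so a UI can tell step output from plan revisions."""
    remaining = list(steps)
    revisions = 0
    yield {"type": "plan", "steps": list(remaining)}
    while remaining:
        step = remaining.pop(0)
        yield {"type": "step_started", "step": step}
        try:
            for chunk in run_step(step):
                yield {"type": "output", "step": step, "chunk": chunk}
        except Exception as exc:
            if revisions >= max_revisions:
                yield {"type": "failed", "step": step, "reason": str(exc)}
                return
            revisions += 1
            remaining = list(replan(step, str(exc)))
            # The UI receives the revised plan as its own event, not as step output
            yield {"type": "plan_revised", "failed_step": step,
                   "reason": str(exc), "steps": list(remaining)}
            continue
        yield {"type": "step_done", "step": step}
```

A client can render `output` events inline and treat `plan_revised` as a signal to redraw the remaining step list.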

What's the relationship between planning and agent memory?

Planning determines the sequence of actions; memory determines what information is available during planning and execution. Better memory makes better plans possible. An agent with episodic memory of past similar tasks can plan around known failure modes. In practice, the planning prompt should include relevant retrieved memories (past similar tasks, known constraints) before asking the model to produce a step list.

If your agent system is hitting reliability problems on complex tasks, the planning layer is usually where the fix lives. Talk to the Laxaar team and we can review your current approach and help you find the planning technique that matches your workload.