Agent Planning Techniques for Reliable Execution
Explore agent planning techniques — chain-of-thought, tree-of-thought, MCTS, and task decomposition — that make AI agents reliable on complex multi-step tasks.

An agent without a planning strategy is just an LLM with tools. It picks the next action based on what looks right in the moment, and for tasks longer than 5–6 steps, "looks right in the moment" reliably produces drift, loops, and incomplete work. We've debugged enough production failures at Laxaar to say this with confidence: planning technique is as important as model choice.
Agent planning techniques are the methods an agent uses to decompose a goal into executable steps, decide the order of operations, and recover when a step doesn't go as expected. The right technique depends on task complexity, how reversible the actions are, and how much latency you can absorb.
This article covers four techniques you'll actually use in production (chain-of-thought decomposition, tree-of-thought search, task graph planning, and Monte Carlo Tree Search) with honest notes on cost and where each breaks.
What you'll learn
- Why planning matters for agent reliability
- Chain-of-thought decomposition: the baseline
- Tree-of-thought: exploring multiple paths
- Task graph planning: explicit dependencies
- Monte Carlo Tree Search for agent decisions
- Planning technique comparison
- Frequently Asked Questions
Why planning matters for agent reliability
Planning is the gap between knowing what the goal is and knowing what to do next. For a 2-step task, an LLM bridges this gap intuitively. For a 20-step task with branching conditions, a model without explicit planning loses track of where it is relative to the overall goal, repeats steps it has already completed, or takes actions that contradict earlier decisions.
The failure mode isn't usually a catastrophic crash. It's subtle drift: the agent completes something, but not quite the right thing, or completes 18 of 20 steps and stalls on step 19 because it's forgotten the context from step 3. These failures are expensive to debug because the agent's behavior looks plausible at each individual step.
Three signals tell you a planning technique is needed: tasks with more than 8 steps, tasks where one action affects what subsequent actions should be, and tasks where incorrect intermediate steps are costly or irreversible.
Chain-of-thought decomposition: the baseline
Chain-of-thought (CoT) decomposition prompts the model to produce an explicit step list before taking any action. It's the simplest planning technique and the right starting point for most agent systems.
# Chain-of-thought planning with structured output (LangChain v0.2+)
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field


class ExecutionPlan(BaseModel):
    goal: str = Field(description="Restatement of the user's goal")
    steps: list[str] = Field(description="Ordered list of concrete steps to accomplish the goal")
    success_criteria: str = Field(description="How to know the task is complete")
    risks: list[str] = Field(description="Potential failure points and how to handle them")


llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(ExecutionPlan)

planner_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a precise task planner. Given a goal, produce a concrete step-by-step plan.
Each step must be specific enough that an agent can execute it without ambiguity.
Identify risks that could cause steps to fail and note how to handle them."""),
    ("human", "Goal: {goal}\n\nAvailable tools: {tool_descriptions}")
])

planner = planner_prompt | structured_llm
plan = planner.invoke({
    "goal": "Research competitors in the CRM space and produce a comparison report",
    "tool_descriptions": "web_search, read_url, write_file, create_table"
})

print(f"Steps: {len(plan.steps)}")
for i, step in enumerate(plan.steps, 1):
    print(f"  {i}. {step}")
The structured output format matters. A plan that returns a flat list of strings gives the executor no information about dependencies or failure handling. Adding success_criteria and risks fields forces the model to think through the plan more carefully and gives your execution layer something to check against.
CoT decomposition has one notable weakness: it doesn't explore alternatives. The model generates one plan, and if that plan is wrong or hits an unexpected blocker, you need replanning logic. For most tasks, this is fine. Add a replanning node that fires when a step fails and generates a revised plan from the current state.
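The replanning node described above can be sketched without any framework. This is a minimal illustration, not a LangChain or LangGraph API: `run_with_replanning`, `fake_execute`, and `fake_replan` are hypothetical names, and the stubs stand in for real tool calls and a real LLM replanner.

```python
from typing import Callable

Completed = list[tuple[str, str]]  # (step, result) pairs


def run_with_replanning(
    goal: str,
    steps: list[str],
    execute_step: Callable[[str], str],
    replan: Callable[[str, Completed, str, str], list[str]],
    max_replans: int = 2,
) -> Completed:
    """Run steps in order; when one fails, swap in the replanner's remaining steps."""
    completed: Completed = []
    remaining = list(steps)
    replans_used = 0
    while remaining:
        step = remaining.pop(0)
        try:
            result = execute_step(step)
        except Exception as exc:
            if replans_used >= max_replans:
                raise RuntimeError(f"gave up after {replans_used} replans") from exc
            replans_used += 1
            # Replan from the current state: what is done, what failed, and why.
            remaining = replan(goal, completed, step, str(exc))
            continue
        completed.append((step, result))
    return completed


# Stub executor and replanner standing in for real tool and LLM calls.
def fake_execute(step: str) -> str:
    if step == "read_url paywalled-page":
        raise RuntimeError("403 Forbidden")
    return f"ok: {step}"


def fake_replan(goal: str, completed: Completed, failed_step: str, reason: str) -> list[str]:
    return ["web_search cached copy", "write_file report.md"]


history = run_with_replanning(
    "Summarize a page",
    ["web_search topic", "read_url paywalled-page", "write_file report.md"],
    fake_execute,
    fake_replan,
)
print([step for step, _ in history])
# ['web_search topic', 'web_search cached copy', 'write_file report.md']
```

The `max_replans` cap is a design choice worth keeping: without it, a replanner that keeps proposing the same failing step turns one failure into a loop.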
Tree-of-thought: exploring multiple paths
Tree-of-thought (ToT) extends chain-of-thought by generating multiple candidate plans (or multiple candidate next steps), evaluating each, and selecting the best one. It's more expensive: with 3–5 candidates plus an evaluation call, you're making 4–6 LLM calls where chain-of-thought makes one. In exchange, it produces better plans for tasks where the first idea is often not the best one.
# Tree-of-thought: generate, evaluate, select
import asyncio

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)  # Higher temp for diversity
evaluator_llm = ChatOpenAI(model="gpt-4o", temperature=0)


class PlanChoice(BaseModel):
    """Evaluator verdict. Structured output avoids parsing free text for the index."""
    best_index: int = Field(description="0-based index of the best plan")
    reasoning: str = Field(description="Why this plan scored highest")


async def generate_candidate_plans(goal: str, tools: str, n: int = 3) -> list[str]:
    """Generate n diverse candidate plans for the same goal."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Generate a concrete step-by-step execution plan. Be specific about tool usage."),
        ("human", "Goal: {goal}\nTools: {tools}\nApproach #{approach_num}: think of a distinct way to accomplish this.")
    ])
    # Launch all candidate generations concurrently
    tasks = [
        (prompt | llm).ainvoke({"goal": goal, "tools": tools, "approach_num": i})
        for i in range(1, n + 1)
    ]
    responses = await asyncio.gather(*tasks)
    return [r.content for r in responses]


async def evaluate_plans(goal: str, plans: list[str]) -> str:
    """Score each plan and return the best one."""
    evaluator_prompt = ChatPromptTemplate.from_messages([
        ("system", """Evaluate these candidate plans for completing a goal.
Score each on: completeness (1-5), efficiency (1-5), risk handling (1-5).
Return the index (0-based) of the best plan and explain why."""),
        ("human", "Goal: {goal}\n\nCandidates:\n{candidates}")
    ])
    candidates_text = "\n\n---\n\n".join(
        f"Plan {i}:\n{plan}" for i, plan in enumerate(plans)
    )
    chain = evaluator_prompt | evaluator_llm.with_structured_output(PlanChoice)
    choice = await chain.ainvoke({
        "goal": goal,
        "candidates": candidates_text,
    })
    # Fall back to the first plan if the model returns an out-of-range index
    best_idx = choice.best_index if 0 <= choice.best_index < len(plans) else 0
    return plans[best_idx]


async def tree_of_thought_plan(goal: str, tools: str) -> str:
    candidates = await generate_candidate_plans(goal, tools, n=3)
    best_plan = await evaluate_plans(goal, candidates)
    return best_plan
ToT is worth the cost when mistakes are expensive. If a wrong plan triggers irreversible API calls (sending emails, posting to external systems, modifying production data), spending 4–6x on planning to get a better plan upfront is cheap insurance. Don't use it for quick retrieval tasks where any reasonable plan works fine.
The evaluation step is where most implementations go wrong. Evaluating plans in the abstract (without knowing what the tools actually return) is hard. Better evaluators check structural properties: does the plan handle the most common failure case? Does it use the cheapest tool first? Is there a verification step before any write operations?
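Structural checks like these can run as plain code before, or instead of, an LLM judge. The sketch below uses keyword matching on step text, which is a deliberately crude stand-in. The tool names in `WRITE_TOOLS` and the keyword lists are assumptions for illustration, not part of any library.

```python
WRITE_TOOLS = ("write_file", "send_email", "post_update")  # hypothetical tool names
VERIFY_WORDS = ("verify", "check", "confirm", "validate")
FAILURE_WORDS = ("fail", "fallback", "retry")


def structural_checks(steps: list[str]) -> dict[str, bool]:
    """Check structural properties of a plan given as ordered step strings."""
    lowered = [step.lower() for step in steps]
    first_write = next(
        (i for i, step in enumerate(lowered) if any(t in step for t in WRITE_TOOLS)),
        None,
    )
    # True when there is no write at all, or some earlier step verifies something.
    verifies_before_write = first_write is None or any(
        any(word in step for word in VERIFY_WORDS) for step in lowered[:first_write]
    )
    handles_failure = any(
        any(word in step for word in FAILURE_WORDS) for step in lowered
    )
    return {
        "verifies_before_write": verifies_before_write,
        "handles_failure": handles_failure,
    }


def pick_best_plan(plans: list[list[str]]) -> int:
    """Return the index of the plan passing the most checks (first one wins ties)."""
    scores = [sum(structural_checks(plan).values()) for plan in plans]
    return scores.index(max(scores))


hasty = ["web_search CRM vendors", "write_file report.md"]
careful = [
    "web_search CRM vendors",
    "verify pricing against vendor sites; retry on timeout",
    "write_file report.md",
]
print(pick_best_plan([hasty, careful]))  # 1
```

In a real system these checks would more likely run against structured steps (tool name plus arguments) than free text, but the idea is the same: score properties you can verify rather than asking a model whether the plan "looks good".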
Task graph planning: explicit dependencies
Task graph planning represents the task as a directed acyclic graph (DAG) where nodes are subtasks and edges are dependencies. Tasks with no incoming edges can run immediately; tasks with incoming edges wait for their dependencies to complete. This enables real parallelism.
# Task graph planning with LangGraph (v0.2+)
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator


class TaskGraphState(TypedDict):
    goal: str
    task_graph: dict  # {task_id: {"description": str, "deps": list[str], "status": str, "result": str}}
    completed_tasks: Annotated[list, operator.add]


def build_task_graph(goal: str, llm) -> dict:
    """Ask the LLM to produce a dependency graph for the goal."""
    from langchain_core.prompts import ChatPromptTemplate
    from pydantic import BaseModel

    class Task(BaseModel):
        id: str
        description: str
        dependencies: list[str]  # IDs of tasks that must complete first

    class TaskGraph(BaseModel):
        tasks: list[Task]

    prompt = ChatPromptTemplate.from_messages([
        ("system", """Decompose the goal into parallel tasks with dependencies.
Tasks with no dependencies can run in parallel.
Only add a dependency when a task truly needs output from another task."""),
        ("human", "Goal: {goal}")
    ])
    graph = (prompt | llm.with_structured_output(TaskGraph)).invoke({"goal": goal})
    return {
        task.id: {
            "description": task.description,
            "deps": task.dependencies,
            "status": "pending",
            "result": None,
        }
        for task in graph.tasks
    }


def get_ready_tasks(task_graph: dict) -> list[str]:
    """Return IDs of tasks whose dependencies are all complete."""
    return [
        task_id
        for task_id, task in task_graph.items()
        if task["status"] == "pending"
        and all(task_graph[dep]["status"] == "complete" for dep in task["deps"])
    ]
Task graph planning is the architecture that enables genuine parallelism: tasks with no unmet dependencies execute concurrently, so wall-clock time is bounded by the longest dependency chain rather than the sum of all tasks. A research task that decomposes into 4 independent subtasks of similar duration (search for X, search for Y, look up Z, retrieve document W) completes in roughly 1/4 the time of a sequential plan, provided rate limits don't serialize the calls.
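To sanity-check a speedup estimate like the 1/4 figure, you can compare the sum of task durations with the longest dependency chain. This is a small sketch with made-up durations; `plan_durations` is a hypothetical helper, and it assumes an acyclic graph, unlimited concurrency, and no rate limiting.

```python
from functools import lru_cache


def plan_durations(
    durations: dict[str, float], deps: dict[str, list[str]]
) -> tuple[float, float]:
    """Return (sequential time, parallel time) for an acyclic task graph.

    Parallel time is the longest dependency chain (the critical path).
    """

    @lru_cache(maxsize=None)
    def finish_time(task_id: str) -> float:
        # A task starts when its slowest dependency finishes.
        start = max((finish_time(dep) for dep in deps.get(task_id, [])), default=0.0)
        return start + durations[task_id]

    sequential = sum(durations.values())
    parallel = max(finish_time(task_id) for task_id in durations)
    return sequential, parallel


# Four independent 10-second lookups, then a 5-second report that needs all four.
durations = {"x": 10.0, "y": 10.0, "z": 10.0, "w": 10.0, "report": 5.0}
deps = {"report": ["x", "y", "z", "w"]}
print(plan_durations(durations, deps))  # (45.0, 15.0)
```

The four lookups alone go from 40 seconds to 10, which is the 1/4 case. Once the dependent report step is included, the end-to-end ratio is 15/45, so the gain depends on how much of the plan sits on the critical path.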
The engineering overhead is real. You need a runtime that can execute DAG nodes concurrently, propagate results between dependent nodes, and handle partial failures (one node fails; what do the dependent nodes do?). LangGraph handles this well with its parallel branch support. The Laxaar team uses task graph planning specifically for research and data-gathering workflows where task independence is high and latency matters.
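To make those runtime requirements concrete, here is a minimal asyncio sketch that works on the same dictionary shape as `build_task_graph` returns. It is not LangGraph's implementation. `run_task_graph`, `fake_run`, and `new_task` are hypothetical names, and the policy for partial failures (skip anything downstream of a failed task) is one assumption among several reasonable ones.

```python
import asyncio


async def run_task_graph(task_graph: dict, run_task) -> dict:
    """Run a task graph in waves, executing all ready tasks of a wave concurrently.

    run_task is a hypothetical coroutine: (task_id, task, dep_results) -> result.
    """
    while True:
        ready = [
            task_id
            for task_id, task in task_graph.items()
            if task["status"] == "pending"
            and all(task_graph[dep]["status"] == "complete" for dep in task["deps"])
        ]
        if not ready:
            break
        outcomes = await asyncio.gather(
            *(
                run_task(
                    task_id,
                    task_graph[task_id],
                    {dep: task_graph[dep]["result"] for dep in task_graph[task_id]["deps"]},
                )
                for task_id in ready
            ),
            return_exceptions=True,  # collect failures as values instead of raising
        )
        for task_id, outcome in zip(ready, outcomes):
            failed = isinstance(outcome, BaseException)
            task_graph[task_id]["status"] = "failed" if failed else "complete"
            task_graph[task_id]["result"] = str(outcome) if failed else outcome
    # Anything still pending is blocked by a failed task (or by a cycle).
    for task in task_graph.values():
        if task["status"] == "pending":
            task["status"] = "skipped"
    return task_graph


async def fake_run(task_id: str, task: dict, dep_results: dict) -> str:
    """Stub worker standing in for real tool calls."""
    await asyncio.sleep(0.01)
    if task_id == "fetch_pricing":
        raise RuntimeError("timeout")
    return f"{task_id} done"


def new_task(deps: list[str]) -> dict:
    return {"description": "", "deps": deps, "status": "pending", "result": None}


if __name__ == "__main__":
    graph = {
        "search_vendors": new_task([]),
        "fetch_pricing": new_task([]),
        "vendor_table": new_task(["search_vendors"]),
        "pricing_table": new_task(["fetch_pricing"]),
    }
    final = asyncio.run(run_task_graph(graph, fake_run))
    print({task_id: task["status"] for task_id, task in final.items()})
```

Running in waves is the simplest correct scheduler, but each wave waits for its slowest task. A production runtime would typically start a task the moment its own dependencies finish.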
Monte Carlo Tree Search for agent decisions
Monte Carlo Tree Search (MCTS) adapts the game-tree search algorithm for agent decision-making. Instead of committing to one plan, the agent simulates multiple possible action sequences, scores the outcomes, and selects the action with the best expected outcome. It's the most compute-intensive technique here, and the most powerful for tasks where the action space is large and outcomes are hard to evaluate a priori.
# Simplified MCTS for agent step selection
import math
import random
from dataclasses import dataclass, field


@dataclass
class MCTSNode:
    state: dict  # Current agent state
    action_taken: str  # What action produced this state
    parent: 'MCTSNode | None' = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0

    @property
    def ucb1_score(self) -> float:
        """Upper Confidence Bound: balances exploitation vs exploration."""
        if self.visits == 0:
            return float('inf')
        # The sqrt(2) exploration constant assumes scores roughly in [0, 1]
        exploitation = self.total_score / self.visits
        exploration = math.sqrt(2 * math.log(self.parent.visits) / self.visits)
        return exploitation + exploration


def mcts_select_action(
    current_state: dict,
    candidate_actions: list[str],
    simulator,  # Function: (state, action) -> (new_state, score)
    n_simulations: int = 50,
) -> str:
    actions = list(dict.fromkeys(candidate_actions))  # de-duplicate, keep order
    if not actions:
        raise ValueError("candidate_actions must not be empty")
    root = MCTSNode(state=current_state, action_taken="root")
    for _ in range(n_simulations):
        # Selection: descend by UCB1, but only through fully expanded nodes,
        # so every action gets tried at a node before the search goes deeper
        node = root
        while len(node.children) == len(actions):
            node = max(node.children, key=lambda n: n.ucb1_score)
        # Expansion: try an untried action from the selected node
        tried = {c.action_taken for c in node.children}
        untried = [a for a in actions if a not in tried]
        action = random.choice(untried)
        new_state, score = simulator(node.state, action)
        child = MCTSNode(state=new_state, action_taken=action, parent=node)
        node.children.append(child)
        # Backpropagation: update the new child and every ancestor
        node = child
        while node is not None:
            node.visits += 1
            node.total_score += score
            node = node.parent
    # Return the action with the most visits (most explored = most promising)
    best_child = max(root.children, key=lambda n: n.visits)
    return best_child.action_taken
MCTS requires a simulator: a fast function that predicts the outcome of an action without actually executing it. For agent systems, this is usually a lightweight LLM call that scores a hypothetical action sequence. The quality of your simulator determines the quality of MCTS decisions.
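One way to wrap a scorer into the `(state, action) -> (new_state, score)` shape is sketched below. The `"history"` key in the state, the 0–10 raw score range, and the names `make_simulator` and `fake_score` are all assumptions for illustration. In practice `score_fn` would be the lightweight LLM call, and here a stub stands in for it.

```python
from typing import Callable


def make_simulator(
    goal: str,
    score_fn: Callable[[str, list[str]], float],  # hypothetical: (goal, actions) -> 0..10
    max_score: float = 10.0,
):
    """Build a (state, action) -> (new_state, score) simulator around a scoring function."""
    cache: dict[tuple[str, ...], float] = {}

    def simulator(state: dict, action: str) -> tuple[dict, float]:
        history = [*state.get("history", []), action]
        key = tuple(history)
        if key not in cache:
            raw = score_fn(goal, history)
            # Normalize and clamp to [0, 1] so scores are comparable across calls.
            cache[key] = min(max(raw / max_score, 0.0), 1.0)
        # Return a new state instead of mutating the caller's dict.
        return {**state, "history": history}, cache[key]

    return simulator


calls: list[tuple[str, ...]] = []


def fake_score(goal: str, actions: list[str]) -> float:
    """Stub scorer standing in for an LLM call."""
    calls.append(tuple(actions))
    return 7.0 if actions[-1] == "run_tests" else 3.0


simulate = make_simulator("fix the failing build", fake_score)
state = {"history": ["read_log"]}
new_state, score = simulate(state, "run_tests")
print(new_state["history"], score)
```

The cache matters more than it looks: a tree search revisits the same action sequences repeatedly, and each cache hit is one LLM call you don't pay for.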
This is not a technique for every system. MCTS makes sense when: the action space is large (many possible next steps), the cost of a wrong action is high, and you have a reasonable scoring function. We've seen it applied effectively in code generation agents (where the action is which edit to make to a file) and in planning agents for complex multi-constraint scheduling problems.
Planning technique comparison
| Technique | LLM calls per plan | Parallelism | Best for | Main cost |
|---|---|---|---|---|
| Chain-of-thought | 1 | None | Most tasks, baseline | Single-plan rigidity |
| Tree-of-thought | 4–6 | Plan generation only | High-stakes, irreversible tasks | 4-6x LLM cost |
| Task graph | 1 + execution | Full parallel execution | Research, data gathering | DAG runtime complexity |
| MCTS | 50–200 | Simulation phase | Large action spaces, constrained planning | Very high LLM cost |
The honest ranking for most production systems: start with chain-of-thought decomposition. Add replanning on failure. When you need parallelism, shift to task graph. Reach for tree-of-thought only when plan quality is the limiting factor and you have the cost budget. MCTS is a specialist tool.
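As a rough illustration, that ranking can be written down as a decision function. The flag names and the order in which conflicting signals are resolved are assumptions made for this sketch, so treat it as a starting point to adapt rather than a rule.

```python
def choose_planning_technique(
    needs_parallelism: bool = False,
    plan_quality_is_bottleneck: bool = False,
    has_cost_budget: bool = False,
    large_action_space: bool = False,
    has_scoring_function: bool = False,
) -> str:
    """Map workload signals to a planning technique, defaulting to the cheapest."""
    # MCTS only when every precondition holds: it is the most expensive option.
    if large_action_space and has_scoring_function and has_cost_budget:
        return "mcts"
    # Tree-of-thought only when plan quality is the limit and the budget exists.
    if plan_quality_is_bottleneck and has_cost_budget:
        return "tree-of-thought"
    if needs_parallelism:
        return "task graph"
    return "chain-of-thought + replanning"


print(choose_planning_technique())  # chain-of-thought + replanning
print(choose_planning_technique(needs_parallelism=True))  # task graph
```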
One pattern we've found consistently effective at Laxaar: combine chain-of-thought planning with a lightweight verification step before execution. The planner produces a step list; a second LLM call checks the plan against the available tools and flags any steps that can't be executed. This catches obvious planning errors before they become runtime failures. It costs one extra LLM call and prevents a lot of debugging.
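The verification described above uses a second LLM call. A cheaper deterministic pre-check can run before it and catch the most mechanical errors. The sketch below is that swapped-in pre-check, not the LLM verifier itself, and it assumes a hypothetical convention where every step is written as `tool_name: description`.

```python
def flag_unexecutable_steps(
    steps: list[str], available_tools: set[str]
) -> list[tuple[int, str]]:
    """Return (1-based step number, reason) for steps that name no available tool."""
    problems = []
    for number, step in enumerate(steps, 1):
        tool, separator, _ = step.partition(":")
        tool = tool.strip()
        if not separator:
            problems.append((number, "no tool named"))
        elif tool not in available_tools:
            problems.append((number, f"unknown tool '{tool}'"))
    return problems


steps = [
    "web_search: find CRM vendors",
    "scrape_site: pull pricing pages",
    "summarize findings",
    "write_file: save report.md",
]
tools = {"web_search", "read_url", "write_file", "create_table"}
print(flag_unexecutable_steps(steps, tools))
# [(2, "unknown tool 'scrape_site'"), (3, 'no tool named')]
```

Anything this check flags can go straight back to the planner with the reason attached, which leaves the LLM verifier to look at the subtler problems, such as steps that name a valid tool but use it for something it can't do.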
For more on how these planning techniques fit into different agent control structures, see AI agent architectures compared. If you're also thinking about how agents retain information across planning cycles, agent memory systems explained covers the persistence layer in depth.
Our AI automation expertise covers how Laxaar applies these planning patterns in production automation systems.
Frequently Asked Questions
Which planning technique should I start with for a new agent?
Chain-of-thought decomposition with structured output. It's cheap, debuggable, and handles the majority of production use cases. Build a replanning node that fires when a step fails; that covers the main weakness of single-plan approaches. Upgrade to a more complex technique only when you've measured a specific failure mode that simpler planning doesn't handle.
How do I make replanning work without the agent losing context?
Pass the original goal, the completed steps (with their results), the failed step, and the failure reason into the replanning prompt. The replanner produces a revised list of remaining steps only; it doesn't re-plan the whole task from scratch. This keeps the agent grounded in what's already been accomplished and focuses replanning on the actual problem.
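A minimal sketch of assembling that replanning prompt follows. `build_replanning_prompt` is a hypothetical helper, and the exact wording of the instructions is an assumption you would tune for your model.

```python
def build_replanning_prompt(
    goal: str,
    completed: list[tuple[str, str]],  # (step, result) pairs, in execution order
    failed_step: str,
    failure_reason: str,
) -> str:
    """Assemble the context a replanner needs to revise only the remaining steps."""
    done = "\n".join(
        f"{i}. {step}\n   Result: {result}"
        for i, (step, result) in enumerate(completed, 1)
    ) or "(none yet)"
    return (
        f"Original goal: {goal}\n\n"
        f"Completed steps and results:\n{done}\n\n"
        f"Failed step: {failed_step}\n"
        f"Failure reason: {failure_reason}\n\n"
        "Produce a revised list of the remaining steps only. "
        "Do not repeat completed steps."
    )


prompt = build_replanning_prompt(
    "Compare CRM vendors",
    [("web_search CRM vendors", "found 5 vendors")],
    "read_url vendor-a/pricing",
    "403 Forbidden",
)
print(prompt)
```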
Does tree-of-thought actually produce better plans than chain-of-thought?
For complex tasks with multiple viable approaches, yes, measurably so in benchmarks. For well-specified tasks with an obvious best path, the difference is small and usually not worth the 4-6x cost increase. The practical test: run both on 20 examples of your actual task, compare outcomes, and decide based on the measured delta.
How do I build an MCTS simulator for an LLM agent?
The simplest approach is a prompt that asks the model to score an action sequence from 0-10 given the current state and goal. More sophisticated simulators execute actions against a sandboxed environment and measure how close the resulting state is to the goal state. The sandboxed execution approach is more accurate but much more expensive to build. Start with the LLM-based scorer and validate against real outcomes before investing in a full simulator.
Can planning techniques be applied to streaming agent responses?
Yes, with care. The plan is generated upfront (non-streaming), and execution of each step can stream tool outputs to the user. The trick is that replanning, if a step fails, interrupts the stream. Design your UI to handle plan-revision events gracefully rather than treating the stream as a simple linear sequence.
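One way to model this is to stream typed events rather than raw text, so a revised plan is just another event the UI can render. The event names and the `stream_plan_execution` generator below are assumptions for illustration. The stubs stand in for real tool execution and an LLM replanner.

```python
from typing import Callable, Iterable, Iterator


def stream_plan_execution(
    steps: list[str],
    execute_step: Callable[[str], Iterable[str]],  # yields output chunks for one step
    replan: Callable[[str, str], list[str]],  # (failed_step, reason) -> remaining steps
    max_replans: int = 1,
) -> Iterator[dict]:
    """Yield UI events; a failed step emits plan_revised instead of silently ending."""
    remaining = list(steps)
    replans_used = 0
    yield {"type": "plan", "steps": list(remaining)}
    while remaining:
        step = remaining.pop(0)
        yield {"type": "step_started", "step": step}
        try:
            for chunk in execute_step(step):
                yield {"type": "output", "step": step, "chunk": chunk}
        except Exception as exc:
            yield {"type": "step_failed", "step": step, "reason": str(exc)}
            if replans_used >= max_replans:
                yield {"type": "aborted"}
                return
            replans_used += 1
            remaining = replan(step, str(exc))
            yield {"type": "plan_revised", "steps": list(remaining)}
            continue
        yield {"type": "step_done", "step": step}
    yield {"type": "done"}


def fake_execute(step: str) -> Iterator[str]:
    """Stub tool: one step fails partway through its output."""
    if step == "fetch_report":
        yield "connecting..."
        raise RuntimeError("timeout")
    yield f"{step}: ok"


def fake_replan(failed_step: str, reason: str) -> list[str]:
    return ["fetch_cached_report"]


for event in stream_plan_execution(["search", "fetch_report"], fake_execute, fake_replan):
    print(event["type"], event.get("step", ""))
```

Note that the failed step may already have streamed partial output before the `step_failed` event arrives, so the UI needs a way to mark that output as superseded.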
What's the relationship between planning and agent memory?
Planning determines the sequence of actions; memory determines what information is available during planning and execution. Better memory makes better plans possible. An agent with episodic memory of past similar tasks can plan around known failure modes. In practice, the planning prompt should include relevant retrieved memories (past similar tasks, known constraints) before asking the model to produce a step list.
If your agent system is hitting reliability problems on complex tasks, the planning layer is usually where the fix lives. Talk to the Laxaar team and we can review your current approach and help you find the planning technique that matches your workload.


