Creating Agentic Coding Workflows

Last quarter a client handed us a failing CI pipeline that nobody wanted to maintain. The agent we shipped two weeks later reviews every PR, runs the affected tests, and posts findings as review comments. No human touches the process until a finding needs a judgment call. That's an agentic coding workflow: a pipeline where an AI agent drives execution (reading code, running tests, making changes, iterating on results) rather than waiting for a human to act on suggestions.

Building something like this that holds up in production requires sequencing your decisions differently than you might expect. Task decomposition comes before model selection. Error recovery matters more than the happy path. Auditability is not optional. The Laxaar team has shipped these workflows across web, mobile, and cloud projects, and this tutorial covers what actually works at scale.

Prerequisites: Familiarity with at least one AI coding tool (Claude Code, Cursor, or similar), Node.js 20+ or Python 3.11+, and a Git repository to work with.

What you'll build

Decompose a coding task into an agent-friendly graph
Select and configure the right tools for each step
Write a workflow orchestrator that handles failures
Integrate the workflow into your CI pipeline
Add observability so you can debug agent runs

Step 1: Decompose the task

The single biggest mistake in agentic workflow design is giving the agent one giant task and hoping it figures out the steps. It won't. Not reliably. Break the work into a directed graph of discrete steps, where each step has a clear input, a clear output, and a verification condition.

Here's a task decomposition for an automated code review workflow:

Task: Review a pull request and post findings

Step 1: Fetch PR diff
  Input: PR number
  Output: unified diff string
  Verify: diff is non-empty

Step 2: Analyze each changed file
  Input: list of changed files + their diffs
  Output: findings per file (array of {file, line, severity, message})
  Verify: findings array is valid JSON, severity in ['error','warning','info']

Step 3: Check for failing tests
  Input: list of changed source files
  Output: test results for affected test files
  Verify: check the test runner's exit code, not just stdout

Step 4: Post review comment
  Input: findings + test results
  Output: GitHub comment ID
  Verify: HTTP 201 from GitHub API

Each step is independently testable. You can run step 2 in isolation with a canned diff, without touching GitHub. That's the property you want: the whole workflow becomes dramatically easier to debug.

In code, represent this as a typed step registry:

type StepResult<T> = { ok: true; value: T } | { ok: false; error: string };

interface WorkflowStep<TIn, TOut> {
  name: string;
  run: (input: TIn) => Promise<StepResult<TOut>>;
  verify: (output: TOut) => boolean;
}
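As one way to put the interface to work, the sketch below defines the fetch-diff step against it, plus a small runner that treats a failed verification as a failed step. The names `runStep` and `fetchDiffStep` and the canned diff are illustrative assumptions, not part of the workflow described here.

```typescript
type StepResult<T> = { ok: true; value: T } | { ok: false; error: string };

interface WorkflowStep<TIn, TOut> {
  name: string;
  run: (input: TIn) => Promise<StepResult<TOut>>;
  verify: (output: TOut) => boolean;
}

// Hypothetical runner: a step succeeds only if run() succeeds AND verify() passes.
async function runStep<TIn, TOut>(
  step: WorkflowStep<TIn, TOut>,
  input: TIn,
): Promise<StepResult<TOut>> {
  const result = await step.run(input);
  if (!result.ok) return result;
  if (!step.verify(result.value)) {
    return { ok: false, error: `${step.name}: verification failed` };
  }
  return result;
}

// Hypothetical step 1, stubbed with a canned diff so it runs without GitHub.
const fetchDiffStep: WorkflowStep<number, string> = {
  name: "fetch_pr_diff",
  run: async (prNumber) => ({
    ok: true,
    value: prNumber > 0 ? "diff --git a/app.ts b/app.ts\n+const x = 1;" : "",
  }),
  // Mirrors the "diff is non-empty" condition from the decomposition.
  verify: (diff) => diff.trim().length > 0,
};
```

Because the stub never touches the network, the same runner can exercise any step in isolation with canned input.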

Step 2: Select tools

Tool selection is a matching problem: what can the agent call, and does that cover every step in the graph? For coding workflows, the standard set is:

Tool category	Example	When to use
File I/O	`readFile`, `writeFile`	Reading source, writing patches
Shell execution	`execSync`, `child_process`	Running tests, builds, linters
Git operations	`git diff`, `git log`	Fetching diffs, history
HTTP client	`fetch`, `axios`	Calling GitHub, Jira, Slack APIs
Code search	`grep`, `ripgrep`, AST tools	Finding patterns across a codebase

Resist the urge to give the agent every tool at once. More tools mean more possible wrong choices and more tool definitions for the model to read on every call. Give the agent the minimum set that covers the task graph.
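One way to make the matching problem concrete is to declare which tools each step needs and check that against the registry before anything runs. This is a minimal sketch; the `requiredTools` field and the tool names `post_comment` and `web_search` are hypothetical.

```typescript
interface StepToolNeeds {
  name: string;
  requiredTools: string[]; // hypothetical field, used only for this check
}

// Returns, per step, the tools the registry does not provide.
function findMissingTools(
  steps: StepToolNeeds[],
  registeredTools: string[],
): Record<string, string[]> {
  const available = new Set(registeredTools);
  const missing: Record<string, string[]> = {};
  for (const step of steps) {
    const gaps = step.requiredTools.filter((tool) => !available.has(tool));
    if (gaps.length > 0) missing[step.name] = gaps;
  }
  return missing;
}

// Returns registered tools that no step needs: candidates for removal.
function findUnusedTools(steps: StepToolNeeds[], registeredTools: string[]): string[] {
  const needed = new Set<string>();
  for (const step of steps) {
    for (const tool of step.requiredTools) needed.add(tool);
  }
  return registeredTools.filter((tool) => !needed.has(tool));
}

const prReviewSteps: StepToolNeeds[] = [
  { name: "fetch_diff", requiredTools: ["git_diff"] },
  { name: "check_tests", requiredTools: ["run_tests"] },
  { name: "post_review", requiredTools: ["post_comment"] },
];
const registered = ["git_diff", "run_tests", "web_search"];

const missing = findMissingTools(prReviewSteps, registered);
const unused = findUnusedTools(prReviewSteps, registered);
```

Here `missing` flags that `post_review` has no `post_comment` tool, and `unused` flags `web_search` as a tool the graph never needs.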

Here's a minimal tool registry for the PR review workflow:

import { execFileSync } from "child_process";

export const tools = [
  {
    name: "git_diff",
    description: "Get the unified diff for a pull request. Returns raw diff text.",
    input_schema: {
      type: "object" as const,
      properties: {
        base: { type: "string", description: "Base branch or commit SHA." },
        head: { type: "string", description: "Head branch or commit SHA." },
      },
      required: ["base", "head"],
    },
    handler: ({ base, head }: { base: string; head: string }) => {
      // Arguments go in an array, not a shell string, so model-supplied
      // refs cannot inject extra shell commands.
      return execFileSync("git", ["diff", `${base}...${head}`], { encoding: "utf-8" });
    },
  },
  {
    name: "run_tests",
    description: "Run the test suite for a list of test files. Returns stdout, plus stderr on failure.",
    input_schema: {
      type: "object" as const,
      properties: {
        files: {
          type: "array",
          items: { type: "string" },
          description: "Paths to test files to run.",
        },
      },
      required: ["files"],
    },
    handler: ({ files }: { files: string[] }) => {
      try {
        // Each file path is passed as its own argument, again bypassing the shell.
        return execFileSync("npx", ["vitest", "run", ...files], {
          encoding: "utf-8",
          timeout: 120_000,
        });
      } catch (err: unknown) {
        // A non-zero exit code (failing tests) or a timeout lands here.
        const e = err as { stdout?: string; stderr?: string; message?: string };
        return `FAILED:\n${e.stdout ?? ""}\n${e.stderr ?? e.message ?? ""}`;
      }
    },
  },
];

Keep handlers thin. Business logic belongs in your application code, not in tool handlers. The handler executes and returns a string. The agent decides what to execute.

Step 3: Write the orchestrator

The orchestrator runs the agent through each step in the task graph, checks verification conditions, and handles failures. It's not the same as the tool-calling loop inside a single agent. This is a higher-level controller that decides when to move forward, retry, or abort.

import Anthropic from "@anthropic-ai/sdk";
import { tools } from "./tools.js";

const client = new Anthropic();

interface StepSpec {
  name: string;
  prompt: string;
  maxRetries?: number;
  verify?: (result: string) => boolean;
}

async function runWorkflow(steps: StepSpec[], context: string): Promise<void> {
  const workflowLog: string[] = [];

  for (const step of steps) {
    const maxRetries = step.maxRetries ?? 2;
    let attempts = 0;
    let success = false;

    while (attempts <= maxRetries && !success) {
      attempts++;
      console.log(`\n=== ${step.name} (attempt ${attempts}) ===`);

      const messages: Anthropic.MessageParam[] = [
        {
          role: "user",
          content: `${context}\n\nWorkflow log so far:\n${workflowLog.join("\n")}\n\nCurrent step: ${step.prompt}`,
        },
      ];

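      let result: string;
      try {
        result = await runAgentStep(messages);
      } catch (err) {
        // API and tool errors count as a failed attempt rather than crashing the run.
        console.warn(`Step ${step.name} threw: ${String(err)}`);
        workflowLog.push(`STEP ${step.name}: ERROR (attempt ${attempts})`);
        continue;
      }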

      if (step.verify && !step.verify(result)) {
        console.warn(`Verification failed for step: ${step.name}`);
        workflowLog.push(`STEP ${step.name}: FAILED VERIFICATION (attempt ${attempts})`);
        continue;
      }

      workflowLog.push(`STEP ${step.name}: ${result.slice(0, 300)}`);
      success = true;
    }

    if (!success) {
      throw new Error(`Step "${step.name}" failed after ${maxRetries + 1} attempts. Aborting workflow.`);
    }
  }

  console.log("\n=== Workflow complete ===");
  console.log(workflowLog.join("\n\n"));
}

async function runAgentStep(messages: Anthropic.MessageParam[]): Promise<string> {
  const localMessages = [...messages];

  // Cap tool-call rounds so a confused agent cannot loop forever.
  for (let i = 0; i < 10; i++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 4096,
      tools: tools.map(({ name, description, input_schema }) => ({
        name,
        description,
        input_schema,
      })),
      messages: localMessages,
    });

    if (response.stop_reason === "end_turn") {
      return response.content
        .filter((b) => b.type === "text")
        .map((b) => (b as Anthropic.TextBlock).text)
        .join("\n");
    }

    if (response.stop_reason !== "tool_use") {
      // For example "max_tokens": resending the same messages would only repeat the result.
      throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
    }

    localMessages.push({ role: "assistant", content: response.content });
    const results: Anthropic.ToolResultBlockParam[] = [];

    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const tool = tools.find((t) => t.name === block.name);
      try {
        const result = tool
          ? String(tool.handler(block.input as never))
          : `Unknown tool: ${block.name}`;
        results.push({ type: "tool_result", tool_use_id: block.id, content: result, is_error: !tool });
      } catch (err) {
        // Report handler failures to the model instead of crashing the step.
        results.push({ type: "tool_result", tool_use_id: block.id, content: `Tool error: ${String(err)}`, is_error: true });
      }
    }

    localMessages.push({ role: "user", content: results });
  }

  // Throw rather than return a sentinel string that could pass as a real result.
  throw new Error("Step exceeded maximum iterations.");
}

The key design decision here: the orchestrator carries a workflowLog across steps and injects it into every step's prompt. Each runAgentStep call starts a fresh conversation, so the log is the only way the agent running step 4 can see what steps 1-3 produced. Note that the log keeps only the first 300 characters of each result; raise that limit, or store full outputs separately, if a later step needs more than a summary.
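To show how the `verify` hook might be used, here is a sketch of step specs for the PR review task. The prompts, the `isValidFindings` helper, and the assumption that the analysis step replies with a bare JSON array are illustrative, not part of the orchestrator itself.

```typescript
interface StepSpec {
  name: string;
  prompt: string;
  maxRetries?: number;
  verify?: (result: string) => boolean;
}

const SEVERITIES = ["error", "warning", "info"];

// Hypothetical check: the reply must be a JSON array of findings whose
// severity is one of the allowed values.
function isValidFindings(result: string): boolean {
  try {
    const parsed: unknown = JSON.parse(result);
    if (!Array.isArray(parsed)) return false;
    return parsed.every(
      (f) =>
        typeof f === "object" &&
        f !== null &&
        typeof f.file === "string" &&
        typeof f.message === "string" &&
        SEVERITIES.indexOf(f.severity) !== -1,
    );
  } catch {
    return false; // not JSON at all
  }
}

const reviewSteps: StepSpec[] = [
  {
    name: "analyze_files",
    prompt:
      "Call git_diff for the PR, review each changed file, and reply with only " +
      "a JSON array of {file, line, severity, message}.",
    maxRetries: 3,
    verify: isValidFindings,
  },
  {
    name: "check_tests",
    prompt: "Run the tests affected by the change with run_tests and summarize any failures.",
    verify: (result) => result.trim().length > 0,
  },
];
```

Passing `reviewSteps` to `runWorkflow` would then retry `analyze_files` whenever the reply fails to parse as a findings array.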

Step 4: CI integration

An agentic coding workflow that only runs locally isn't that useful. The real value comes when it runs automatically: on every PR, on a schedule, or triggered by a webhook.

Here's a GitHub Actions workflow that runs the PR review agent:

# .github/workflows/agent-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    timeout-minutes: 10  # hard cap so a stuck agent can't burn CI minutes
    permissions:
      pull-requests: write
      contents: read

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # needed for git diff

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Run agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_BASE: ${{ github.event.pull_request.base.sha }}
          PR_HEAD: ${{ github.event.pull_request.head.sha }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: npx tsx src/workflows/pr-review.ts

Two things to get right in CI: fetch-depth: 0 (the default checkout is shallow, so without it the base commit may be missing and git diff fails), and GITHUB_TOKEN with the pull-requests: write permission so the agent can post its review comments.

Set a hard timeout on the job with timeout-minutes. Ten minutes is generous for a PR review, and an agent that runs over budget in CI will consume your GitHub Actions minutes fast.
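The entry point that the workflow file invokes is not shown here. As a rough sketch, it could start by reading the variables set in the job's `env` block and failing fast when one is missing. The `readReviewConfig` name and the `ReviewConfig` shape are assumptions for illustration.

```typescript
interface ReviewConfig {
  base: string;
  head: string;
  prNumber: number;
}

// Takes the environment as a parameter so it can be tested without a CI runner.
function readReviewConfig(env: Record<string, string | undefined>): ReviewConfig {
  const required = ["PR_BASE", "PR_HEAD", "PR_NUMBER"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const prNumber = Number(env.PR_NUMBER);
  if (!Number.isInteger(prNumber) || prNumber <= 0) {
    throw new Error(`PR_NUMBER is not a positive integer: ${env.PR_NUMBER}`);
  }
  return { base: env.PR_BASE as string, head: env.PR_HEAD as string, prNumber };
}

// In the real entry point this would be readReviewConfig(process.env).
const config = readReviewConfig({ PR_BASE: "abc123", PR_HEAD: "def456", PR_NUMBER: "17" });
```

Failing before the first model call keeps a misconfigured job from spending tokens on a run that cannot post its result.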

Step 5: Observability

You can't improve what you can't see. Every agentic workflow needs structured logging from day one. At minimum, log the step name, model, token counts, tool calls, and wall-clock time per step.

interface StepTrace {
  step: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  toolCalls: { name: string; durationMs: number }[];
  totalDurationMs: number;
  success: boolean;
}

const traces: StepTrace[] = [];

Write traces to a JSON file per run and post a summary to your PR as a comment. It sounds like overhead, but the first time you debug a workflow that failed on step 3 of a 5-step chain, you'll appreciate having the full trace.
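A minimal sketch of how traces might be collected, assuming a hypothetical `traceStep` wrapper and a hypothetical `StepOutcome` object reported by each step. In a real run the token counts would come from the `usage` field of each API response; here the caller supplies them.

```typescript
interface StepTrace {
  step: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  toolCalls: { name: string; durationMs: number }[];
  totalDurationMs: number;
  success: boolean;
}

// Hypothetical shape a step reports back once it finishes.
interface StepOutcome {
  model: string;
  inputTokens: number;
  outputTokens: number;
  toolCalls: { name: string; durationMs: number }[];
}

const traces: StepTrace[] = [];

// Times a step and records a trace whether it succeeds or throws.
async function traceStep(step: string, fn: () => Promise<StepOutcome>): Promise<void> {
  const started = Date.now();
  try {
    const outcome = await fn();
    traces.push({ step, ...outcome, totalDurationMs: Date.now() - started, success: true });
  } catch (err) {
    traces.push({
      step,
      model: "unknown",
      inputTokens: 0,
      outputTokens: 0,
      toolCalls: [],
      totalDurationMs: Date.now() - started,
      success: false,
    });
    throw err; // the orchestrator still decides whether to retry or abort
  }
}

// One JSON document per run, ready to write to disk or summarize in a comment.
function serializeTraces(runId: string): string {
  return JSON.stringify({ runId, traces }, null, 2);
}
```

Recording the failed attempt before rethrowing means the trace file still shows which step broke, even when the workflow aborts.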

For teams running many agent workflows, a logging aggregator like Datadog or a simple Postgres table beats flat files quickly. The Laxaar team uses a Postgres table with a fixed schema per project: step name, run ID, timestamp, token counts, and success flag. That covers 90% of debugging needs without adding infrastructure complexity.

Common pitfalls

Skipping verification conditions. It's tempting to trust that if the agent says a step succeeded, it did. Don't. Add verify functions to every step that has a checkable output, especially steps that touch external APIs or run tests.

One massive system prompt for all steps. If you inject the entire workflow spec into every step's system prompt, the agent spends token budget processing context it doesn't need. Pass only the current step's instructions and the workflow log.

No retry budget. Network calls fail. Tests have flakes. Models occasionally call the wrong tool. Build in at least one retry per step, with exponential backoff on HTTP errors. A workflow that fails on a transient network blip isn't useful.
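A retry budget with exponential backoff can be sketched in a few lines. The `withBackoff` helper and its options are hypothetical names; a production version would likely add jitter and retry only on status codes known to be transient.

```typescript
interface RetryOptions {
  retries: number; // retries allowed after the first attempt
  baseDelayMs: number; // delay before the first retry
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries fn, doubling the delay each time: baseDelayMs, then 2x, then 4x, ...
async function withBackoff<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is coming.
      if (attempt < opts.retries) {
        await sleep(opts.baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

A tool handler that calls an external API could wrap its request as `withBackoff(() => callApi(), { retries: 2, baseDelayMs: 500 })`, where `callApi` stands in for whatever request the handler makes.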

Agent-edited files not committed. If the workflow modifies source files, you need an explicit step to commit and push those changes. The agent doesn't know your Git workflow. Add a final git commit && git push step, and give the job a token with push access (for GITHUB_TOKEN, that means contents: write).

Frequently Asked Questions

How do I decide which steps to automate vs. keep human-in-the-loop?

Automate steps that are deterministic, reversible, and well-defined. Keep humans in the loop for irreversible actions (deploying to production, sending emails), ambiguous requirements, and decisions that require business context the agent doesn't have. When in doubt, add a human gate. You can always remove it once you've validated the agent's judgment.

What happens if an agent modifies a file incorrectly and commits it?

This is why you want the agent to work on a branch, not main. Always configure your CI workflow to run the agent on a feature branch and open a PR for human review before merging. Treat agent-generated code the same as intern-generated code: review it before it ships.

How do I handle secrets in agentic workflows?

Never inject secrets as tool arguments or into prompts. Use environment variables and access them in tool handlers, the same way you would in any Node.js application. Make sure your agent's system prompt doesn't instruct it to print or log secrets.

Can I run multiple agents in parallel for different steps?

Yes, for steps with no data dependency between them. Steps 2 and 3 in the PR review example (analyze files, run tests) can run in parallel since they both read from the diff but don't depend on each other's output. Use Promise.all to fan out parallel steps and collect results before the next sequential step.
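The fan-out can be sketched as follows, with stub functions standing in for the real agent steps. `analyzeFiles`, `runAffectedTests`, and `reviewInParallel` are hypothetical names, and the stub logic exists only to make the example runnable.

```typescript
type Finding = {
  file: string;
  line: number;
  severity: "error" | "warning" | "info";
  message: string;
};

// Stub for the analysis step: reads the diff, returns findings.
async function analyzeFiles(diff: string): Promise<Finding[]> {
  const findings: Finding[] = [];
  if (diff.indexOf("TODO") !== -1) {
    findings.push({ file: "app.ts", line: 1, severity: "info", message: "Leftover TODO" });
  }
  return findings;
}

// Stub for the test step: also reads the diff, independent of the findings.
async function runAffectedTests(diff: string): Promise<{ passed: boolean }> {
  return { passed: diff.length > 0 };
}

// Fan out the two independent steps, then join before the sequential post step.
async function reviewInParallel(diff: string) {
  const [findings, tests] = await Promise.all([
    analyzeFiles(diff),
    runAffectedTests(diff),
  ]);
  return { findings, tests };
}
```

Promise.all rejects as soon as either step fails. If the review comment should still report whichever step did finish, Promise.allSettled is the alternative to consider.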

How much does an agentic workflow cost per run?

It depends heavily on model, task complexity, and number of tool-call rounds. A PR review workflow using Claude Sonnet 4.5 typically uses 5,000–20,000 tokens per run, costing $0.02–$0.08. Multiply by your PR volume to get monthly estimates. Caching the system prompt with Anthropic's prompt caching feature cuts the cost of the cached input by ~90% on repeat reads.
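A back-of-the-envelope estimate can be written as a small function. The prices below ($3 per million input tokens, $15 per million output tokens, cache reads at 10% of the input price) are assumptions to check against current pricing, the token split and monthly PR volume are made up, and the sketch ignores the surcharge on cache writes.

```typescript
interface Pricing {
  inputPerMTok: number; // USD per million uncached input tokens
  outputPerMTok: number; // USD per million output tokens
  cacheReadMultiplier: number; // fraction of the input price paid for cache reads
}

// Assumed prices; verify before relying on the numbers.
const ASSUMED_PRICING: Pricing = {
  inputPerMTok: 3,
  outputPerMTok: 15,
  cacheReadMultiplier: 0.1,
};

function estimateRunCostUsd(
  usage: { inputTokens: number; cachedInputTokens: number; outputTokens: number },
  pricing: Pricing = ASSUMED_PRICING,
): number {
  const input = (usage.inputTokens * pricing.inputPerMTok) / 1_000_000;
  const cached =
    (usage.cachedInputTokens * pricing.inputPerMTok * pricing.cacheReadMultiplier) / 1_000_000;
  const output = (usage.outputTokens * pricing.outputPerMTok) / 1_000_000;
  return input + cached + output;
}

// Hypothetical run: 12,000 uncached input tokens and 1,500 output tokens.
const perRun = estimateRunCostUsd({ inputTokens: 12_000, cachedInputTokens: 0, outputTokens: 1_500 });
const perMonth = perRun * 400; // hypothetical volume of 400 PRs per month
```

Under these assumptions a run costs about $0.06, or roughly $23 a month at 400 PRs. Output tokens are a small share of the count but a large share of the bill.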

The Laxaar team designs and builds agentic workflows for engineering teams who want AI in their development pipeline without the reliability risk.