Prompt Engineering for Engineers

Most prompt engineering advice is written for people who want to chat with an AI more effectively. That's not what we're doing here. When you're building a production system that calls an LLM hundreds of thousands of times a day, prompt engineering is a software discipline with real failure modes, testable hypotheses, and measurable outcomes.

The problem is that most teams treat prompts as afterthoughts: written once, tweaked manually when something breaks, never tested systematically. That approach works fine during prototyping. It falls apart when you need consistent output quality at scale.

At Laxaar, prompt engineering is part of our standard development workflow for any LLM-backed feature. This guide covers the techniques that actually move the needle, with working code and the trade-offs we've learned to care about.

What you'll learn

Why prompts are code, not configuration
Structuring instructions for reliable output
Few-shot examples: when they help and when they don't
Chain-of-thought and reasoning traces
Getting structured output right
System prompt vs. user message: what actually matters
Testing and versioning your prompts
Frequently Asked Questions

Why prompts are code, not configuration

A prompt is a specification. It tells the model what task to perform, what constraints to respect, what format to produce. When the spec is ambiguous, the model fills in the gaps, and it won't fill them in the way you expect.

The same engineering practices that apply to code apply to prompts: version control, testing, review, and iteration based on measured outcomes. Teams that skip this step spend weeks debugging "the model does weird things sometimes" without ever identifying the root cause. Nine times out of ten, the root cause is an underspecified prompt.

Concretely: your prompts belong in source control, ideally in a dedicated directory with clear naming (prompts/extraction/invoice-v3.txt). Every change should be reviewed. Every change should be evaluated against a test set before it ships. We'll get to the evaluation tooling in LLM evaluation systems that catch regressions.

Structuring instructions for reliable output

The single biggest lever in prompt engineering is instruction clarity. Models follow explicit instructions more reliably than they infer intent from examples or context.

A few patterns that work consistently:

Lead with the task, not the context. State what you want in the first sentence. Context and constraints follow.

Use numbered steps for multi-part tasks. "Do X, then Y, then Z" in prose form is ambiguous; the model may skip steps or reorder them. An explicit numbered list forces sequential execution.

Name negative constraints explicitly. "Don't include information not present in the source" is more effective than hoping the model infers it from "be accurate."

# Poor: implicit, context-first
system_prompt = """
You are a helpful assistant that reads documents.
When given a contract, you help lawyers understand it.
Please be concise and accurate.
"""

# Better: task-first, explicit constraints
system_prompt = """
You are a contract analysis assistant.

Task: Extract the key terms listed below from the provided contract text.

Rules:
1. Return only information explicitly stated in the contract.
2. If a field is not present, return null — do not infer or estimate.
3. Use the exact terminology from the contract, not paraphrases.
4. Do not include commentary or explanation.

Fields to extract:
- party_names (list of strings)
- effective_date (ISO 8601 date or null)
- termination_clause (string or null)
- governing_law (string or null)
"""

The difference isn't subtle. The second prompt will produce consistent, parseable output. The first will produce helpful-sounding text that varies in structure every time.

Few-shot examples: when they help and when they don't

Few-shot prompting means including examples of correct input-output pairs in the prompt. It's one of the oldest prompt engineering techniques, and it still works, but the conditions where it helps are specific.

Few-shot examples are most valuable when:

The task has a non-obvious output format that's hard to specify in prose
The desired behavior involves subtle distinctions (e.g., sentiment that's sarcastic vs. genuinely positive)
You're working with a smaller or older model that needs more scaffolding

They're often not worth the token cost when:

The task is straightforward and well-represented in training data
You have a good structured output schema (the schema is itself an example)
The examples you have aren't representative enough of real inputs

from anthropic import Anthropic

client = Anthropic()

# Few-shot for a nuanced classification task
def classify_support_ticket(ticket_text: str) -> dict:
    examples = [
        {
            "role": "user",
            "content": "My dashboard won't load after the update yesterday."
        },
        {
            "role": "assistant",
            "content": '{"category": "bug", "urgency": "high", "product_area": "dashboard"}'
        },
        {
            "role": "user",
            "content": "How do I export my data to CSV?"
        },
        {
            "role": "assistant",
            "content": '{"category": "how-to", "urgency": "low", "product_area": "data-export"}'
        },
    ]

    messages = examples + [{"role": "user", "content": ticket_text}]

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        system='Classify support tickets. Return JSON only — no explanation.',
        messages=messages,
    )
    return response.content[0].text

One thing that consistently surprises teams: example quality matters more than example quantity. Three well-chosen, diverse examples outperform ten similar ones. If all your examples show the easy cases, the model will underperform on the hard ones.

Chain-of-thought and reasoning traces

Chain-of-thought (CoT) prompting asks the model to show its reasoning before producing a final answer. The reasoning trace isn't just for interpretability; it measurably improves accuracy on tasks that require multiple logical steps.

The mechanism is simple: models that generate reasoning tokens are effectively conditioning each subsequent token on a chain of correct intermediate steps. Without the trace, the model jumps straight to an answer, which on complex tasks means the probability of a correct answer is much lower.

# Zero-shot CoT — works for many reasoning tasks
system = """
You are a data analyst assistant.

When given a question that requires calculation or multi-step reasoning:
1. Work through your reasoning step by step, labeled as "Reasoning:"
2. State your final answer labeled as "Answer:"

Do not skip the reasoning section.
"""

# Structured CoT output for programmatic parsing
system_structured = """
You are a data analyst assistant. Respond in this exact JSON format:
{
  "reasoning": "<your step-by-step analysis>",
  "answer": "<your final answer>",
  "confidence": "high" | "medium" | "low"
}
"""

A note on when to skip CoT: for simple lookup or classification tasks, CoT adds tokens without improving quality. It's worth the cost for tasks involving arithmetic, multi-hop reasoning, or decisions with competing constraints. If your task is "extract the date from this sentence," don't ask for reasoning. You're just burning tokens.

Extended thinking modes (Claude's extended thinking, OpenAI o-series models) take CoT further by allocating a reasoning budget before generating the response. They're significantly better on hard problems, but the cost and latency jump is real. Benchmark them on your specific task before committing.

Getting structured output right

Production systems rarely want free-form prose. They want JSON, or a typed object, or a predictable format they can parse and pass downstream. Getting this reliably is a solved problem, if you use the right tools.

The two main approaches are constrained decoding (the model's output is constrained at the token level to match a schema) and instruction-based formatting (you tell the model to produce JSON and validate the result). Constrained decoding is more reliable; instruction-based is more portable.

# Constrained output with OpenAI's structured outputs (Pydantic schema)
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional

client = OpenAI()

class ContractSummary(BaseModel):
    party_names: list[str]
    effective_date: Optional[str]
    total_value: Optional[float]
    currency: Optional[str]
    termination_notice_days: Optional[int]

def extract_contract_summary(contract_text: str) -> ContractSummary:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract structured data from the contract."},
            {"role": "user", "content": contract_text},
        ],
        response_format=ContractSummary,
    )
    return response.choices[0].message.parsed

# Anthropic approach: tool use as a structured output mechanism
import anthropic
import json

client = anthropic.Anthropic()

extraction_tool = {
    "name": "extract_contract_data",
    "description": "Extract structured fields from a contract.",
    "input_schema": {
        "type": "object",
        "properties": {
            "party_names": {"type": "array", "items": {"type": "string"}},
            "effective_date": {"type": "string", "nullable": True},
            "total_value": {"type": "number", "nullable": True},
        },
        "required": ["party_names"],
    },
}

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=[extraction_tool],
    tool_choice={"type": "tool", "name": "extract_contract_data"},
    messages=[{"role": "user", "content": contract_text}],
)

result = json.loads(response.content[0].input)

One opinionated take: don't ask the model to produce JSON and then try to parse whatever it returns. Use native structured output APIs when available. The reliability difference is not marginal. It's the difference between a 98% parse success rate and 100%.

System prompt vs. user message: what actually matters

Engineers new to LLM APIs sometimes treat the system prompt as a fixed configuration layer and the user message as the variable input. That's a reasonable starting model, but it's not the whole picture.

The distinction that matters operationally:

	System prompt	User message
Caching	Cached across requests (saves tokens + cost)	Typically not cached
Instruction precedence	Generally higher — model follows system over user when they conflict	User instructions can override system in some edge cases
Persona / tone	Set here	Shouldn't repeat
Task-specific context	Works well here for fixed context	Works well here for per-request context
Examples (few-shot)	Works here for shared examples	Works here for dynamic or per-request examples

Prompt caching is a real cost lever. With the Anthropic API, cache hits on the system prompt cost about 10% of the full input token price. For high-volume systems, keeping a stable, long system prompt and marking it for caching can cut your inference cost by 30–50%.

# Prompt caching with Anthropic (claude-opus-4-5 supports up to 4 cache breakpoints)
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_stable_system_prompt,
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": per_request_user_message}],
)

Testing and versioning your prompts

Prompt engineering without evaluation is guesswork. You need a test set, a metric, and a workflow for comparing prompt versions. This is where most teams underinvest.

The minimum viable setup:

A golden dataset — 50–200 examples with known correct outputs, covering both typical and edge cases.
An evaluator — a function (or another LLM call) that scores model output against the expected output.
A comparison workflow — run candidate prompt A and prompt B on the full dataset, compare aggregate scores.

# Minimal prompt evaluation loop
import asyncio
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_output: str
    metadata: dict

async def evaluate_prompt(
    prompt: str,
    cases: list[EvalCase],
    model: str = "claude-opus-4-5",
) -> dict:
    results = []
    for case in cases:
        response = client.messages.create(
            model=model,
            max_tokens=512,
            system=prompt,
            messages=[{"role": "user", "content": case.input}],
        )
        actual = response.content[0].text
        # Score with a simple exact-match or LLM-based judge
        score = score_output(actual, case.expected_output)
        results.append({"score": score, "actual": actual, "expected": case.expected_output})

    return {
        "mean_score": sum(r["score"] for r in results) / len(results),
        "n": len(results),
        "results": results,
    }

For a deeper look at evaluation infrastructure (including LLM-as-judge, regression detection, and CI integration) see LLM evaluation systems that catch regressions.

Our generative AI expertise page has more on how Laxaar approaches production prompt pipelines.

Frequently Asked Questions

How long should a system prompt be?

There's no universal answer, but a useful heuristic: include everything the model needs to do the task correctly, and nothing it doesn't. We've seen effective system prompts range from 50 words to 3,000 words depending on task complexity. Long prompts aren't inherently bad; ambiguous prompts are. If you're including content "just in case," trim it. Every token in the prompt competes with every token in the response.

Do newer models need less prompt engineering?

Yes and no. Frontier models (GPT-4o, Claude Opus 4, Gemini 1.5 Pro) follow instructions more reliably and require less hand-holding for well-scoped tasks. But the ceiling on task complexity rises too. The tasks you'd actually use a frontier model for are harder, and harder tasks still benefit from careful prompting. The techniques in this guide apply across model generations; only the amount of scaffolding required changes.

Should I use XML tags or JSON structure inside prompts?

For Anthropic's Claude, XML tags work well as structural delimiters. Claude's training included a lot of XML-tagged content and it parses them reliably. For OpenAI models, markdown headers and numbered lists tend to work better. Neither is universally superior; use what matches the model's training distribution. The important thing is to be consistent within a prompt.

How do I handle prompt injection attacks in a production system?

Prompt injection is a real attack surface. The main mitigations: validate and sanitize user input before inserting it into prompts, use structural delimiters to separate trusted instructions from untrusted content (e.g., wrap user content in XML tags with a clear label), and treat any model output that will be executed or stored as untrusted data. For high-stakes systems, a classifier that flags injection attempts before they reach the main model is worth the added latency.

When should I fine-tune instead of prompt engineering?

Fine-tune when you have a large, high-quality labeled dataset, a stable and well-defined task, and a cost or latency constraint that base models can't meet through prompting alone. For most teams, prompt engineering gets you to 80–90% of fine-tuning quality at a fraction of the operational cost. Start with prompts, build your evaluation set, and treat the eval set as the training data for a fine-tuned model if and when you need to go further.

How do I manage prompts across a large codebase?

Store prompts in a dedicated directory in source control, separate from application code. Use template strings with named variables rather than f-strings scattered across the codebase. Version them with semantic versioning (or at minimum, include a version string in the filename). Add linting for banned phrases if you have style requirements. The teams at Laxaar use a simple prompt registry pattern (a dictionary mapping task names to versioned prompt templates) that makes swapping and evaluating alternatives straightforward.

Ready to build LLM features on a solid engineering foundation? Talk to Laxaar. We design and build production AI systems where prompt quality is measurable, not guesswork.