Context Engineering: Designing What the Model Sees

Spend a week tuning your prompt, get the retrieval wrong, and the model still fails. The quality of an LLM's output is bounded by the quality of its input. Not the model version, not the temperature setting, not the system prompt tone. It's the information the model actually sees at inference time that matters. Context engineering is the practice of deliberately designing that information: what to include, how to structure it, in what order, and what to leave out.

This is different from prompt engineering, though the two overlap. Prompt engineering focuses on the instructions you give the model. Context engineering focuses on the data and documents you bring into the conversation window alongside those instructions. For retrieval-augmented systems, agents, and any LLM feature that processes real-world data, context engineering is where most of the quality work happens.

At Laxaar we've watched teams spend weeks tuning prompts while their retrieval pipeline feeds the model irrelevant, stale, or poorly formatted documents. The prompt won't save you if the context is broken.

What you'll learn

What the context window actually contains
Information density: the most common mistake
Retrieval design: getting the right content in
Context ordering and the primacy-recency effect
Compression techniques for long documents
Dynamic context assembly
Measuring and iterating on context quality
Frequently Asked Questions

What the context window actually contains

The context window is the full text the model processes in a single forward pass. For a typical production LLM call, it contains some or all of the following:

System prompt. Instructions, persona, output format, constraints.
Conversation history. Prior turns, tool call results, assistant messages.
Retrieved documents. Chunks pulled from a vector store or search index.
User input. The current question, task, or message.
Tool schemas. JSON descriptions of available tools (for agent calls).

Each of these competes for the same finite token budget. A 128k-token window sounds generous until you account for a verbose system prompt (2–4k tokens), a dozen tool schemas (3–5k tokens), ten retrieved document chunks at 500 tokens each (5k tokens), and a long conversation history (10–20k tokens). You're at 20–34k tokens before the user has asked anything complex.
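To make that arithmetic concrete, here is a minimal sketch that tallies the upper-end estimates from the paragraph above against a 128k window. The component names are illustrative and do not come from any library.

```python
# Upper-end token estimates from the paragraph above (illustrative figures)
WINDOW = 128_000

components = {
    "system_prompt": 4_000,
    "tool_schemas": 5_000,
    "retrieved_chunks": 10 * 500,
    "conversation_history": 20_000,
}

used = sum(components.values())
remaining = WINDOW - used

for name, tokens in components.items():
    print(f"{name:<22}{tokens:>7,} tokens ({tokens / WINDOW:.1%} of window)")
print(f"total before user input: {used:,} tokens, {remaining:,} left")
```

Running a tally like this per request, with real token counts in place of the estimates, shows which component is eating the budget.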

The constraint isn't just capacity: it's attention. Research on long-context models consistently shows degraded performance on information buried in the middle of a long context. This is the "lost in the middle" problem, and it's real enough to design around.

Information density: the most common mistake

Low information density is the context engineering equivalent of padding a function with dead code. You're consuming token budget without improving the model's ability to complete the task.

Common sources of low-density content:

Boilerplate headers in retrieved documents ("This document is confidential and intended only for…")
Redundant content: multiple retrieved chunks that say the same thing
Unconditional history inclusion: appending every prior conversation turn regardless of relevance
Verbose tool schemas: description fields that repeat the tool name in prose

The fix for most of these is preprocessing. Strip known boilerplate patterns before embedding or before inserting into context. Deduplicate retrieved chunks using exact or near-duplicate detection. Summarize completed conversation segments rather than carrying every token forward.

import re
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Strip common boilerplate patterns before storing or inserting documents
BOILERPLATE_PATTERNS = [
    r"This document is confidential.*?\.",
    r"For internal use only.*?\.",
    r"Page \d+ of \d+",
    r"^DRAFT[^\n]*—[^\n]*\n",  # anchored to a line start so prose that mentions "draft" is left alone
]

def strip_boilerplate(text: str) -> str:
    for pattern in BOILERPLATE_PATTERNS:
        # DOTALL lets a notice wrap across lines; MULTILINE makes ^ match at every line start
        text = re.sub(pattern, "", text, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
    return text.strip()

# Deduplicate retrieved chunks by cosine similarity
def deduplicate_chunks(chunks: list[str], embeddings: np.ndarray, threshold: float = 0.92) -> list[str]:
    """Remove chunks that are too similar to already-selected chunks."""
    selected_indices = []
    selected_embeddings = []

    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        if not selected_embeddings:
            selected_indices.append(i)
            selected_embeddings.append(emb)
            continue

        sims = cosine_similarity([emb], selected_embeddings)[0]
        if max(sims) < threshold:
            selected_indices.append(i)
            selected_embeddings.append(emb)

    return [chunks[i] for i in selected_indices]

A useful mental model: every token in the context should earn its place. If you can remove it without the model needing it, remove it.
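Exact duplicates can be removed cheaply before any embedding comparison runs. The following is a minimal sketch, assuming that differences in case and whitespace should not count as distinct content; `drop_exact_duplicates` is a hypothetical helper name.

```python
import hashlib

def drop_exact_duplicates(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing normalized text."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Lowercase and collapse whitespace so trivial variants hash the same
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

chunks = ["Refunds take 5 days.", "refunds  take 5 days.", "Shipping is free."]
deduped = drop_exact_duplicates(chunks)
print(deduped)
```

Running this pass first means the similarity-based deduplication only has to compare chunks that differ in wording.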

Retrieval design: getting the right content in

For RAG systems and agents that pull from knowledge bases, retrieval is the primary context engineering challenge. The model can only reason about what it receives. If retrieval misses relevant documents, no prompt technique compensates.

Retrieval quality has three dimensions:

Recall: are the relevant documents in the retrieved set?
Precision: are irrelevant documents excluded?
Ranking: are the most relevant documents ranked first?
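These three dimensions map onto standard metrics that can be computed from a labeled eval set. The sketch below assumes each query has a known set of relevant document IDs; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4", "d5"}

print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, k=4))  # 2 of 4 results are relevant
print(reciprocal_rank(retrieved, relevant))      # first relevant hit is at rank 2
```

Averaging the reciprocal rank over all eval queries gives mean reciprocal rank (MRR), a common single-number summary of ranking quality.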

Pure vector search (embedding similarity) favors semantic recall but often sacrifices precision. BM25 keyword search favors precision on exact terms but misses paraphrase matches. Hybrid search, which combines dense vector scores with sparse lexical scores (BM25 or a learned sparse model such as SPLADE), tends to outperform either alone on production workloads.

from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding, SparseTextEmbedding

# Hybrid search with Qdrant (v1.10+) using dense + sparse vectors.
# Assumes the collection defines named vectors "dense" and "sparse".
dense_model = TextEmbedding("BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

client = QdrantClient(url="http://localhost:6333")

def hybrid_search(query: str, collection: str, limit: int = 8) -> list[dict]:
    dense_vector = list(dense_model.embed([query]))[0].tolist()
    sparse_result = list(sparse_model.embed([query]))[0]
    sparse_vector = models.SparseVector(
        indices=sparse_result.indices.tolist(),
        values=sparse_result.values.tolist(),
    )

    results = client.query_points(
        collection_name=collection,
        # Each prefetch pulls 20 candidates from one index; fusion then merges the two lists
        prefetch=[
            models.Prefetch(query=dense_vector, using="dense", limit=20),
            models.Prefetch(query=sparse_vector, using="sparse", limit=20),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
        limit=limit,
    )
    return [{"id": r.id, "score": r.score, "payload": r.payload} for r in results.points]
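Reciprocal Rank Fusion itself is simple enough to sketch in plain Python. The version below scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in. The constant k = 60 is a commonly used default and is an assumption here, not a value taken from the Qdrant call above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses ranks rather than raw scores, it avoids having to normalize dense similarity and sparse scores onto a common scale.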

Beyond retrieval algorithm choice, chunk size and structure matter significantly. Chunks that are too small lose surrounding context; chunks that are too large dilute relevance scores and consume more token budget per retrieved item. We've found 300–600 token chunks with 50-token overlap work well for most document types, but the right size depends on your document structure and query patterns. Test both extremes before settling.
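A fixed-size chunker with overlap can be sketched as a sliding window over a token list. This example assumes the text has already been tokenized; whitespace-style placeholder tokens stand in for real tokenizer output, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Placeholder tokens stand in for real tokenizer output in this sketch
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)
print([len(c) for c in chunks])
```

Sweeping `chunk_size` and `overlap` against an eval set is a straightforward way to run the comparison suggested above.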

Context ordering and the primacy-recency effect

Where information appears in the context window affects how much the model uses it. This is not theoretical. It's a documented empirical behavior of long-context models and has practical consequences for context design.

The pattern: models attend more reliably to content at the beginning and end of the context window than to content in the middle. For long contexts, important information buried in the middle can be effectively ignored.

Position or component	Attention reliability and placement guidance
First 20% of context	High (model attends well)
Middle 60% of context	Degraded (information may be ignored)
Last 20% of context	High (recency effect)
Tool schemas	Place near system prompt (high attention)
Most relevant retrieved chunk	Place first in the document list
User query	Place last (recency effect works in your favor)

The practical rule: put the most important content first or last, never in the middle. For retrieved documents, sort by relevance score descending so the best chunk comes first. For conversation history with a summary, put the summary before the recent turns, not after.

One structural pattern that works well at Laxaar is a "working memory" block placed immediately before the user message. It holds the key facts the model needs for this specific turn: current task state, resolved entities, confirmed constraints. Keep it short (200–400 tokens) and position it at the end of the context, directly ahead of the user query, so it lands fresh in attention.
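One way to apply the first-or-last rule to a list of retrieved chunks is to alternate them between the front and the back, so the weakest chunks land in the middle. This is a sketch of one possible variant, not the plain descending sort described above, and `reorder_for_edges` is a hypothetical helper.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Input is sorted best-first. The output keeps the best chunk first,
    puts the second-best last, and pushes the weakest toward the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]
reordered = reorder_for_edges(ranked)
print(reordered)
```

Whether this beats a simple descending sort depends on the model and the context length, so it is worth testing both on your own eval set.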

Compression techniques for long documents

When source documents are too long to include verbatim, you have three options: chunk and retrieve relevant sections, summarize to a compressed representation, or extract structured fields. Each has trade-offs.

Chunking with retrieval is the most common approach and works well when the relevant information is localized: a specific section of a legal document, a specific step in a technical manual. It fails when the answer requires synthesizing information spread across many sections.

Map-reduce summarization processes each section of a long document independently, then combines the section summaries into a final summary. It's better for synthesis questions but loses fine-grained detail.

Structured extraction converts the document into a typed schema, useful when the same fields are needed repeatedly (e.g., extracting line items from invoices). Once extracted, the structured representation is far more token-efficient than the original document.

from anthropic import Anthropic

client = Anthropic()

def map_reduce_summarize(document: str, chunk_size: int = 4000) -> str:
    """Summarize a long document using map-reduce. chunk_size is counted in words, not tokens."""
    # Split on whitespace into word-based chunks (simple split; use a proper splitter in production)
    words = document.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

    # Map: summarize each chunk
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=512,
            system="Summarize the following document section. Preserve key facts, numbers, and named entities. Be concise.",
            messages=[{"role": "user", "content": chunk}],
        )
        chunk_summaries.append(f"[Section {i+1}]\n{response.content[0].text}")

    # Reduce: combine summaries into a final summary
    combined = "\n\n".join(chunk_summaries)
    final = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system="You are given section summaries from a long document. Synthesize them into a coherent overall summary. Preserve key facts and do not add information not present in the sections.",
        messages=[{"role": "user", "content": combined}],
    )
    return final.content[0].text
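Structured extraction, the third option above, can be illustrated without any model call by looking at the target representation. The `Invoice` and `LineItem` schemas below are hypothetical. In practice a model or parser would populate them from the source document.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float

@dataclass
class Invoice:
    invoice_id: str
    vendor: str
    line_items: list[LineItem]

    def total(self) -> float:
        return sum(item.quantity * item.unit_price for item in self.line_items)

    def to_context(self) -> str:
        """Compact JSON for insertion into the context window."""
        return json.dumps(asdict(self), separators=(",", ":"))

invoice = Invoice(
    invoice_id="INV-001",
    vendor="Acme Corp",
    line_items=[LineItem("Widget", 3, 2.5), LineItem("Gasket", 10, 0.4)],
)
print(invoice.to_context())
print(invoice.total())
```

The compact form carries only the fields later requests need, which is where the token savings over the original document come from.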

For agent systems with long-running tasks, context compression becomes necessary as conversation history grows. Summarizing completed steps into a progress note, then keeping only the note plus the last 3–5 turns in the active context, prevents context overflow without losing task state. This is a pattern worth building early, before you hit context limits in production.
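The progress-note pattern can be sketched as a function that folds older turns into a single note and keeps the most recent turns verbatim. The summarizer is passed in as a callable so that a model call can replace the naive stand-in used here. The note's role and the `[Progress note]` label are assumptions for illustration.

```python
from typing import Callable

def compact_history(
    history: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_last: int = 4,
) -> list[dict]:
    """Replace all but the last `keep_last` turns with one progress note."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(history) <= keep_last:
        return list(history)
    older, recent = history[:-keep_last], history[-keep_last:]
    note = {"role": "user", "content": f"[Progress note]\n{summarize(older)}"}
    return [note] + recent

def naive_summarize(turns: list[dict]) -> str:
    # Stand-in for a model call: keep the first 40 characters of each turn
    return "\n".join(f"- {t['role']}: {t['content'][:40]}" for t in turns)

history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"step {i}"}
    for i in range(10)
]
compacted = compact_history(history, naive_summarize, keep_last=4)
print(len(compacted))
```

Compacting whenever the history crosses a token threshold, rather than on every turn, keeps the number of summarization calls low.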

Dynamic context assembly

Static context, the same documents and instructions for every request, is the simplest approach. It rarely stays adequate. Dynamic context assembly selects and structures context based on the specific request, user, and task state.

A workable assembly pipeline looks like this:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ContextBlock:
    content: str
    token_estimate: int
    priority: int  # Lower number = higher priority; drop high numbers first if budget exceeded

def estimate_tokens(text: str) -> int:
    # Rough estimate: 1 token ≈ 4 characters for English text
    return len(text) // 4

def assemble_context(
    system_prompt: str,
    user_query: str,
    retrieved_docs: list[dict],  # each dict needs "source", "text", and "score" keys
    conversation_history: list[dict],
    working_memory: Optional[str],
    token_budget: int = 90000,
) -> dict:
    """Assemble context blocks in priority order, respecting token budget."""

    # The system prompt and user query are always included, so reserve their tokens first
    remaining = token_budget - estimate_tokens(system_prompt) - estimate_tokens(user_query)

    # Optional blocks are appended in the order they should appear in the prompt
    blocks: list[ContextBlock] = []

    # Retrieved docs, best first; priority falls as relevance falls
    ranked_docs = sorted(retrieved_docs, key=lambda d: d["score"], reverse=True)
    for i, doc in enumerate(ranked_docs):
        content = f"[Source: {doc['source']}]\n{doc['text']}"
        blocks.append(ContextBlock(content, estimate_tokens(content), 2 + i))

    # Recent conversation history (last 10 turns max); older turns are dropped before newer ones
    recent_turns = conversation_history[-10:]
    for i, turn in enumerate(recent_turns):
        content = f"{turn['role']}: {turn['content']}"
        age = len(recent_turns) - 1 - i  # 0 for the newest turn
        blocks.append(ContextBlock(content, estimate_tokens(content), 2 + len(ranked_docs) + age))

    # Working memory (key facts for this turn): highest optional priority, placed just before the query
    if working_memory:
        blocks.append(ContextBlock(working_memory, estimate_tokens(working_memory), 1))

    # Fit within budget: admit blocks from highest to lowest priority
    kept_indices = set()
    for idx in sorted(range(len(blocks)), key=lambda j: blocks[j].priority):
        if blocks[idx].token_estimate <= remaining:
            kept_indices.add(idx)
            remaining -= blocks[idx].token_estimate

    # Reconstruct in logical order (not priority order), with the user query last
    parts = [b.content for j, b in enumerate(blocks) if j in kept_indices] + [user_query]
    return {"system": system_prompt,
            "messages": [{"role": "user", "content": "\n\n".join(parts)}]}

Priority-ordered context blocks with a hard token budget: that's the foundation of most production context management systems Laxaar has built. The exact priorities and budget are tuned per application, but the structure stays the same.

Measuring and iterating on context quality

You can't optimize what you don't measure. Context quality is best measured through its effect on downstream task performance: does better context produce more accurate, more complete model outputs?

The practical measurement loop:

Run a representative eval set against your current context assembly
For failures, inspect the context that was assembled for that request
Identify whether the failure was a retrieval miss, a relevance ranking error, or an information-density problem
Make a targeted change to the assembly pipeline
Re-run the eval set and compare

Logging the assembled context for every request (at least in staging) is essential for debugging. You can't diagnose retrieval problems if you can't see what the model actually received.
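The triage step in the loop above can be partly automated when eval cases carry gold document IDs. The sketch below is a rough heuristic under that assumption; the labels and function names are hypothetical.

```python
import json

def classify_failure(gold_ids: set[str], retrieved_ids: list[str], context_ids: list[str]) -> str:
    """Rough triage for a failed eval case."""
    if not gold_ids & set(retrieved_ids):
        return "retrieval_miss"    # no relevant document was retrieved at all
    if not gold_ids & set(context_ids):
        return "ranking_error"     # retrieved, but ranked too low to reach the context
    return "density_or_other"      # relevant content was present; inspect the assembled context

def log_record(request_id: str, context: str, failure: str) -> str:
    """Serialize one request as a JSON line for later inspection."""
    return json.dumps({"request_id": request_id, "context": context, "failure": failure})

print(classify_failure({"d1"}, ["d3", "d4"], ["d3"]))
print(classify_failure({"d1"}, ["d3", "d1"], ["d3"]))
print(classify_failure({"d1"}, ["d1"], ["d1"]))
print(log_record("req-1", "assembled context text", "ranking_error"))
```

Counting failures per label across an eval run points to the pipeline stage that deserves the next targeted change.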

For context-specific metrics, we track: retrieved chunk relevance (scored by a lightweight judge model), context utilization (does the model actually cite or use the retrieved content?), and context-answer overlap (does the answer contain information not present in the context, which may indicate hallucination?).
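Context-answer overlap can be approximated with a purely lexical check before investing in a judge model. The sketch below measures the share of distinct answer terms that never appear in the context. It is a crude proxy, since paraphrases and stop words both distort it, and the function names are illustrative.

```python
import re

def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_term_ratio(answer: str, context: str) -> float:
    """Share of distinct answer terms that never appear in the context."""
    answer_terms = _terms(answer)
    if not answer_terms:
        return 0.0
    return len(answer_terms - _terms(context)) / len(answer_terms)

context = "The refund window is 30 days from delivery."
grounded = "The refund window is 30 days."
ungrounded = "Refunds are available for 90 days."

print(unsupported_term_ratio(grounded, context))
print(unsupported_term_ratio(ungrounded, context))
```

A high ratio is a prompt to inspect the request, not proof of hallucination, because a correct answer may simply be phrased differently from its source.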

For the evaluation infrastructure that ties this together, see LLM evaluation systems that catch regressions. And for the infrastructure that supports dynamic retrieval at scale, our AI data pipelines expertise covers the storage, indexing, and serving layer.

Frequently Asked Questions

What's the difference between context engineering and RAG?

RAG (Retrieval-Augmented Generation) is a specific architecture where a retrieval step populates the context before generation. Context engineering is the broader practice of designing everything in the context window, including how retrieved content is processed, ordered, and combined with other context components. All RAG systems benefit from context engineering; not all context engineering involves retrieval.

How do I know if my context is too long?

Two signals: task performance degrades on questions that require synthesizing information from multiple parts of the context (lost-in-the-middle failure), and your cost-per-request is higher than expected. Measure actual token usage per request and compare it against task success rates. If there's no correlation between context length and quality, you have room to trim. If quality drops when you trim, the content you're removing was load-bearing.
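Comparing token usage against success rates can start with a simple bucketed table. This sketch assumes you have logged a (token count, success) pair per request; the bucket size and function name are illustrative.

```python
from collections import defaultdict

def success_by_length(records: list[tuple[int, bool]], bucket_size: int = 10_000) -> dict[int, float]:
    """Map each token-count bucket (keyed by its lower bound) to its success rate."""
    totals: dict[int, int] = defaultdict(int)
    wins: dict[int, int] = defaultdict(int)
    for tokens, success in records:
        bucket = (tokens // bucket_size) * bucket_size
        totals[bucket] += 1
        wins[bucket] += int(success)
    return {bucket: wins[bucket] / totals[bucket] for bucket in sorted(totals)}

records = [
    (4_000, True), (8_000, True),
    (12_000, True), (18_000, False),
    (35_000, False), (38_000, True),
]
rates = success_by_length(records)
print(rates)
```

With real traffic, a success rate that stays flat or falls as the buckets grow suggests the extra tokens are not helping.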

Should retrieved documents include their source metadata in context?

Yes, for two reasons. First, source attribution lets you trace hallucinations back to retrieval failures (if the model cites a source that didn't say what the model claims, that's a retrieval or model problem you can debug). Second, source metadata helps the model weight conflicting information. A recent official document should outweigh an older informal one, and the model can make that judgment if metadata is present.
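A small formatting helper keeps source metadata consistent across chunks. The sketch below extends a bracketed `[Source: ...]` header with a document type and publication date. The `doc_type` and `published` fields are assumptions about what your index stores.

```python
def format_chunk(text: str, source: str, doc_type: str, published: str) -> str:
    """Prefix a chunk with a one-line metadata header the model can read."""
    header = f"[Source: {source} | Type: {doc_type} | Published: {published}]"
    return f"{header}\n{text}"

chunk = format_chunk(
    text="Refunds are processed within 5 business days.",
    source="refund-policy.md",
    doc_type="official policy",
    published="2024-03-01",
)
print(chunk)
```

Keeping the header to a single line limits the token overhead per chunk while still giving the model recency and authority cues.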

How do I handle documents in different languages in a multilingual context?

Embed and retrieve in the source language using a multilingual embedding model (e.g., multilingual-e5-large or paraphrase-multilingual-mpnet-base-v2). Avoid translating documents before embedding; translation introduces errors and inflates storage costs. Include the source language as metadata. Most frontier models handle multilingual context natively, but consistency helps: if your system prompt is in English, consider including a brief translated summary of non-English documents alongside the original.

What's the right chunk size for retrieval?

It depends on your document structure and query types. A useful starting experiment: test 256-token, 512-token, and 1024-token chunks on your eval set. Measure recall at each size. Short chunks (256) tend to lose surrounding context; long chunks (1024+) tend to dilute relevance scores. For technical documentation with section-level structure, 400–600 tokens with semantic boundaries (split at headers, not mid-sentence) usually wins. An overlap of 10–15% of chunk size helps prevent boundary artifacts.

Can I use the model itself to decide what context to include?

Yes, and it's a useful pattern for complex routing decisions. A lightweight classifier model (or a fast call to a smaller model) can categorize the query and route it to the appropriate retrieval index or context template. For example: billing questions pull from billing documentation + account data; technical questions pull from product docs + recent changelog. This "meta-routing" step adds latency but can significantly improve context relevance for systems with multiple knowledge domains.
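The routing step can be sketched as a classifier followed by a lookup table. Here a keyword match stands in for the classifier or small-model call, and the index names and keyword list are hypothetical.

```python
import re

# Hypothetical mapping from query category to the indexes it should search
ROUTES = {
    "billing": ["billing_docs", "account_data"],
    "technical": ["product_docs", "changelog"],
}

BILLING_TERMS = {"invoice", "refund", "charge", "billing", "payment"}

def classify_query(query: str) -> str:
    """Keyword stand-in for a classifier or a call to a smaller model."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "billing" if words & BILLING_TERMS else "technical"

def route(query: str) -> list[str]:
    """Return the retrieval indexes to query for this request."""
    return ROUTES[classify_query(query)]

print(route("Why was I charged twice on my invoice?"))
print(route("How do I rotate an API key?"))
```

Swapping `classify_query` for a model call changes only that one function, so the routing table and retrieval code stay the same.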

Context assembly is an engineering problem, not a prompt problem. If you want to build LLM systems that consistently produce high-quality output, talk to the Laxaar team. We design retrieval and context pipelines from the ground up.