Agent Memory Systems Explained

Agents fail in two distinct ways: they make bad decisions, or they forget what they already decided. The second failure mode gets less attention but causes more production incidents. When an agent starts a multi-step task, fills its context window, and then re-derives the same wrong conclusion it corrected three steps ago, that's a memory problem, not a reasoning problem. Agent memory systems are the structures that persist and retrieve information across tool calls, conversation turns, and task sessions. Getting them right is the difference between an agent that completes a 20-step workflow reliably and one that loses the thread after step 8.

At Laxaar we've built agents for document processing, customer support, and research automation. Memory architecture is consistently one of the top three engineering decisions that determines whether a system works in production.

What you'll learn

The four types of agent memory
In-context memory: the simplest form
External memory with vector stores
Episodic memory: learning from past runs
Semantic memory: structured world knowledge
Choosing and combining memory types
Frequently Asked Questions

The four types of agent memory

Agent memory falls into four categories borrowed loosely from cognitive science and adapted for LLM systems:

In-context memory. The token window itself: fast but limited and lost when the session ends.
External memory. A database (usually a vector store) the agent retrieves from via tool calls: persistent but adds latency.
Episodic memory. Records of past agent runs stored and retrieved to inform future runs, giving the agent a history of what worked.
Semantic memory. Structured facts about the world or the domain stored in a retrievable format, separate from the agent's reasoning.

These aren't mutually exclusive. Most production systems combine two or three. The choice is about which information needs to survive what kind of boundary: a tool call, a conversation turn, or a complete session restart.

In-context memory: the simplest form

In-context memory is everything in the model's active token window: the system prompt, conversation history, tool call results, and any data passed directly as text. It requires zero infrastructure and is the default starting point.

# Managing context with LangGraph's message history (v0.2+)
from langchain_core.messages import trim_messages
from langchain_openai import ChatOpenAI
from langgraph.graph import MessagesState

llm = ChatOpenAI(model="gpt-4o")

# Trim context to avoid hitting token limits
def trim_history(messages):
    # Keep the system prompt plus the most recent messages that fit the budget
    return trim_messages(
        messages,
        strategy="last",
        token_counter=llm,
        max_tokens=12000,
        include_system=True,
    )

def agent_node(state: MessagesState):
    # The full message history is in state["messages"],
    # including all prior tool calls and results.
    # Trim at call time rather than in a separate node: MessagesState merges
    # returned messages into the history through its reducer, so returning a
    # trimmed list from a node would not remove the older messages.
    response = llm.invoke(trim_history(state["messages"]))
    return {"messages": [response]}

The token limit is the hard constraint. GPT-4o's 128k context sounds large until you account for a verbose system prompt, tool schemas, and a 30-turn conversation with multi-paragraph tool outputs. In practice, most ReAct agents start hitting degradation around 40–60k tokens of actual content.

Context trimming is the obvious mitigation, but naive trimming (drop the oldest messages) loses information. Smarter approaches summarize completed steps into a compressed "progress note" and keep only the summary plus recent turns. This is worth implementing early. Retrofitting it after you've shipped is painful.
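The progress-note approach can be sketched without any framework. This is a minimal illustration, not a drop-in for LangGraph: the Message class, compact_history, and naive_summarize are hypothetical names, and the summarizer is a placeholder where a real system would call the model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str

def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 6,
) -> list[Message]:
    """Replace older messages with one progress note; keep recent turns verbatim."""
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing worth compacting yet
    split = len(rest) - keep_recent
    older, recent = rest[:split], rest[split:]
    note = Message("system", "Progress so far:\n" + summarize(older))
    return system + [note] + recent

def naive_summarize(older: list[Message]) -> str:
    # Placeholder (assumption): a real system would ask the model to
    # summarize these steps instead of truncating each one.
    return "\n".join(f"- {m.role}: {m.content[:60]}" for m in older)
```

Passing the summarizer in as a function keeps the compaction logic testable and lets you swap the placeholder for a model call later.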

External memory with vector stores

External memory is a vector database the agent queries via a tool call. The agent embeds a query, retrieves semantically similar chunks, and includes them in its context only when needed. Information persists across sessions because it lives outside the token window.

# Vector store retrieval tool with LangChain + Chroma (v0.2+)
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.tools import tool

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="agent_knowledge",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

@tool
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base for relevant information."""
    docs = vectorstore.similarity_search(query, k=4)
    return "\n\n".join([
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    ])

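# Writing to memory after a completed task
def store_task_result(task_id: str, result: str, metadata: dict):
    # Expects metadata["ts"] (a timestamp); other keys are stored as-is.
    # Explicit fields come last so caller metadata cannot overwrite them.
    vectorstore.add_texts(
        texts=[result],
        metadatas=[{**metadata, "task_id": task_id, "timestamp": metadata["ts"]}],
    )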

Retrieval quality determines how useful external memory is. Embedding distance is a blunt instrument: it finds semantically similar text, but "similar" doesn't always mean "relevant to this task right now." Hybrid search (dense embeddings + BM25 keyword matching) consistently outperforms pure vector search in our deployments. Pinecone, Weaviate, and Qdrant all support hybrid search natively.
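One common way to combine a dense ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique, not a description of how Pinecone, Weaviate, or Qdrant implement hybrid search internally. The document IDs are made up, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; earlier ranks contribute more."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on doc ID for stable output
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical vector-search order
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

Documents that appear near the top of both lists (doc_a and doc_c here) rise above documents that only one retriever found, which is the behavior you want when embedding similarity and keyword overlap disagree.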

The write side gets less attention than it deserves. What you store, how you chunk it, and what metadata you attach determines whether retrieval works at all. Time-decay metadata (so stale results score lower), source attribution, and structured tags for task type all pay off.
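As a rough sketch of the write side, the function below splits text into overlapping chunks and attaches the metadata mentioned above to each one. The name build_records, the character-based chunk sizes, and the metadata keys are all assumptions for illustration; production systems usually chunk by tokens or by document structure.

```python
import time

def build_records(text: str, source: str, task_type: str,
                  chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping chunks, each carrying retrieval metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    created_at = time.time()  # stored so retrieval can down-weight stale chunks
    step = chunk_size - overlap
    records = []
    for index, start in enumerate(range(0, len(text), step)):
        records.append({
            "text": text[start:start + chunk_size],
            "metadata": {
                "source": source,          # attribution shown to the agent
                "task_type": task_type,    # structured tag for filtering
                "created_at": created_at,
                "chunk_index": index,
            },
        })
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return records
```

Each record's text and metadata can then be passed to whatever add_texts-style call your vector store exposes.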

Episodic memory: learning from past runs

Episodic memory stores records of past agent runs (the goal, the steps taken, the tools called, the outcome) so future runs can retrieve and learn from them. It's the mechanism that lets an agent get better at a task over repeated attempts without retraining the model.

# Storing and retrieving episodic memories
from dataclasses import dataclass
from datetime import datetime
import json

@dataclass
class Episode:
    task_description: str
    steps: list[dict]
    outcome: str  # "success" | "failure" | "partial"
    error_encountered: str | None
    duration_seconds: float
    timestamp: str

def store_episode(episode: Episode, vectorstore):
    # Create a searchable summary of what happened
    summary = f"""
Task: {episode.task_description}
Outcome: {episode.outcome}
Steps taken: {len(episode.steps)}
Key actions: {', '.join(s['tool'] for s in episode.steps if 'tool' in s)}
Error: {episode.error_encountered or 'none'}
    """.strip()
    
    vectorstore.add_texts(
        texts=[summary],
        metadatas=[{
            "type": "episode",
            "outcome": episode.outcome,
            "timestamp": episode.timestamp,
            "full_episode": json.dumps(episode.__dict__),
        }]
    )

def retrieve_relevant_episodes(task: str, vectorstore, k: int = 3) -> list[Episode]:
    docs = vectorstore.similarity_search(
        task,
        k=k,
        filter={"type": "episode"},
    )
    return [Episode(**json.loads(d.metadata["full_episode"])) for d in docs]

The system prompt for an agent with episodic memory includes a compressed version of retrieved episodes: "In a similar past task you tried X and it failed because Y. In another run you succeeded by doing Z first." In our production systems at Laxaar, adding this context is the single change that most reliably improves success rates on repeated task types. No retraining. Just retrieval.
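Here is one way the compressed version might be rendered. PastRun is a simplified, hypothetical stand-in for a retrieved episode so the example runs on its own, and the output format is an assumption, not the exact prompt text used in any particular system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PastRun:
    task: str
    outcome: str                  # "success" | "failure" | "partial"
    key_actions: list[str]
    error: Optional[str] = None

def format_lessons(runs: list[PastRun], max_runs: int = 3) -> str:
    """Render retrieved runs as a short note for the system prompt."""
    if not runs:
        return ""
    lines = ["Lessons from similar past tasks:"]
    for run in runs[:max_runs]:
        actions = " -> ".join(run.key_actions) or "no tool calls"
        line = f"- [{run.outcome}] {run.task}: {actions}"
        if run.error:
            line += f" (error: {run.error})"
        lines.append(line)
    return "\n".join(lines)
```

Capping the number of runs keeps the note small, so episodic context does not crowd out the active task in the token window.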

Episodic memory requires discipline about what you store. Failed runs are often more valuable than successful ones (they document what not to do), but you need to store enough context to make the failure legible to a future model instance.

Semantic memory: structured world knowledge

Semantic memory is structured, factual knowledge about the domain. Not a record of what the agent did, but what it knows. Think of it as the agent's reference library: product catalogs, API schemas, user profiles, configuration data.

The implementation varies widely. Simple cases use a key-value store with direct lookup. More complex cases use a graph database (Neo4j is common) to capture relationships between entities. The agent gets a tool to query the semantic store, and the returned facts supplement its in-context reasoning.

# Simple semantic memory with Redis for fast lookup
import json
from typing import Any

import redis
from langchain.tools import tool

class SemanticMemory:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)
    
    def store_fact(self, entity: str, attribute: str, value: Any, ttl_seconds: int | None = None):
        key = f"semantic:{entity}:{attribute}"
        # ex=None stores the fact without an expiry
        self.client.set(key, json.dumps(value), ex=ttl_seconds)
    
    def get_fact(self, entity: str, attribute: str) -> Any:
        key = f"semantic:{entity}:{attribute}"
        raw = self.client.get(key)
        return json.loads(raw) if raw is not None else None
    
    def get_all_facts(self, entity: str) -> dict:
        prefix = f"semantic:{entity}:"
        facts = {}
        # SCAN iterates incrementally; KEYS blocks Redis while it walks the whole keyspace
        for key in self.client.scan_iter(match=f"{prefix}*"):
            raw = self.client.get(key)
            if raw is not None:  # the key may have expired since the scan saw it
                facts[key.decode()[len(prefix):]] = json.loads(raw)
        return facts

# Agent tool wrapping semantic memory
@tool
def get_user_preferences(user_id: str) -> str:
    """Retrieve stored preferences and facts about a user."""
    memory = SemanticMemory(redis_url="redis://localhost:6379")
    facts = memory.get_all_facts(f"user:{user_id}")
    return json.dumps(facts) if facts else "No stored preferences found."

The key design question for semantic memory is freshness. Facts go stale (product prices change, API schemas get updated, user preferences evolve). TTL-based expiration is the simplest control; event-driven invalidation (update semantic memory when the source of truth changes) is more correct but requires more plumbing.
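The two freshness controls can be compared side by side in a small in-memory sketch. FactCache and on_source_changed are hypothetical names, and the dictionary stands in for Redis so the example runs without a server. In a real deployment the event handler would be wired to whatever change notifications your source of truth emits.

```python
import time

class FactCache:
    """In-memory stand-in for a semantic store with TTL and explicit invalidation."""

    def __init__(self, clock=time.monotonic):
        self._facts = {}      # key -> (value, expires_at or None)
        self._clock = clock   # injectable so expiry can be tested without sleeping

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._facts[key] = (value, expires_at)

    def get(self, key):
        entry = self._facts.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._facts[key]   # lazily drop expired facts on read
            return None
        return value

    def on_source_changed(self, key, new_value=None, ttl_seconds=None):
        """Event handler: call when the source of truth changes."""
        if new_value is None:
            self._facts.pop(key, None)   # invalidate; next read misses
        else:
            self.set(key, new_value, ttl_seconds)
```

TTL bounds how stale a fact can get; the event handler removes the staleness window entirely for facts whose source can notify you.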

Choosing and combining memory types

No single memory type handles all cases. Here's how to think about which to reach for:

Memory type	Persists across sessions	Latency	Scales to	Best for
In-context	No	None	~100k tokens	Short tasks, conversation state
External (vector)	Yes	50–200ms	Millions of docs	Knowledge retrieval, long-term facts
Episodic	Yes	50–200ms	Thousands of runs	Task improvement, failure avoidance
Semantic	Yes	under 10ms (cache)	Domain-dependent	Structured facts, entity data

The pattern we reach for most at Laxaar is in-context + external vector. In-context handles the active task state; the vector store handles persistent knowledge and retrieval. Add episodic memory when the agent runs the same class of task repeatedly and you want it to improve. Add semantic memory when you have structured domain data that doesn't fit well in a vector store.

One opinionated take worth stating: episodic memory is underused relative to its value. Most teams build retrieval pipelines for documents but never think to store and retrieve records of their agent's own past runs. It's low-infrastructure (the same vector store you're already using) and produces measurable improvement in success rates on repeated task types.

For the architectural context around these memory choices, see AI agent architectures compared. Our AI data pipelines expertise covers the infrastructure side: how to build reliable ingestion and retrieval pipelines for production agent systems.

Learn more about how Laxaar approaches memory design in full agent deployments on our AI agents page.

Frequently Asked Questions

How much context window do I actually need for a production agent?

It depends on your tool output sizes. A conservative rule: budget 2k tokens for the system prompt and tool schemas, 1–2k per tool call result, and 500 tokens per conversation turn. By that rule a 10-step, 10-turn task lands around 17–27k tokens, and verbose tool outputs that run past 2k each can easily push it to 30–40k. Start with GPT-4o's 128k window, add context trimming, and measure actual usage in your specific workload before optimizing further.
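The budgeting rule is simple enough to encode as a quick estimate. The function name and defaults below are illustrative; the defaults use the upper end of the per-result range, and the turn count is an assumption you should replace with your own.

```python
def estimate_context_tokens(steps: int, turns: int,
                            fixed_overhead: int = 2_000,
                            tokens_per_tool_result: int = 2_000,
                            tokens_per_turn: int = 500) -> int:
    """Rough context estimate: fixed overhead + tool results + conversation turns."""
    return fixed_overhead + steps * tokens_per_tool_result + turns * tokens_per_turn

typical = estimate_context_tokens(steps=10, turns=10)
verbose = estimate_context_tokens(steps=10, turns=10, tokens_per_tool_result=3_000)
```

With these inputs, typical comes to 27,000 tokens and verbose to 37,000, so a modest increase in tool output size is enough to move a 10-step task into the 30–40k range.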

When should I use a vector store vs. a traditional database for agent memory?

Use a vector store when retrieval is semantic: find information related to X. Use a traditional database (Postgres, Redis) when retrieval is exact: get the record for user ID 12345. Most production systems use both: a vector store for knowledge retrieval and a relational or key-value store for structured entity data and session state.

Can agents write to their own memory during a task?

Yes, and it's a useful pattern. An agent can call a store_memory tool during execution to checkpoint progress, record intermediate conclusions, or flag information for future runs. The risk is unbounded writes: an agent that writes aggressively can pollute its own knowledge base with noise. Use structured schemas for writes and consider a review step before episodic memories are promoted to the retrieval pool.
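A minimal sketch of guarded writes, assuming a fixed set of write kinds, a per-task cap, and a pending queue that a reviewer (human or automated) promotes from. MemoryWrite, MemoryGate, and the limits are hypothetical; the point is the shape of the checks, not the specific thresholds.

```python
from dataclasses import dataclass, field

ALLOWED_KINDS = {"checkpoint", "conclusion", "flag"}   # hypothetical categories

@dataclass
class MemoryWrite:
    kind: str
    content: str
    task_id: str

@dataclass
class MemoryGate:
    max_writes_per_task: int = 5
    max_chars: int = 500
    pending: list = field(default_factory=list)    # awaiting review
    approved: list = field(default_factory=list)   # visible to retrieval

    def submit(self, write: MemoryWrite) -> bool:
        """Queue a write for review if it passes schema and volume checks."""
        if write.kind not in ALLOWED_KINDS:
            return False
        if not write.content.strip() or len(write.content) > self.max_chars:
            return False
        count = sum(1 for w in self.pending + self.approved
                    if w.task_id == write.task_id)
        if count >= self.max_writes_per_task:
            return False   # cap writes so one task cannot flood the store
        self.pending.append(write)
        return True

    def approve(self, index: int) -> None:
        """Promote a reviewed write to the retrieval pool."""
        self.approved.append(self.pending.pop(index))
```

The store_memory tool would call submit and report the boolean back to the agent, so a rejected write is visible in the trace instead of failing silently.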

How do I prevent the agent from retrieving outdated information?

Two mechanisms work well together: TTL-based expiration (facts expire after a defined period) and recency scoring (recent documents get a retrieval score boost). Most vector stores support metadata filtering. Store a created_at timestamp and filter out documents older than your freshness threshold. For critical facts, event-driven invalidation tied to your source-of-truth system is the most reliable option.
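The filter-then-boost combination can be sketched as a reranking step applied to whatever the vector store returns. The function name, the 90-day threshold, and the 30-day half-life are assumptions for illustration, and the exponential decay is one of several reasonable weighting schemes.

```python
def freshness_rerank(hits, now, max_age_days=90.0, half_life_days=30.0):
    """Drop stale hits, then scale similarity by an exponential recency weight.

    hits: list of (doc_id, similarity, created_at), created_at in epoch seconds.
    """
    seconds_per_day = 86_400.0
    ranked = []
    for doc_id, similarity, created_at in hits:
        age_days = (now - created_at) / seconds_per_day
        if age_days > max_age_days:
            continue  # older than the freshness threshold
        weight = 0.5 ** (age_days / half_life_days)  # halves every half-life
        ranked.append((doc_id, similarity * weight))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

In practice the age filter is better pushed into the store's metadata filter so stale documents never leave the database; the recency weight is then applied to the smaller result set.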

What's the difference between episodic memory and fine-tuning?

Episodic memory retrieves examples of past runs at inference time and includes them in the prompt. Fine-tuning bakes knowledge into the model weights. Episodic memory is faster to update (add a new episode, it's immediately available), cheaper, and more interpretable. Fine-tuning produces faster inference and handles very high-frequency patterns better. For most agent use cases, episodic memory is the right starting point. Fine-tune only when you have high-quality, stable training data and the inference speed gain justifies the cost.

Does context window size affect which memory architecture to use?

Directly. Larger context windows make in-context memory more attractive; you can include more history before trimming. But cost scales with tokens too, so a 128k context used carelessly can be expensive at volume. External memory lets you keep the active context lean by pulling in only what's relevant. We generally recommend designing for lean context regardless of window size. It's cheaper and produces more focused reasoning.

Need help designing the memory layer for your agent system? Get in touch with Laxaar. We can review your architecture and help you avoid the retrieval anti-patterns that cause failures at scale.