Agent Memory Benchmark Comparison: Evaluating Memory Strategies for AI Agents

The Problem

Every agent project eventually runs into the same wall: which memory strategy should I use? Sliding window is simple but loses early context. Vector retrieval sounds smart but adds latency and cost. Without building a full test harness from scratch, there is no principled way to compare them.

NEO built Agent Memory Benchmark Comparison to give teams a standardized suite for measuring how each memory strategy performs across realistic agent tasks — before they commit to one in production.

Four Memory Strategies Under Test

Agent Memory Benchmark Comparison evaluates four distinct memory architectures against the same workloads. Sliding window keeps the last N turns of conversation in context: fast and free, but lossy when conversations run long. Summarization compresses older history into a rolling summary using a secondary LLM call, trading latency for context density. Vector retrieval encodes all prior turns into embeddings and fetches the top-K most semantically relevant chunks at query time, using FAISS for local vector search. Episodic memory organizes history into discrete episodes (self-contained interaction chunks with metadata) and retrieves entire episodes rather than individual turns.
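
To make the top-K idea concrete, here is a minimal sketch of semantic retrieval over prior turns. It is not the project's implementation: it swaps FAISS for a brute-force cosine-similarity scan, and `embed` is a hypothetical bag-of-words stand-in for a real embedding model. The function names `embed`, `cosine`, and `top_k` are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (hypothetical stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, turns: list[str], k: int = 2) -> list[str]:
    """Return the k stored turns most similar to the query (brute force, no index)."""
    query_vec = embed(query)
    ranked = sorted(turns, key=lambda t: cosine(query_vec, embed(t)), reverse=True)
    return ranked[:k]


turns = [
    "my favorite color is green",
    "the deploy key lives in the vault",
    "lunch is at noon",
]
result = top_k("where is the deploy key", turns, k=1)
print(result)
```

A real build would replace the linear scan with a vector index, which is where the latency and cost trade-offs mentioned above come from.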

Each strategy is implemented as a drop-in class behind a common MemoryBackend interface, so the same agent task runner can execute against all four without code changes.
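
One way such a common interface could look is sketched below. Only the name MemoryBackend comes from the description above; the method names `add_turn` and `get_context`, the `SlidingWindowMemory` class, and the `run_task` helper are assumptions, not the project's actual API.

```python
from abc import ABC, abstractmethod
from collections import deque


class MemoryBackend(ABC):
    """Common contract every memory strategy implements (method names are assumed)."""

    @abstractmethod
    def add_turn(self, role: str, text: str) -> None:
        """Record one conversation turn."""

    @abstractmethod
    def get_context(self, query: str) -> list[str]:
        """Return the memory to place in the prompt for this query."""


class SlidingWindowMemory(MemoryBackend):
    """Keeps only the most recent max_turns turns; older turns are dropped."""

    def __init__(self, max_turns: int = 4) -> None:
        self._turns: deque[str] = deque(maxlen=max_turns)

    def add_turn(self, role: str, text: str) -> None:
        self._turns.append(f"{role}: {text}")

    def get_context(self, query: str) -> list[str]:
        # The query is ignored: a sliding window returns recent turns regardless.
        return list(self._turns)


def run_task(backend: MemoryBackend, turns: list[tuple[str, str]], question: str) -> list[str]:
    """Task runner that depends only on the interface, not on a concrete strategy."""
    for role, text in turns:
        backend.add_turn(role, text)
    return backend.get_context(question)


context = run_task(
    SlidingWindowMemory(max_turns=2),
    [("user", "a"), ("assistant", "b"), ("user", "c")],
    "q",
)
print(context)
```

Because `run_task` only touches the abstract methods, a summarization, vector, or episodic backend could be passed in without changing the runner.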

Benchmark Metrics and Task Design

The suite measures four metrics per strategy. Recall accuracy tests whether the agent can correctly answer questions about information introduced earlier in the session, a direct measure of what was retained versus lost. Context utilization measures what fraction of the available context window is occupied by relevant content rather than padding or noise. Cost per query sums token usage (input + output) across both the memory strategy's retrieval step and the final LLM call, priced at current API rates. Retrieval latency is the wall-clock time spent fetching memory before the LLM call executes.
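
The arithmetic behind three of these metrics can be sketched as follows. This is one plausible reading, not the suite's code: the `Usage` dataclass, the function names, and the per-1K-token prices are all hypothetical, and context utilization is taken here as relevant tokens divided by the full window size.

```python
import time
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for a single LLM call (hypothetical structure)."""
    input_tokens: int
    output_tokens: int


def query_cost(retrieval: Usage, final: Usage,
               usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Sum tokens across the retrieval step and the final call, then price them."""
    tokens_in = retrieval.input_tokens + final.input_tokens
    tokens_out = retrieval.output_tokens + final.output_tokens
    return tokens_in / 1000 * usd_per_1k_in + tokens_out / 1000 * usd_per_1k_out


def context_utilization(relevant_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window occupied by relevant content."""
    return relevant_tokens / window_tokens if window_tokens else 0.0


def timed(fn, *args):
    """Run fn and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


# Illustrative numbers only; the prices are placeholders, not real API rates.
cost = query_cost(Usage(500, 100), Usage(1500, 400),
                  usd_per_1k_in=0.001, usd_per_1k_out=0.002)
utilization = context_utilization(relevant_tokens=1200, window_tokens=4000)
```

For strategies with no secondary LLM call, such as a sliding window, the retrieval usage would simply be zero tokens.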

Tasks are drawn from a standardized set covering three categories: long-horizon QA (questions about facts stated 20+ turns back), cross-reference tasks (combining two pieces of information from different points in the session), and state tracking (following a variable that changes value multiple times). Each task category exposes a different failure mode in each strategy.
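
As an illustration of the state-tracking category, the sketch below defines a task where a value changes several times and scores answers by substring match. The `Task` structure, its field names, and the matching rule are assumptions; the actual suite may define and score tasks differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One benchmark task (hypothetical structure)."""
    category: str
    turns: tuple[str, ...]
    question: str
    expected: str


def is_recalled(answer: str, task: Task) -> bool:
    """Count an answer as correct if it contains the expected value."""
    return task.expected.lower() in answer.lower()


def recall_accuracy(answers: list[str], tasks: list[Task]) -> float:
    """Fraction of tasks whose answer contains the expected value."""
    hits = sum(is_recalled(a, t) for a, t in zip(answers, tasks))
    return hits / len(tasks) if tasks else 0.0


# The budget changes three times; only the latest value is correct.
state_task = Task(
    category="state_tracking",
    turns=(
        "Set the budget to 100.",
        "Raise the budget to 250.",
        "Cut the budget to 180.",
    ),
    question="What is the budget now?",
    expected="180",
)

score = recall_accuracy(["The budget is 180.", "It is 100."], [state_task, state_task])
```

An agent that only remembers the first assignment answers 100 and is scored as a miss, which is the kind of failure this category is meant to surface.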

Strategy            Recall@20  Cost/1K tokens  P50 Latency
─────────────────────────────────────────────────────────
Sliding Window      0.61       $0.0018         12ms
Summarization       0.74       $0.0041         180ms
Vector Retrieval    0.83       $0.0029         95ms
Episodic Memory     0.79       $0.0033         140ms

Results are written to a structured JSON report and a rendered HTML table for easy comparison.
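
A report writer along these lines could be as small as the sketch below. The field names (`strategy`, `recall_at_20`, `p50_latency_ms`) and the top-level `results` key are assumptions about the report schema; the two sample rows reuse figures from the table above.

```python
import html
import json

# Two rows taken from the comparison table; the key names are assumed.
results = [
    {"strategy": "Sliding Window", "recall_at_20": 0.61, "p50_latency_ms": 12},
    {"strategy": "Vector Retrieval", "recall_at_20": 0.83, "p50_latency_ms": 95},
]


def to_json(rows: list[dict]) -> str:
    """Serialize results as a structured JSON report."""
    return json.dumps({"results": rows}, indent=2)


def to_html(rows: list[dict]) -> str:
    """Render the same rows as a plain HTML table, escaping every cell."""
    headers = list(rows[0])
    head = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(row[h]))}</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


json_report = to_json(results)
html_report = to_html(results)
```

Generating both outputs from the same list of rows keeps the JSON and HTML views consistent with each other.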

Configuration and Execution Model

Each benchmark run is configured via a YAML file that specifies which strategies to include, which task categories to run, the target LLM endpoint, and the number of trials per task. The runner executes each (strategy, task) pair concurrently using asyncio and aggregates results across trials to produce stable averages.

benchmark:
  strategies: [sliding_window, summarization, vector, episodic]
  task_categories: [long_horizon_qa, cross_reference, state_tracking]
  trials_per_task: 10
  llm:
    provider: openrouter
    model: qwen/qwen3.5-9b
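
The execution model described above (every (strategy, task) pair scheduled with asyncio, then averaged over trials) might be sketched like this. `run_trial` is a hypothetical stand-in that returns a deterministic fake score instead of calling an LLM, and all function names are assumptions.

```python
import asyncio
from itertools import product
from statistics import mean


async def run_trial(strategy: str, task: str, trial: int) -> float:
    """Hypothetical stand-in for one benchmark trial; returns a fake score."""
    await asyncio.sleep(0)  # yield to the event loop, as an awaited LLM call would
    return 1.0 if (len(strategy) + len(task) + trial) % 2 == 0 else 0.0


async def run_pair(strategy: str, task: str, trials: int):
    """Run all trials for one (strategy, task) pair and average the scores."""
    scores = await asyncio.gather(*(run_trial(strategy, task, i) for i in range(trials)))
    return (strategy, task), mean(scores)


async def run_benchmark(strategies: list[str], tasks: list[str], trials: int) -> dict:
    """Schedule every (strategy, task) pair concurrently and collect the averages."""
    pairs = product(strategies, tasks)
    results = await asyncio.gather(*(run_pair(s, t, trials) for s, t in pairs))
    return dict(results)


report = asyncio.run(run_benchmark(["vector", "episodic"], ["state_tracking"], trials=4))
```

Since the work is dominated by waiting on network calls, asyncio lets many pairs overlap on a single thread; a real runner would likely also cap concurrency to respect provider rate limits.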

A --strategy CLI flag lets you run a single strategy in isolation for debugging. The --export flag writes a CSV alongside the JSON report, ready to load into pandas or a spreadsheet.
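
A minimal sketch of how these flags could be wired up with argparse and the csv module follows. The flag names --config, --strategy, and --export come from this article; treating --export as a boolean switch, the default values, and the helper names are assumptions.

```python
import argparse
import csv
import io


def build_parser() -> argparse.ArgumentParser:
    """CLI sketch; whether --export takes a value is an assumption (boolean here)."""
    parser = argparse.ArgumentParser(description="Agent memory benchmark (sketch)")
    parser.add_argument("--config", default="config.yaml", help="path to the YAML config")
    parser.add_argument("--strategy", default=None,
                        help="run only this strategy, e.g. for debugging")
    parser.add_argument("--export", action="store_true",
                        help="also write a CSV next to the JSON report")
    return parser


def rows_to_csv(rows: list[dict]) -> str:
    """Render result rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--strategy", "vector", "--export"])
csv_text = rows_to_csv([{"strategy": "vector", "recall": 0.83}])
```

The resulting CSV text can be written to disk and read back with pandas' `read_csv` or opened directly in a spreadsheet.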

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a Python benchmarking suite that evaluates four agent memory strategies — sliding window, summarization, vector retrieval, and episodic memory — against standardized tasks. Measure recall accuracy, context utilization, cost per query, and retrieval latency. Use a common MemoryBackend interface so all strategies run against the same task runner. Output results as JSON and HTML reports with a YAML config system."

NEO generates the project structure and core implementation. From there you iterate — ask it to add a new memory strategy, extend the task library with domain-specific scenarios, or build a visual dashboard that plots recall vs. cost trade-off curves per strategy. Each request builds on what's already there without re-explaining the context.

To run the finished project:

git clone https://github.com/dakshjain-1616/agent-memory-benchmark_comparision
cd agent-memory-benchmark_comparision
pip install -r requirements.txt
python main.py --config config.yaml

Configure your target LLM and strategy list in config.yaml, run the suite, and get a structured report showing which memory strategy wins on which metric for your specific workload.

NEO built a memory benchmarking suite that runs standardized agent tasks across four memory strategies and produces quantitative recall, cost, and latency comparisons. See what else NEO ships at heyneo.com.

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

VS Code: NEO in VS Code
Cursor: Install NEO for Cursor