Vol. XII · No. 05 · May 2026
Jake Cuth.
Field survey: a 2026 map of agent memory, charted from LongMemEval, LoCoMo, BEAM, RULER, ANN-Benchmarks, Stack Overflow 2025, and Datadog AI 2026. Benchmarks are vendor-reported unless noted.

Agent memory,
mapped.

Anthropic's CLAUDE.md is one extreme. A managed temporal knowledge graph is the other. Between them sit vector DBs, framework memory layers, long context windows, and the hybrids that ship in production. Four charts, one decision matrix, and a defensible answer to the question: is MEMORY.md really so bad?

The honest synthesis: MEMORY.md is the right answer for roughly 80% of solo coding-agent workflows and 0% of consumer chat products, with a large messy middle where hybrids dominate. Below: the benchmark scoreboard, the cost-vs-latency Pareto frontier, the when-to-use-what quadrant, and the adoption signals that matter.


Below a threshold, MEMORY.md is fine. Above it, it fragments fast.

74.0% Letta filesystem agent on LoCoMo
94.87% Mastra Observational on LongMemEval
91% Mem0 p95 latency cut vs full context
$24M Mem0 Seed + Series A, Oct 2025

In September 2025, Anthropic shipped a memory tool. It is a directory. It supports six operations: view, create, str_replace, insert, delete, rename. There is no vector store. There is no graph backend. It is a folder of files, edited by the model. By April 2026, a stock filesystem agent built on this primitive scored 74.0% on LoCoMo, beating Mem0's purpose-built graph variant at 68.5%. The result is awkward for the dedicated-memory category.
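To make "a folder of files, edited by the model" concrete, here is a minimal sketch of a file-backed memory store covering view, create, str_replace, insert, and delete. The `FileMemory` class, its method signatures, and the path check are illustrative assumptions; this is not Anthropic's implementation or its API.

```python
from pathlib import Path
import tempfile


class FileMemory:
    """Toy file-backed memory (hypothetical): every operation is a plain file edit."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        # Reject paths that would escape the memory directory.
        path = (self.root / name).resolve()
        if not path.is_relative_to(self.root):
            raise ValueError(f"path escapes memory root: {name}")
        return path

    def view(self, name: str = ".") -> str:
        # A directory lists its entries; a file returns its contents.
        path = self._path(name)
        if path.is_dir():
            return "\n".join(sorted(p.name for p in path.iterdir()))
        return path.read_text(encoding="utf-8")

    def create(self, name: str, text: str) -> None:
        path = self._path(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    def str_replace(self, name: str, old: str, new: str) -> None:
        # Require a unique match so an edit never lands in the wrong place.
        path = self._path(name)
        text = path.read_text(encoding="utf-8")
        if text.count(old) != 1:
            raise ValueError("old string must occur exactly once")
        path.write_text(text.replace(old, new), encoding="utf-8")

    def insert(self, name: str, line_no: int, text: str) -> None:
        # Insert `text` as a new line after line `line_no` (0 = top of file).
        path = self._path(name)
        lines = path.read_text(encoding="utf-8").splitlines()
        lines.insert(line_no, text)
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    def delete(self, name: str) -> None:
        self._path(name).unlink()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mem = FileMemory(Path(tmp) / "memories")
        mem.create("MEMORY.md", "# Preferences\n- tests: pytest\n")
        mem.str_replace("MEMORY.md", "pytest", "pytest -q")
        mem.insert("MEMORY.md", 2, "- style: black")
        print(mem.view("MEMORY.md"))
```

The whole store fits in a few dozen lines because the filesystem already provides naming, persistence, and listing. Everything a vector or graph backend would add (ranking, deduplication, conflict resolution) is left to the model doing the editing.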

But it does not generalize. On LongMemEval, full-context GPT-4o lands at 60.2% to 64%, Mem0 v1 at 49.0% in independent re-tests, Zep/Graphiti at 71.2%, and the 2026 frontier (Mastra Observational, Mem0 v2, EverMemOS, TiMem) at 83% to 95%. The gap shows up whenever facts evolve, conflicts need resolving, or hundreds of sessions stack against the same user. File-based memory degrades non-linearly above that threshold.

The field is consolidating around a hybrid. Claude Code, Cursor, and Windsurf all combine a static instruction file (CLAUDE.md, .cursor/rules, global_rules.md) with just-in-time retrieval and, for some, a learned memory store. Vector DBs are roughly 78% enterprise-adopted but losing pricing power at the bottom; long context windows reach 1M to 10M tokens but suffer measurable context rot; dedicated frameworks (Mem0, Zep, Letta) are converging on roughly the same architecture. Below the threshold, plain files. Above it, hybrid.
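The hybrid pattern described above (a static instruction file plus just-in-time retrieval) can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and the keyword-overlap scoring stands in for whatever retrieval a real product uses, which the source does not specify.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real retrieval features."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(notes: list[str], query: str, k: int = 2) -> list[str]:
    """Rank notes by shared words with the query; keep the top k that overlap at all."""
    q = tokenize(query)
    scored = [(len(q & tokenize(note)), i, note) for i, note in enumerate(notes)]
    hits = [item for item in scored if item[0] > 0]
    # Highest overlap first; earlier notes win ties.
    hits.sort(key=lambda item: (-item[0], item[1]))
    return [note for _, _, note in hits[:k]]


def build_context(static_rules: str, notes: list[str], query: str, k: int = 2) -> str:
    """Always include the static file; add retrieved notes only when relevant."""
    parts = [static_rules.strip()]
    retrieved = retrieve(notes, query, k)
    if retrieved:
        parts.append("Relevant notes:\n" + "\n".join(f"- {n}" for n in retrieved))
    return "\n\n".join(parts)


if __name__ == "__main__":
    rules = "Always run tests before committing."
    notes = [
        "The billing service uses Postgres 16.",
        "Deploys go through the staging cluster first.",
        "Billing tests need the STRIPE_KEY variable.",
    ]
    print(build_context(rules, notes, "Why do the billing tests fail?"))
```

The static file is paid for on every call (which is where caching helps), while the retrieved notes change per query. That split is the common shape behind the products named above, whatever each one uses for the retrieval step.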

Receipts · the launch arc
2023
Liu et al. publish "Lost in the Middle" with a 15-30% drop on middle-of-context retrieval
Apr 2024
OpenAI ships ChatGPT Memory (rolling-summary list, not vector DB)
Jan 2025
Zep/Graphiti paper: 94.8% DMR, 71.2% LongMemEval, 90% latency cut
Jul 2025
Chroma releases Context Rot report (18 LLMs, non-uniform degradation confirmed)
Sep 2025
Anthropic ships memory tool: a file directory, not a database
Sep 2025
Pinecone $50/month minimum, Weaviate $25/month. Hobby economics shift
Oct 2025
Mem0 raises $24M Seed + Series A; named exclusive memory provider for AWS Strands SDK
Jan 2026
TiMem (76.88%), EverMemOS (83.0%) push the frontier on benchmarks
Q1 2026
Mastra Observational hits 94.87% on LongMemEval with GPT-5-mini

Twelve approaches. One benchmark.

LongMemEval (Wu et al., ICLR 2025) is the closest thing the field has to a fair comparison. It tests cross-session memory, knowledge updates, temporal reasoning, and long-form QA. Naive full-context lands in the low 60s. Vector RAG with a single retrieval call lands lower. Knowledge graphs (Zep/Graphiti) push past 70%. Token-efficient and observational architectures cross 90%. The 2026 frontier is approaching the oracle ceiling, with caveats below.

S12.1 · LongMemEval-S overall score by approach
Naive full-context to OMEGA, in twelve bars.
Chart data · LongMemEval-S overall accuracy (%), oracle ceiling marked at 82.4%
OMEGA (GPT-4.1): 95.4
Mastra OM (GPT-5-mini): 94.87
Mem0 v2 (token-efficient): 93.4
Emergence AI (gpt-4o): 86.0
Mastra OM (gpt-4o): 84.23
EverMemOS: 83.0
Oracle (gpt-4o): 82.4
TiMem (gpt-4o-mini): 76.88
Zep / Graphiti (gpt-4o): 71.2
Full-context (GPT-4o, 115K): 62.1
Mem0 v1 (independent): 49.0
Llama-3.1-8B full-context: 43.5
Legend · 2026 frontier · Memory framework · Temporal graph · Long context / vector · Oracle ceiling
About the LongMemEval benchmark

LongMemEval-S has roughly 115K tokens of context and tests across five categories: single-session-assistant, multi-session, knowledge-update, temporal-reasoning, and single-session-user. Most quoted scores use LLM-as-judge, which introduces ~10% swing depending on judge prompt. Wu et al.'s ICLR 2025 standardized prompts are the closest thing to a fair-comparison anchor.

Self-reported "highest ever" claims (Mastra 94.87%, OMEGA 95.4%, agentmemory V4 96.2%) are vendor or single-author publications without independent replication. The original LongMemEval paper authors and the Letta team have flagged that single-pass scores often do not generalize. The honest framing: the 2026 frontier is approaching the oracle ceiling, but its margin over earlier systems is likely narrower than vendor pages suggest.

Receipts · § II
Highest published score
Mastra OM 94.87% with GPT-5-mini, OMEGA 95.4% (self-reported)
Naive full-context (GPT-4o)
60.2-64% on the same 115K inputs
Oracle ceiling
82.4% (evidence sessions only, gpt-4o)
Vector RAG (Mem0 v1, independent)
49.0% in re-test by vectorize.io
Largest temporal-reasoning gain
Mem0 old to new: 51.1 to 93.2 (+42.1pp)

Cheaper. Faster. Pick two.

The benchmark scoreboard ignores money and clock time. This chart restores both. Each bubble is one approach, plotted by tokens-per-query against median latency. Bubble size encodes the LongMemEval score where available. The Pareto frontier (the dotted line) shows the best you can get for any given cost or latency. Approaches above and to the right of the frontier are dominated.
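For readers who want the frontier test spelled out, here is a small sketch of the dominance check on two axes where lower is better. The five points are made-up placeholders labeled A to E, not values read off the chart, and the `Approach` type is hypothetical.

```python
from typing import NamedTuple


class Approach(NamedTuple):
    name: str
    tokens_per_query: float  # lower is cheaper
    latency_ms: float        # lower is faster


def dominates(a: Approach, b: Approach) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.tokens_per_query <= b.tokens_per_query and a.latency_ms <= b.latency_ms
    better = a.tokens_per_query < b.tokens_per_query or a.latency_ms < b.latency_ms
    return no_worse and better


def pareto_frontier(points: list[Approach]) -> list[Approach]:
    """Keep every point that no other point dominates, sorted by token cost."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.tokens_per_query)


# Placeholder data for illustration only.
POINTS = [
    Approach("A", 1_000, 800),
    Approach("B", 3_000, 300),
    Approach("C", 5_000, 900),      # dominated by B: more tokens and slower
    Approach("D", 100_000, 4_000),  # dominated by every other point
    Approach("E", 20_000, 200),
]

if __name__ == "__main__":
    for p in pareto_frontier(POINTS):
        print(p.name, p.tokens_per_query, p.latency_ms)
```

With these placeholder points the frontier is A, B, and E: each trades tokens for latency in a way no other point beats on both axes. Accuracy (the bubble size) is deliberately left out of the dominance test here, which matches a two-axis frontier but would change if accuracy were treated as a third objective.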

S12.2 · Tokens per query vs median latency
Where each architecture sits on the cost / latency map.
Bubble size = LongMemEval score.
Axes · tokens per query (log scale, 100 to 1M) against median latency (log scale, 10ms to 10s); the dotted line is the Pareto frontier
Legend · File-based · Framework (extract + retrieve) · Temporal graph · Vector DB · Long context only · 2026 frontier
How prompt caching changes the cost calculus

Anthropic prompt caching bills the first call to a long static prefix at a premium over the base input price (the cache write carries a surcharge), then drops subsequent cache reads to roughly 10% of base. Break-even arrives within one or two cache hits, depending on the write surcharge. For a CLAUDE.md file that is read every session, caching takes the cost from $3.00/M input tokens (Sonnet 4.6) to $0.30/M and cuts time-to-first-token by up to 85%.

That changes the answer to "is full-context too expensive?" The static answer was yes. The cached answer is "competitive on cost for repeated queries against the same history." For one-shot queries, file-based and structured memory still dominate.
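The break-even arithmetic is easy to check by hand. The sketch below uses the $3.00/M base price and the roughly 10% cached-read rate quoted above; the cache-write multiplier and the 5,000-token file size are assumptions chosen for illustration, so substitute current published pricing before relying on the output.

```python
BASE_PRICE_PER_MTOK = 3.00  # USD per million input tokens, as quoted in the text
CACHE_READ_MULT = 0.10      # cached reads at roughly 10% of base, as quoted in the text
CACHE_WRITE_MULT = 1.25     # ASSUMPTION: surcharge on the call that writes the cache


def cost_uncached(prefix_tokens: int, calls: int) -> float:
    """Every call pays full price for the static prefix."""
    return calls * prefix_tokens / 1e6 * BASE_PRICE_PER_MTOK


def cost_cached(prefix_tokens: int, calls: int,
                write_mult: float = CACHE_WRITE_MULT) -> float:
    """First call writes the cache; every later call is a cache read."""
    if calls == 0:
        return 0.0
    per_call = prefix_tokens / 1e6 * BASE_PRICE_PER_MTOK
    return per_call * (write_mult + (calls - 1) * CACHE_READ_MULT)


def break_even_calls(write_mult: float = CACHE_WRITE_MULT) -> int:
    """Smallest number of total calls at which caching is strictly cheaper."""
    calls = 1
    while cost_cached(1_000_000, calls, write_mult) >= cost_uncached(1_000_000, calls):
        calls += 1
    return calls


def cached_read_cost(prefix_tokens: int) -> float:
    """Steady-state cost of one call whose prefix is already cached."""
    return prefix_tokens / 1e6 * BASE_PRICE_PER_MTOK * CACHE_READ_MULT


if __name__ == "__main__":
    print("break-even calls at 1.25x write:", break_even_calls(1.25))
    print("break-even calls at 2.0x write:", break_even_calls(2.0))
    # ASSUMPTION: a hypothetical 5,000-token memory file.
    print("steady-state cost per query:", round(cached_read_cost(5_000), 6))
```

Under these assumptions caching is cheaper from the second call with a 1.25x write surcharge and from the third call with a 2x surcharge, that is, after one or two cache hits. A 5,000-token file read from cache costs about $0.0015 per query, so the economics are decided almost entirely by how often the same prefix is reused before the cache expires.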

Receipts · § III
Cheapest sustainable setup
File-based + caching at ~$0.0015/query
Lowest latency
pgvector / Qdrant at 5-8ms search + LLM time
Highest accuracy per token
Mastra Observational (94.87% at constant context)
Worst dominated
Naive full-context: pays for tokens, gets oracle minus 22pp
Frontier corner
Mem0 v2, Zep, Mastra OM all sit close to the optimum

One user or many? Static facts or evolving?

Most "which memory system" arguments collapse once you fix two variables: how many users you serve, and how often the facts they depend on change. File-based memory dominates one quadrant. Native product memory dominates another. Hybrid layers and temporal graphs split the remaining two.

S12.3 · Recommendation matrix by user count and fact volatility
The four quadrants of agent memory.
Axes · static facts to evolving facts, single user to many users
Single user, static facts: MEMORY.md + grep + caching (~80% of solo coding-agent workflows)
Single user, evolving facts: Hybrid file + memory layer (coding agent across many repos)
Many users, static facts: Native memory summaries + retrieval (ChatGPT, Claude.ai, Cursor Memories)
Many users, evolving facts: Mem0 / Zep framework + KG (10K+ users; sales, compliance, healthcare)
Legend · File-based · Hybrid · Native product memory · Memory framework
Why "long context" is a complement, not a quadrant winner

Long context windows (1M to 10M tokens) get treated as a substitute for memory, but the data does not support it. Chroma's Context Rot report (July 2025, 18 LLMs including GPT-4.1, Claude 4, Gemini 2.5, Qwen3) found non-uniform degradation as input length grows, even on simple repeated-words tasks.

Claude Opus 4.6 hits 76% on 8-needle MRCR at 1M tokens, vs Sonnet 4.5 at 18.5%, a 4x generation-on-generation jump. Claude Sonnet 4 holds <5% degradation across its 200K window. RULER (NVIDIA) shows only ~half of models claiming 32K+ context maintain quality at that length on multi-hop tasks. Long context belongs stacked with memory, not in place of it.

Receipts · § IV
Solo coder threshold
~30 facts, ~300 lines of MEMORY.md before quality degrades
Multi-user inflection
~10K users; below that, native memory wins, above it, frameworks
Temporal-reasoning king
Zep / Graphiti, +17.3pp on LongMemEval temporal queries vs full-context
Empty quadrant
None; every cell has at least one production-grade default
Cross-quadrant winner
Hybrid, in some form, ships in Claude Code, Cursor, Windsurf
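The matrix and the solo-coder threshold reduce to a few lines of logic. The sketch below encodes the four quadrant defaults and the approximate limits from the receipts (~30 facts, ~300 lines); the function names and the hard cutoffs are illustrative assumptions, since the survey gives the thresholds as rough figures rather than exact boundaries.

```python
from enum import Enum


class Stack(Enum):
    FILE_BASED = "MEMORY.md + grep + caching"
    HYBRID = "Hybrid file + memory layer"
    NATIVE = "Native memory summaries + retrieval"
    FRAMEWORK = "Mem0 / Zep framework + KG"


# Approximate solo-coder limits quoted in the receipts; treated as hard cutoffs here.
MAX_FACTS = 30
MAX_LINES = 300


def recommend(many_users: bool, evolving_facts: bool) -> Stack:
    """Map the two matrix variables onto the four quadrant defaults."""
    if not many_users:
        return Stack.HYBRID if evolving_facts else Stack.FILE_BASED
    return Stack.FRAMEWORK if evolving_facts else Stack.NATIVE


def outgrown_file(fact_count: int, line_count: int) -> bool:
    """True once a memory file passes either approximate threshold."""
    return fact_count > MAX_FACTS or line_count > MAX_LINES


if __name__ == "__main__":
    print(recommend(many_users=False, evolving_facts=False).value)
    print(recommend(many_users=True, evolving_facts=True).value)
    print(outgrown_file(fact_count=45, line_count=120))
```

Treating the variables as booleans is a simplification: in practice both are continuous, and a project can drift from one quadrant to the next as its fact count or user base grows.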

Mindshare. Funding. Stack Overflow.

Benchmarks measure quality. Funding and stars measure where the field thinks the bets are. Mem0 leads dedicated-memory mindshare; Letta is positioning around stateful coding agents; Graphiti crossed 20K stars in November 2025. Vector DBs (Milvus, Qdrant, Weaviate, Chroma, pgvector) split the infrastructure layer. Stack Overflow's 2025 survey puts Redis at 43% as the top AI-agent data store, ahead of every dedicated vector database.

S12.4 · GitHub stars by project (April 2026 snapshot)
Where the dollars and the stars are pointing.
Chart data · GitHub stars (thousands), April 2026 snapshot
Mem0: ~51k
Milvus: ~25k
Graphiti / Zep: ~21k
Letta: ~30k
Cognee: ~12k
Qdrant: ~9k
Weaviate: ~8k
Chroma: ~6k
Hindsight: ~4k
pgvector: ~4k
Redis: 43% (Stack Overflow survey share, not a star count)
Legend · Memory framework · Temporal graph · Vector DB · Adoption anchor (non-star)
What the survey data really says

Stack Overflow Developer Survey 2025 (49,000+ respondents): 84% use or plan to use AI tools, up from 76% in 2024, and 51% of professional developers use AI tools daily. AI agent adoption lags, though: 52% do not use agents, and 38% have no plans to adopt them. Only 39.7% of AI engineering teams report using vector DBs / AI memory in their stack (Arize survey).

Datadog's State of AI Engineering (March 2026, telemetry from 1,000+ customers): AI agent framework adoption nearly doubled year-over-year (~9% to ~18%). Claude Sonnet 4.6 hit 17% adoption in its first month. No single model dominates: GPT-4o at 22%, Sonnet 4.5 at 19%, Sonnet 4.6 at 17%, and GPT-5.4 at a similar share.

Receipts · § V
Largest funding round
Mem0 $24M Seed + Series A (Oct 2025)
Most-pulled vector DB
Milvus, ~700k Docker pulls/month
Top AI-agent data store
Redis, 43% (Stack Overflow 2025)
Pricing-driven migration
Pinecone $50/mo minimum (Sept 2025) drove hobby workloads to pgvector and Chroma
Vector-DB market
$2.38B (2025), projected $18.86B (2035), ~23% CAGR
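The growth rate in the last receipt can be checked against its own endpoints. The sketch below applies the standard compound-annual-growth formula to the two market-size figures quoted above; the `cagr` helper is illustrative.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    # $2.38B in 2025 growing to a projected $18.86B in 2035.
    rate = cagr(2.38, 18.86, 2035 - 2025)
    print(f"{rate:.1%}")  # prints 23.0%
```

The quoted ~23% is consistent with a ten-year horizon between the two figures.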

Method

Five sections, four hand-coded SVG charts, one decision matrix. The data behind each chart is synthesized from a research brief spanning the LongMemEval and LoCoMo benchmark publications, ANN-Benchmarks, vendor papers (Mem0, Zep, Letta, Mastra, EverMemOS, TiMem, OMEGA), the Stack Overflow 2025 Developer Survey, and Datadog's March 2026 State of AI Engineering.

Charts are hand-coded SVG with no charting library. Tooltips are vanilla JavaScript, ~120 lines total. Page weight is well under one megabyte. No tracking beyond the platform-level Cloudflare Web Analytics.

Sources

Benchmarks: LongMemEval (Wu et al., ICLR 2025), LoCoMo, BEAM-1M / 10M, RULER (NVIDIA), Chroma Context Rot report.

Frameworks: Mem0 paper (Chhikara et al., ECAI 2025), Zep / Graphiti, Letta, Mastra Observational, EverMemOS, TiMem.

Adoption: Stack Overflow Developer Survey 2025 (49,000+ respondents), Datadog State of AI Engineering 2026 (1,000+ customers), Arize State of AI Engineering, Amplify Partners 2025 (n=500).

Anthropic: Memory tool announcement (Sept 2025), prompt caching docs, context editing reports.

Caveats

Vendor benchmarks are contested. Mem0, Zep, and Letta have publicly criticized each other's methodologies. Letta could not reproduce Mem0's MemGPT configuration; Zep accuses Mem0 of misconfiguring concurrent search; Mem0's CTO has counter-published. Treat any single-vendor LongMemEval or LoCoMo number as directional, not gospel. LLM-as-judge variation alone causes ~10% score swings.

Self-reported "highest ever" claims (Mastra 94.87%, OMEGA 95.4%, agentmemory V4 96.2%) are vendor or single-author publications without independent replication. Pricing is volatile: LLM API pricing dropped ~12x in 36 months, with 80% in the last year alone. Pinecone moved to $50/month minimum in September 2025; Weaviate followed at $25/month.

Context rot is real but model-dependent. Gemini 2.5 Flash shows near-perfect single-needle retrieval at 1M tokens (Google research, Nov 2025); Claude Sonnet 4 holds <5% degradation across 200K. Most other frontier models show the U-shape Liu et al. described. Don't generalize "context rot" to every frontier model, but don't dismiss it either.

What this lab is not

Not a production benchmark. Numbers are synthesized from a research brief. Before any production deployment decision, validate the relevant LongMemEval / LoCoMo configuration for your actual workload.

Not a recommendation against vector DBs. Vector RAG remains the default for >10M-document corpora. The argument is narrower: for cross-session agent memory, vector RAG alone underperforms hybrid memory layers by double-digit points on published benchmarks.

Not authored by any vendor. This is independent editorial analysis. Vendor materials are cited where applicable; the synthesis is the author's.

For the question "is there really a better solution than 'if you think something is important, keep it in MEMORY.md'?" the empirically correct answer in 2026 is yes, but only above a clearly defined threshold.

Below the threshold (one user, one project, deterministic preferences, fewer than 30 facts, queries that fit in context after caching), MEMORY.md is not just adequate. It is on the Pareto frontier. The Letta filesystem result (74.0% LoCoMo, beating purpose-built memory frameworks) and Anthropic's decision to ship a literal file directory as their memory tool are the two strongest data points.

Above the threshold (multi-user, evolving facts, temporal reasoning, hundreds of sessions per user, conflicting updates that must replace rather than append), file-based memory degrades non-linearly. Mem0 gained 53.6 percentage points on assistant-memory recall by moving from flat extract-and-store to v2 multi-signal retrieval. Zep gains 17.3pp on temporal queries by modeling fact validity windows. MEMORY.md is the right answer for ~80% of solo coding-agent workflows and ~0% of consumer chat products. The middle is where the field is consolidating.
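The difference between appending and replacing is easiest to see with validity windows. The sketch below is a toy in-memory store, not Zep's or Graphiti's implementation: the `Fact` and `TemporalFacts` names are hypothetical, and it assumes facts are asserted in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still current


class TemporalFacts:
    """Toy fact store (hypothetical): new values close old windows, not erase them."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, attribute: str, value: str, when: date) -> None:
        # Close the window on any current value, then record the new one.
        for fact in self.facts:
            same_key = (fact.subject, fact.attribute) == (subject, attribute)
            if same_key and fact.valid_to is None:
                fact.valid_to = when
        self.facts.append(Fact(subject, attribute, value, when))

    def value_at(self, subject: str, attribute: str, when: date) -> Optional[str]:
        # Return the value whose window [valid_from, valid_to) contains `when`.
        for fact in self.facts:
            if (fact.subject, fact.attribute) != (subject, attribute):
                continue
            if fact.valid_from <= when and (fact.valid_to is None or when < fact.valid_to):
                return fact.value
        return None


if __name__ == "__main__":
    store = TemporalFacts()
    store.assert_fact("user", "employer", "Acme", date(2025, 1, 10))
    store.assert_fact("user", "employer", "Globex", date(2025, 9, 1))
    print(store.value_at("user", "employer", date(2025, 6, 1)))
    print(store.value_at("user", "employer", date(2026, 1, 1)))
```

An append-only MEMORY.md would hold both employer lines and leave the model to guess which one is current. Here the older fact stays queryable for "where did the user work in June?" while the current answer is unambiguous.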

FAQ

Is MEMORY.md actually a viable memory architecture for AI agents?

For solo coding agents on a single repo, yes. Letta's own evaluation showed a stock filesystem agent (GPT-4o-mini) hitting 74.0% on LoCoMo, beating Mem0's top graph variant at 68.5%. Anthropic's September 2025 memory tool is itself a file directory, not a vector database. With prompt caching, file-based memory is on the Pareto frontier for ~80% of solo workflows.

What's the highest LongMemEval score published?

OMEGA reports 95.4% on GPT-4.1 and Mastra Observational Memory reports 94.87% with GPT-5-mini; both are self-reported and neither has been independently replicated. For comparison: naive full-context GPT-4o lands at 60.2-64%, Mem0 v1 at 49.0% (independent test), Zep / Graphiti at 71.2%, Mem0 v2 token-efficient at 93.4% (self-reported). Treat any single-vendor benchmark as directional, not gospel.

Why are vector databases losing pricing power?

Pinecone introduced a $50/month minimum in September 2025; Weaviate followed at $25/month. Hobby and small workloads moved to pgvector and Chroma. Independent benchmarks show pgvectorscale at 471 QPS at 99% recall on 50M vectors, against Qdrant at 41.47 QPS, and Pinecone serverless at 1,763ms p95 vs pgvector's 63ms.

Does a long context window solve memory?

No. Chroma's Context Rot report (July 2025, 18 LLMs) found models do not process context uniformly as input length grows. On LongMemEval-S (~115K tokens), GPT-4o full-context drops to 60.2-64% versus 87-92% oracle. Claude Sonnet 4 holds <5% degradation across 200K, and Claude Opus 4.6 hit 76% on 8-needle MRCR at 1M tokens, but most other frontier models still show the U-shape Liu et al. described in "Lost in the Middle" (2023).

Which memory framework should I use for a multi-user chatbot?

Mem0 if you want a managed service with a fast extract-and-update layer. Zep / Graphiti if you need temporal reasoning over evolving facts (sales, compliance, healthcare). Mem0 has 48-54k GitHub stars, raised $24M Seed + Series A in October 2025, and is AWS Strands SDK's exclusive memory provider. Zep gains +17.3pp on temporal LongMemEval queries vs full-context, with a 90% latency reduction.