Agent memory · Primer + opinionated survey · June 2026

Memory consolidation should be refactoring, not summarization.

If an agent owns a wiki, the wiki eventually turns into a garage. It stores everything, which means it remembers nothing cleanly. This page is a primer on the memory work people are actually doing, plus my argument for the missing piece: memory systems should write their own tests before they rewrite themselves.


+8.3 pts

ReasoningBank gain on WebArena from structured reasoning memories over memory-free ReAct.

57.22%

AMA-Bench best reported accuracy for trajectory memory, a low number that should make everyone uneasy.

1,540

LoCoMo questions in the current conversational-memory benchmark family.

10M tokens

BEAM-scale context where memory systems still degrade sharply under long histories.

The problem

A wiki is not a memory. It is a place memories go to become untestable.

I’ve started to think the default agent-memory pattern has a trap in it. You give the agent a wiki or a knowledge base, you let it edit the wiki, then you add a nightly consolidation job because the wiki gets too large to fit cleanly into context.

That sounds sensible. It is sensible. But it quietly moves the hardest part of memory into an opaque batch job: deciding which details matter, which observations should become general rules, which general rules are too broad, and which facts must stay exact because the user or the task depends on them.

In software, this would be an obviously dangerous refactor. You would not merge twenty files, rename the public interface, delete a bunch of “duplicate” code, and ship it without running tests. But that is basically how a lot of agent memory consolidation works today.

The failure mode

Compression is cheap. Synthesis is expensive.

The useful move is not turning ten notes into one shorter note. The useful move is turning ten observations into a pattern the agent can actually use later.

Raw memory: episodes, quotes, errors, preferences, tool traces

→

Synthesis: dedupe, cluster, generalize, preserve exceptions

→

Consolidated memory: small enough to use, rich enough to matter

↘

Memory tests: questions, probes, regression cases, source links

The missing artifact is the box at the bottom. The agent should not merely produce a cleaner memory. It should produce a test suite that guards the meaning of the old memory.

State of practice

Practitioners are converging on a few patterns, even though nobody has the final architecture.

The boring version of agent memory is “store, embed, retrieve.” The real version is messier. Teams are learning that memory has a write path, a management path, a retrieval path, a privacy model, and a user experience. If any one of those is weak, the memory starts to feel haunted.

Memory is application-specific.

LangChain’s most useful point is also the simplest: a coding agent, a sales agent, and a personal assistant should not remember the same kinds of things. Practitioners are moving away from generic “remember everything” stores toward domain-shaped memories.

Writes are moving off the hot path.

Some systems let the agent explicitly save memory during a response. Others run extraction and consolidation in the background. The background path is attractive because it avoids latency and keeps memory hygiene separate from task execution.

Retrieval needs more than vectors.

Mem0’s recent work is a good example: semantic similarity helps, but entity linking, keyword matching, temporal handling, and rank fusion matter. The lesson is that memory retrieval is not just RAG with a nicer name.
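To make "rank fusion" concrete, here is a minimal sketch of reciprocal rank fusion, one common way to merge several rankers. I am not claiming this is what Mem0 ships; the memory ids, the three ranker names, and the constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    so a memory that ranks well in several lists beats one that ranks
    first in only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical outputs from three retrievers for the same query.
semantic = ["m3", "m1", "m7"]
keyword = ["m1", "m9", "m3"]
entity = ["m1", "m3"]

fused = reciprocal_rank_fusion([semantic, keyword, entity])
print(fused)
```

The point of the sketch is that no single ranker decides the order: `m1` wins because the keyword and entity rankers agree on it, even though the vector ranker preferred `m3`.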

What people are doing	Why they are doing it	What they are learning
Extracting structured facts from conversations	Raw chat logs are too noisy and too expensive to keep stuffing into context.	Extraction helps, but lossy extraction can erase causality, source, and exceptions.
Using memory stores with namespaces, metadata, and user controls	Production memory needs deletion, provenance, scoping, and tenant boundaries.	The hard part is not storage. The hard part is deciding what is safe and useful to store.
Combining episodic and semantic memory	Agents need exact events for recent tasks, but compact patterns for long-term behavior.	The transition from episode to semantic rule is where most silent damage happens.
Adding decay, TTLs, and pruning	Unbounded memory creates cost, privacy, latency, and retrieval-interference problems.	Forgetting is useful, but deletion without tests is just another way to lose the plot.
Benchmarking with LoCoMo, LongMemEval, BEAM, and agentic evals	People need measurable signal beyond “the demo remembered my favorite coffee.”	Benchmarks expose recall failures, but local product memory still needs local tests.

Research map

The field is converging on experience reuse, but not yet on memory regression tests.

The pieces are scattered across reflection, virtual context, memory benchmarks, continual test-time learning, selective forgetting, and old-fashioned software engineering practice.

Reflection works, but it can lie.

Reflexion showed that agents can improve by turning feedback into verbal self-critiques stored in episodic memory. It even used self-evaluation and self-written unit tests in programming tasks. The weakness is also obvious: a reflection can be wrong, overfit, or too vague to protect a future consolidation.

Reasoning memories beat raw logs.

ReasoningBank is the cleanest recent signal. It extracts titled, reusable reasoning memories from both success and failure, then retrieves them during future tasks. Google reports +8.3 points on WebArena and +4.6 on SWE-Bench Verified.

Memory evals are getting more agentic.

LoCoMo and LongMemEval test long-context conversational recall. AMA-Bench moves closer to the real problem by testing memory over states, actions, observations, tool outputs, objectives, and causal dependencies.

Thread	Representative work	What it teaches	Gap this essay targets
Reflection memory	Reflexion, Generative Agents	Agents can synthesize experience into natural-language lessons.	The lesson itself needs a test, not just a place in memory.
Virtual context	MemGPT / Letta	Agents can manage memory tiers through explicit write/read operations.	Memory operations need correctness criteria before compaction.
Reasoning memory	ReasoningBank	High-level reasoning patterns from success and failure outperform raw trajectories.	Consolidation is left as future work. The regression harness is missing.
Self-evolving memory	Evo-Memory / ReMem	Benchmarks are shifting from static recall to streaming experience reuse.	Benchmarks measure systems, but an individual agent needs local tests for its own memory.
Memory benchmarks	LoCoMo, LongMemEval, BEAM, AMA-Bench	Long-horizon memory degrades in measurable ways, especially under scale and causal retrieval.	Global benchmarks do not tell you whether last night’s wiki rewrite lost Akshay’s actual preference.
Software practice	Unit tests, golden tests, migration tests	Refactors are safe only when behavior is pinned by tests.	Agent memory lacks the equivalent of a regression suite.

Benchmark field guide

Today’s memory evals are mostly asking: can you recover the right shard from a long, messy life?

The public evals are not all testing the same thing. Some are conversational recall tests. Some are long-context retrieval tests. The newer ones are closer to agent memory: they test contradiction handling, preference drift, temporal reasoning, abstention, and whether the right memory changes behavior.

Eval	What it feels like	What it stresses	What it misses
LoCoMo	A small social world spread across many conversations.	Single-hop recall, multi-hop joins, temporal questions, and open-domain inference. The Mem0 benchmark harness scores 1,540 non-adversarial questions across those categories.	It is still mostly QA over memories, not action selection. A system can answer the question without proving it will behave better tomorrow.
LongMemEval	A personal assistant trying to answer questions after hundreds of sessions.	Knowledge updates, preference recall, assistant-stated facts, multi-session reasoning, and date arithmetic relative to the question date.	It reveals whether retrieval works, but not whether consolidation was safe. It usually does not ask, “did your nightly rewrite destroy the exception?”
BEAM	A harder production-shaped suite where histories can reach 1M–10M tokens.	Preference following, instruction following, contradiction resolution, event ordering, temporal reasoning, abstention, summarization, and multi-session reasoning.	It gets closer to production scale, but it is still an external benchmark. It cannot know the local invariants of one agent’s wiki.
AMA-Bench	An autonomous-agent memory test, not just a chat-memory test.	Memory over states, actions, observations, tool outputs, objectives, and causal dependencies in trajectories.	The reported best trajectory-memory accuracy is still low enough to suggest the category is immature, not solved.

LoCoMo-style · multi-hop

Join two facts across sessions

Memory A: “Sam mentioned that Priya won first place at the spring recital.”

Memory B: “Priya later said the piece she performed was Clair de Lune.”

Probe: “What piece was associated with Priya’s first-place recital?”

Expected: Clair de Lune.

LongMemEval-style · knowledge update

Prefer the newest compatible fact

Older memory: “The user lives in New York and likes nearby weekend hikes.”

Newer memory: “The user moved to San Francisco in March.”

Probe: “Where should the assistant assume the user currently lives?”

Expected: San Francisco, while preserving that New York is the prior location.

BEAM-style · abstention

Know when memory is absent

Memory: “The user compared two standing desks and bought the oak one.”

Probe: “What warranty length did the user choose?”

Expected: “I don’t have enough information,” not a plausible warranty guess.

Agent-memory-style · causal trajectory

Remember why a strategy changed

Trajectory: The agent tried a direct patch, tests failed, it inspected logs, discovered schema drift, then fixed the migration.

Probe: “Before trying another direct patch in this repo, what should the agent check?”

Expected: Check migration/schema drift first; that failure mode caused the earlier bad patch.

The important intuition: benchmark questions are already tiny memory unit tests. The production move is to generate local versions automatically from the agent’s own history, then run them before and after consolidation.
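As a small illustration, the knowledge-update probe above can be written as an executable check. This is a sketch under my own assumptions: the `Fact` record, the attribute names, and the dates are invented (the example only says "in March").

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    attribute: str
    value: str
    observed_on: date


def current_and_prior(facts, attribute):
    """Return (newest value, older values oldest-first) for one attribute."""
    history = sorted(
        (f for f in facts if f.attribute == attribute),
        key=lambda f: f.observed_on,
    )
    if not history:
        return None, []  # abstain rather than guess
    return history[-1].value, [f.value for f in history[:-1]]


# Hypothetical dates; only the ordering matters for the probe.
memory = [
    Fact("home_city", "New York", date(2025, 1, 10)),
    Fact("home_city", "San Francisco", date(2025, 3, 15)),
]

current, prior = current_and_prior(memory, "home_city")
missing, _ = current_and_prior(memory, "desk_warranty")
print(current, prior, missing)
```

The same function covers the abstention case: an attribute with no stored fact returns `None` instead of a plausible guess.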

The proposal

Make every consolidation job prove that it did not forget the wrong things.

The concrete proposal is simple: when an agent writes or updates durable memory, it should also create small eval items that represent what must remain recoverable after future consolidation.

These evals do not need to be grand benchmark questions. In fact, they should usually be small. “What model alias does Akshay prefer for Gemini Pro?” “Why did the agent decide not to add a source to the weekly refresh?” “What is the exception to the general rule about memory decay?” Small tests are the point.

Then, before the agent merges files, deletes notes, rewrites a wiki page, or replaces five daily observations with one synthesized rule, it runs the eval suite against the post-consolidation memory. If answers fail, the agent either repairs the consolidation or escalates the ambiguity to a human.
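A minimal sketch of that gate, assuming probes are graded by a simple phrase match (a stand-in for whatever grader a real system would use). The `Probe` type, the `gate_consolidation` name, and the expected phrase for the weekly-refresh probe are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Probe:
    question: str
    expected: str  # phrase the answer must contain (toy grading rule)


def gate_consolidation(
    probes: List[Probe], answer: Callable[[str], str]
) -> Tuple[str, List[Probe]]:
    """Run every probe against the post-consolidation memory."""
    failures = [
        p for p in probes if p.expected.lower() not in answer(p.question).lower()
    ]
    if failures:
        return "repair_or_escalate", failures
    return "accept", []


probes = [
    Probe("What is the exception to the general rule about memory decay?", "allergies"),
    # The expected phrase here is invented for the example.
    Probe("Why did the agent decide not to add a source to the weekly refresh?", "duplicate"),
]

# Toy post-consolidation memory: the second answer did not survive.
new_memory = {probes[0].question: "Decay is not safe for allergies."}


def answer_from_new_memory(question: str) -> str:
    return new_memory.get(question, "I don't have enough information.")


decision, failures = gate_consolidation(probes, answer_from_new_memory)
print(decision, [p.question for p in failures])
```

The consolidation is rejected because one probe can no longer be answered, which is exactly the signal a plain summarization job never produces.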

Consolidation and forgetting

The fix for wiki explosion is not deletion. It is tiering, decay, promotion, and tested compaction.

The degenerate case is predictable: if every observation becomes a durable file, the agent creates thousands of weakly named memories, retrieval gets noisy, and the wiki becomes an adversarial context generator. A real system needs a memory lifecycle.

1. Score before storing forever

Recent systems use importance, recency, relevance, authority, surprise, and reuse frequency. Generative Agents popularized recency/importance/relevance; newer architectures add temporal validation, source authority, and interference scores.

2. Consolidate offline

LangGraph’s framing is right: memory writes can happen in the hot path or in the background. Consolidation should usually be background work, because it is slower, reflective, and needs evals.

3. Forget by policy, not vibes

The better papers treat forgetting as a feature: TTL expiration, exponential decay, interference pruning, safety-triggered deletion, and graceful degradation from full episode → summary → gist → tombstone.
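Steps 1 and 3 can be sketched together: a weighted sum of recency, importance, and relevance in the spirit of Generative Agents, with recency modeled as exponential decay. The half-life, the weights, and the thresholds that map strength to a fidelity level are all assumptions I picked for the example.

```python
def recency(age_days, half_life_days=14.0):
    """Exponential decay: strength halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def retrieval_score(importance, relevance, age_days,
                    half_life_days=14.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of recency, importance, and relevance (each in [0, 1])."""
    w_recency, w_importance, w_relevance = weights
    return (w_recency * recency(age_days, half_life_days)
            + w_importance * importance
            + w_relevance * relevance)


def fidelity_level(strength):
    """Graceful degradation instead of a delete/keep switch.

    Thresholds are hypothetical; a real policy would tune them per memory type.
    """
    if strength > 0.5:
        return "full episode"
    if strength > 0.2:
        return "summary"
    if strength > 0.05:
        return "gist"
    return "tombstone"


for age in (0, 14, 28, 90):
    print(age, round(recency(age), 3), fidelity_level(recency(age)))
```

Keeping the score and the fidelity policy in separate functions matters: the score can change every day, while the policy is the thing a human should be able to read and audit.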

Hot (current task/session)

Full-fidelity context. Short TTL. No consolidation yet.

Warm episodic (days to weeks)

Raw events with provenance, embeddings, timestamps, importance, and source authority.

Candidate semantic (sleep phase)

Clustered observations proposed as a rule, preference, fact, exception, or procedure.

Mature semantic (long-term)

Tested, compact memory with source links and regression probes.

Tombstone (forgotten-but-auditable)

Minimal record that something was removed, why, and under which policy.
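One way to make the tiers enforceable is a small state machine. This is a sketch: the allowed transitions are my own reading of the lifecycle, and the only rule I would insist on is that nothing becomes mature semantic memory without passing its tests.

```python
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM_EPISODIC = "warm episodic"
    CANDIDATE_SEMANTIC = "candidate semantic"
    MATURE_SEMANTIC = "mature semantic"
    TOMBSTONE = "tombstone"


# Assumed transition table; a real system may allow more moves.
ALLOWED = {
    Tier.HOT: {Tier.WARM_EPISODIC, Tier.TOMBSTONE},
    Tier.WARM_EPISODIC: {Tier.CANDIDATE_SEMANTIC, Tier.TOMBSTONE},
    Tier.CANDIDATE_SEMANTIC: {Tier.MATURE_SEMANTIC, Tier.WARM_EPISODIC, Tier.TOMBSTONE},
    Tier.MATURE_SEMANTIC: {Tier.TOMBSTONE},
    Tier.TOMBSTONE: set(),
}


def move(current: Tier, target: Tier, tests_passed: bool = False) -> Tier:
    """Return the new tier, or raise if the move breaks the lifecycle."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move {current.value} -> {target.value}")
    if target is Tier.MATURE_SEMANTIC and not tests_passed:
        raise ValueError("promotion to mature semantic requires passing tests")
    return target


tier = move(Tier.WARM_EPISODIC, Tier.CANDIDATE_SEMANTIC)
tier = move(tier, Tier.MATURE_SEMANTIC, tests_passed=True)
print(tier.value)
```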

Mechanism	How people use it	Failure it prevents	Risk
TTL / passive decay	Short-lived events expire unless promoted. Microsoft’s recent human-inspired architecture uses hot, warm, and long-term tiers with TTLs and score decay.	Unbounded growth from greetings, transient task chatter, and stale observations.	Can delete rare but critical facts unless protected by priority or tests.
Recall reinforcement	MemoryBank-like systems strengthen memories when they are recalled; human-like models reduce decay for repeatedly useful memories.	Useful facts disappearing only because they are old.	Can reinforce noisy memories if retrieval is already biased.
Importance promotion	Batch jobs score events and promote only high-value ones to semantic memory. Some designs promote top bands, retain middle bands, prune low bands.	Every event becoming a permanent wiki page.	Importance scorers are easy to game and bad at “boring but vital” facts.
Interference pruning	Detect many similar memories that compete at retrieval time; merge or degrade low-value duplicates.	The “thousands of wiki files” failure where retrieval drowns in near-duplicates.	May collapse distinct exceptions into one false generalization.
Reconsolidation	When a memory is retrieved and contradicted, open a short update window instead of letting stale facts persist indefinitely.	Old preferences and facts surviving after the user changed their mind.	Can overwrite useful historical context unless the old and new facts are represented as a transition.
Safety-triggered forgetting	Delete or quarantine secrets, prompt-injection payloads, harmful instructions, or privacy-sensitive data.	Memory poisoning and retention of data the system should not keep.	Needs audit logs; silent deletion can hide why behavior changed.

My recommendation: do not let agents create arbitrary new wiki files in the hot path. Let them append raw events cheaply, then run a scheduled “sleep” job that clusters, proposes semantic memories, generates tests, and only then edits the durable wiki.
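A rough sketch of that split, assuming events carry a topic label and that grouping by topic stands in for real clustering. The function names and the `needs_tests` status are hypothetical.

```python
from collections import defaultdict

event_log = []  # append-only; the hot path never touches the durable wiki


def append_event(topic, text):
    """Hot path: a cheap append, no file creation, no rewriting."""
    event_log.append({"topic": topic, "text": text})


def sleep_job(events, min_cluster_size=2):
    """Background path: group events and propose semantic memories.

    Grouping by an exact topic key is a placeholder for embedding or
    entity clustering. Proposals are not written to the wiki here; they
    wait for generated tests.
    """
    clusters = defaultdict(list)
    for event in events:
        clusters[event["topic"]].append(event["text"])
    return [
        {"topic": topic, "sources": texts, "status": "needs_tests"}
        for topic, texts in clusters.items()
        if len(texts) >= min_cluster_size
    ]


append_event("repo-x", "Direct patch failed; tests red.")
append_event("repo-x", "Root cause was schema drift in the migration.")
append_event("smalltalk", "User said good morning.")

proposals = sleep_job(event_log)
print(proposals)
```

The greeting never becomes a wiki page because it never forms a cluster, and the repo observations only become a proposal, not a durable edit.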

Memory unit tests

A memory test is a contract between past context and future behavior.

It says: after you compress, dedupe, reorganize, or generalize this memory, the future agent must still be able to answer this probe with the right nuance.

Example memory test (preference · high priority · human-verifiable)

Source memory

Akshay prefers concise responses by default, but wants depth when the topic warrants it. He wants source labels distinguishing knowledge-base evidence, live lookup, and Boswell’s reasoning.

Probe

When replying to Akshay about an AI-field claim, what source-labeling habit should the agent preserve?

Expected answer

It should briefly label whether claims came from the knowledge base, live lookup, or Boswell’s own reasoning, while staying concise unless depth is warranted.

Failure condition

The answer omits source labels, treats all claims as memory, or expands into a long generic report without being asked.

Fact tests

Exact things that must survive: names, dates, paths, APIs, model aliases, account boundaries.

Preference tests

User-specific norms: tone, format, risk tolerance, default workflows, what not to do.

Pattern tests

Generalized lessons: “prefer file memory first” or “treat memory writes as privileged state mutation.”

Exception tests

The detail that stops a generalized rule from becoming false: “decay is not safe for allergies.”
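The example test and the four test types could be stored as a plain record like the one below. This is a sketch of one possible shape, not a schema from any existing system; the field names and the `source_ids` value are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class MemoryKind(Enum):
    FACT = "fact"
    PREFERENCE = "preference"
    PATTERN = "pattern"
    EXCEPTION = "exception"


@dataclass(frozen=True)
class MemoryTest:
    kind: MemoryKind
    priority: str
    probe: str
    expected: str
    failure_conditions: Tuple[str, ...]
    source_ids: Tuple[str, ...] = ()  # links back to raw evidence
    human_verifiable: bool = True


akshay_source_labels = MemoryTest(
    kind=MemoryKind.PREFERENCE,
    priority="high",
    probe=(
        "When replying to Akshay about an AI-field claim, "
        "what source-labeling habit should the agent preserve?"
    ),
    expected=(
        "Briefly label whether claims came from the knowledge base, live "
        "lookup, or Boswell's own reasoning, staying concise unless depth "
        "is warranted."
    ),
    failure_conditions=(
        "omits source labels",
        "treats all claims as memory",
        "expands into a long generic report without being asked",
    ),
    source_ids=("note-placeholder",),  # hypothetical id
)

print(akshay_source_labels.kind.value, len(akshay_source_labels.failure_conditions))
```

Storing failure conditions next to the expected answer is the part I care about: it tells a grader, human or model, what "wrong" looks like for this particular memory.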

The consolidation loop

The agent should quiz itself, repair itself, then ask for help when the test is suspect.

1. Extract candidate memories

2. Generate probes and expected answers

3. Consolidate into smaller wiki

4. Run memory eval suite

5. Repair failures or preserve source detail

6. Escalate dubious tests to a human
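The six steps can be sketched as one function with pluggable parts. Everything passed in (`extract`, `make_probes`, `consolidate`, `answer`, `repair`) is a hypothetical callable, and the toy versions below use string matching only to keep the sketch runnable.

```python
def consolidation_pass(raw_notes, extract, make_probes, consolidate,
                       answer, repair, max_repairs=2):
    candidates = extract(raw_notes)                               # step 1
    probes = [p for c in candidates for p in make_probes(c)]      # step 2
    memory = consolidate(candidates)                              # step 3

    def run_suite(mem):                                           # step 4
        return [p for p in probes
                if p["expected"] not in answer(mem, p["question"])]

    failures = run_suite(memory)
    attempts = 0
    while failures and attempts < max_repairs:                    # step 5
        memory = repair(memory, failures)
        failures = run_suite(memory)
        attempts += 1
    return memory, failures  # step 6: leftover failures go to a human


# Toy stand-ins: memory is a list of sentences.
raw_notes = ["Memories decay by default.", "Decay is not safe for allergies."]
extract = lambda notes: list(notes)
make_probes = lambda c: [{"question": f"Is this still known: {c}",
                          "expected": c, "source": c}]
consolidate = lambda cands: cands[:1]  # deliberately lossy: drops the exception
answer = lambda mem, question: " ".join(mem)
repair = lambda mem, failures: mem + [f["source"] for f in failures]

memory, escalated = consolidation_pass(
    raw_notes, extract, make_probes, consolidate, answer, repair)
print(memory, escalated)
```

In the toy run the lossy consolidation drops the allergy exception, the suite catches it, and the repair step restores the source detail, so nothing needs escalation.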

How to know if consolidation was good

Measure retention, usefulness, and distortion separately.

A consolidation can pass fact recall and still be bad because it destroyed the pattern. It can preserve the pattern and still be bad because it erased the exception.

Retention

Can the agent still recover the specific fact, preference, decision, source, or exception after the memory rewrite?

Generalization

Can it apply the consolidated pattern to a new but similar case without rereading the raw log?

Non-distortion

Did the consolidation make the memory broader, stronger, newer, or more certain than the evidence supports?

Retrievability

Does the right memory surface under the kind of query the future agent is likely to ask, not only under the exact test wording?

Action impact

Does the remembered pattern actually change the agent’s behavior on the next relevant task?

Human auditability

Can a human inspect the source, the synthesized claim, and the test without reverse-engineering the whole memory store?
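To keep these dimensions from collapsing into one number, I would report pass rates per dimension. A minimal sketch, assuming each probe result is tagged with the dimension it guards; the tags and sample results are made up.

```python
from collections import defaultdict


def pass_rate_by_dimension(results):
    """results: iterable of (dimension, passed) pairs -> {dimension: pass rate}."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dimension, passed in results:
        totals[dimension][0] += int(passed)
        totals[dimension][1] += 1
    return {d: passed / total for d, (passed, total) in totals.items()}


# Hypothetical results from one consolidation run.
results = [
    ("retention", True),
    ("retention", True),
    ("generalization", True),
    ("non_distortion", False),
]

rates = pass_rate_by_dimension(results)
overall = sum(passed for _, passed in results) / len(results)
print(rates, overall)
```

The overall rate looks like a comfortable 0.75, while the per-dimension view shows that every non-distortion probe failed. That is the case the single number hides.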

Human in the loop

The human should audit the tests, not every memory.

This is the part that feels small but probably matters most. A human cannot review every memory consolidation pass; doing so would defeat the point of automating it. But a human can occasionally review the tests that protect the memory.

The agent should surface a compact sample: high-priority tests, newly generated tests, tests whose expected answer changed, tests that failed then passed after repair, and tests whose source evidence is thin. That is where human judgment has leverage.

In practice, this gives you a governance layer. The agent owns the routine memory hygiene. The human owns the semantics of what “not forgotten” actually means.
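A sketch of how the review sample might be picked, assuming each test carries a few bookkeeping flags. The flag names, the "one source or fewer counts as thin" rule, and the limit are assumptions.

```python
def review_queue(tests, limit=5):
    """Return (test id, reasons) for tests worth a human's attention."""

    def reasons(test):
        found = []
        if test.get("priority") == "high":
            found.append("high priority")
        if test.get("is_new"):
            found.append("newly generated")
        if test.get("expected_changed"):
            found.append("expected answer changed")
        if test.get("failed_then_repaired"):
            found.append("failed then passed after repair")
        if test.get("source_count", 0) <= 1:
            found.append("thin source evidence")
        return found

    flagged = [(t["id"], reasons(t)) for t in tests]
    flagged = [(test_id, why) for test_id, why in flagged if why]
    # Most reasons first; the sort is stable, so ties keep input order.
    flagged.sort(key=lambda item: -len(item[1]))
    return flagged[:limit]


tests = [
    {"id": "t1", "priority": "high", "source_count": 3},
    {"id": "t2", "priority": "low", "source_count": 4},
    {"id": "t3", "priority": "low", "is_new": True, "source_count": 1},
]

queue = review_queue(tests)
print(queue)
```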

Design pattern

What I would build first

Write path

Extract candidate memories with source links, priority, expiry hints, and memory type.

Test generation

Generate one to three probes per durable memory, including exact-answer and behavioral probes.

Consolidation

Cluster related memories, rewrite them into a smaller claim, preserve exceptions and provenance.

Regression run

Ask the post-consolidation agent to answer the tests using only the new memory surface.

Repair loop

If tests fail, restore source detail, split the claim, weaken the generalization, or mark for human review.

The slogan is not “store less.” The slogan is “store less only after proving the future agent can still behave as if it learned the right thing.”

Concrete implementation

If I were implementing this tomorrow, I’d build a memory garbage collector with tests.

Restrict writes: agents can write events and propose durable memories, but cannot freely create unbounded wiki files. Durable writes go through a schema: claim, type, scope, source pointers, confidence, priority, expiry hint, and generated probes.

Run sleep-phase consolidation: every N hours or after N events, cluster by entity/topic/task, remove duplicates, identify contradictions, and produce a smaller semantic memory. Keep raw sources for a grace period.

Score decay separately from deletion: every memory has retrieval strength and retention priority. Retrieval strength can decay quickly; deletion should require a policy decision. This prevents stale facts from polluting context without instantly losing auditability.

Protect classes of memory: identity, safety, durable user preferences, active projects, account boundaries, and explicit “remember this” facts should decay slowly or require human approval to delete. Chit-chat, transient plans, duplicate observations, and failed intermediate attempts can decay fast.

Gate compaction on tests: before replacing 200 files with 12 semantic notes, run local memory probes. If fact, exception, or behavior tests fail, split the note or preserve source detail. If the test itself seems bad, ask for human review.
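The write schema and the "decay is not deletion" rule can be sketched together. The field names follow the schema listed above; the type labels, the decay factor, and the choice to require human approval for every protected type (rather than merely slowing decay) are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed labels for the protected classes of memory.
PROTECTED_TYPES = {
    "identity", "safety", "preference",
    "active_project", "account_boundary", "explicit_remember",
}


@dataclass
class DurableMemory:
    claim: str
    type: str
    scope: str
    source_pointers: List[str]
    confidence: float
    priority: str
    expiry_hint: Optional[str]
    probes: List[str]
    retrieval_strength: float = 1.0  # decays; does not imply deletion


def validate_write(memory: DurableMemory) -> List[str]:
    """Reject durable writes that arrive without evidence or tests."""
    errors = []
    if not memory.source_pointers:
        errors.append("missing source pointers")
    if not memory.probes:
        errors.append("missing generated probes")
    if not 0.0 <= memory.confidence <= 1.0:
        errors.append("confidence must be in [0, 1]")
    return errors


def decay(memory: DurableMemory, factor: float = 0.9) -> None:
    """Lower retrieval strength only; the record itself stays auditable."""
    memory.retrieval_strength *= factor


def may_delete(memory: DurableMemory, human_approved: bool = False) -> bool:
    """Deletion is a policy decision, separate from decay."""
    return human_approved or memory.type not in PROTECTED_TYPES


note = DurableMemory(
    claim="Akshay wants source labels on AI-field claims.",
    type="preference", scope="user:akshay",
    source_pointers=["event-placeholder"],  # hypothetical pointer
    confidence=0.9, priority="high", expiry_hint=None,
    probes=["What source-labeling habit should the agent preserve?"],
)
decay(note)
print(validate_write(note), note.retrieval_strength, may_delete(note))
```

After decay the note surfaces less often, but it still cannot be deleted without approval because it is a durable user preference.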

There is a tempting but wrong way to think about this, which is that memory tests are just another benchmark. They are not. A benchmark tells you whether your memory architecture is good in general. A memory test tells you whether this agent, with this history, still knows the thing it promised to know.

That distinction matters because personal and organizational agents accumulate weird local knowledge. They know that one repo has a misleading README, that one teammate uses “soon” to mean “this quarter,” that one customer’s legal name differs from the name in Slack, that one user wants sharp source-labeled answers but not a wall of citations. These facts are too local for public benchmarks, but too important to throw into a lossy summary.

The best version of the idea also creates pressure in the right direction. If an agent has to write a test for a memory, it has to understand why the memory matters. If it cannot write a meaningful probe, maybe the memory was not important. If every consolidation fails the same class of probe, maybe the system is preserving facts but losing causality. If the human keeps editing the expected answers, maybe the agent is learning the wrong abstraction.

This is why I think the analogy to unit tests is more than cute. Unit tests changed refactoring because they gave programmers permission to simplify code while protecting behavior. Memory unit tests could do the same thing for agents: give them permission to compress, generalize, and reorganize memory while protecting the user’s actual continuity.

The open research question is whether agents can generate good enough tests for their own memories without poisoning the process. I think the answer is probably “yes, but only with provenance, sampling, and human review.” Reflexion already shows that agents can use self-written tests and feedback to improve code tasks. ReasoningBank shows that agents can distill reusable lessons from trajectories. Evo-Memory shows that the benchmark frontier is moving toward streaming experience reuse. The missing production pattern is to bind each durable lesson to a durable check.

That is the important part. A memory system that cannot tell whether it forgot something is not a memory system. It is a diary with a delete button.

Sources and further reading

The research trail

These are the pieces I would put on the reading list before turning the idea into a production spec.

ReasoningBank: Google Research
ReasoningBank paper
Evo-Memory / ReMem
Reflexion
Generative Agents
MemGPT
Mem0 memory benchmarks
Memory for Autonomous LLM Agents survey
Survey on LLM-agent memory mechanisms
LangChain: Memory for agents
Mem0: token-efficient memory algorithm
FSFM: selective forgetting
Human-inspired memory architecture
Human-like recall and consolidation

Working claim: the closest existing work proves the ingredients. I have not found a clean production pattern that treats an agent’s self-generated memory probes as a first-class regression suite for consolidation. That is the novelty worth pushing.