Why we built this
LongMemCode v0.1 measured one-shot retrieval — "given a query, did the memory system return the right symbol?" — across 16 corpora and 35 scenario sub-types. Useful, but we kept seeing a gap. When we sampled 933 user prompts from three production developer sessions on three different stacks (Remix, Next.js, Rust), about half of all prompts were not fresh investigation tasks at all. They were follow-ups, approvals, status checks, redirects, repeated lookups across a session.
A memory system can ace ApiDiscovery and still bleed token budget if every "ok" or "and how is it cancelled?" reloads the world. v0.3 is the first benchmark we know of that scores those workflow shapes alongside the structural ones, deterministically, without an LLM judge.
What's new in v0.3
v0.2 added four scenario categories that capture sustained-session behaviour:
- ConversationalContinuation — follow-up queries on prior context (pronoun, approval, redirect patterns).
- RereadCascade — same lookup repeated 3-5× per session.
- BashFeedbackLoop — compiler/test errors mapped to source symbols, grounded in real bug-fix commits.
- SubAgentContextHandoff — main agent → sub-agent process handoff.
v0.3 adds oracle pinning. Three of those four categories shipped in v0.2 with a deliberately weak oracle — "any non-empty response counts as a pass." That caught crashes but did not distinguish "returned the right symbols" from "returned three random unrelated ones." v0.3 runs the reference adapter offline, captures the exact top-3 returned stable identifiers, and pins them as the canonical expected set. After pinning, the suite is adversarially scoreable: an adapter that returns different non-empty results fails rather than passes.
Of 361 weak-oracle scenarios in v0.2, 272 received pinned oracles in v0.3. The remaining 89 were skipped — the reference adapter itself returned empty for those queries (typically test-private symbols not surfaced by the canonical SCIP output) — and are flagged via `expected.pin_skipped_reason` in the JSON, so reviewers know not to weight their score signal heavily. Full rationale, including why pinning against a single reference is not as circular as it sounds, is in the v0.3 methodology document.
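The pinning pass can be sketched as follows. This is a minimal illustration, not the real runner code: apart from `pin_skipped_reason`, every function and field name here is an assumption.

```python
def pin_oracles(scenarios, run_reference_adapter):
    """Freeze the reference adapter's top-3 stable identifiers as the
    canonical expected set; flag scenarios the reference cannot answer.

    `run_reference_adapter` is a hypothetical callable standing in for
    the offline reference run described in the text.
    """
    for scenario in scenarios:
        top3 = run_reference_adapter(scenario["query"])[:3]
        if top3:
            # Pinned: a different non-empty result now fails instead of passing.
            scenario["expected"] = {"pinned_symbols": top3}
        else:
            # Reference returned empty (e.g. a test-private symbol absent from
            # the canonical SCIP output): skip pinning and flag it instead.
            scenario["expected"] = {"pin_skipped_reason": "reference_empty"}
    return scenarios
```

The key property is the `else` branch: a scenario is never pinned to an empty set, it is marked skipped so reviewers can discount it.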
The corpus
Kubernetes v1.32.0, the largest corpus we ship. Numbers, all reproducible by running `./corpora/kubernetes.sh`:
- 333 MB of Go source (after `git clone --depth=1`)
- 128 MB SCIP index produced by scip-go
- 16 MB ArgosBundle, containing 38 771 symbols and 232 756 call-graph edges
- Bundle id (deterministic from corpus content): `a38adae682e89f9ec86b59373c2b57cbd85d3aadc29bcedc9a5fe9beea506008`
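A content-derived id like the one above behaves like a content hash; a plain SHA-256 over the bundle bytes reproduces the idea. Whether ArgosBundle hashes the raw file or some canonicalised form is an assumption here, so treat this as a sketch of the property, not the actual derivation.

```python
import hashlib

def bundle_id(path: str) -> str:
    """Compute a deterministic id from file content: same bytes in,
    same 64-hex-char id out, regardless of when or where it is built."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so large bundles need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```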
The 953 scenarios
Distribution across categories, all stratified deterministically:
| Category | n | passed | avg score |
|---|---|---|---|
| ConversationalContinuation | 261 | 261 | 100.00% |
| Completion | 161 | 157 | 97.52% |
| RereadCascade | 109 | 105 | 96.33% |
| BashFeedbackLoop | 100 | 100 | 100.00% |
| BugFix | 100 | 100 | 100.00% |
| FeatureAdd | 52 | 52 | 100.00% |
| ApiDiscovery | 50 | 50 | 100.00% |
| SubAgentContextHandoff | 50 | 50 | 100.00% |
| Refactor | 42 | 42 | 100.00% |
| TestGen | 28 | 28 | 100.00% |
| TOTAL | 953 | 945 | 99.16% |
8 of 10 categories at 100%. The two with misses (Completion, RereadCascade) inherit query shapes from the v0.1 suite, where ArgosBrain previously scored 99.25% — fully consistent with that baseline.
The 100 real bug-fix commits
The most rigorous category in the suite is BashFeedbackLoop, where every single scenario is grounded in a real Kubernetes commit. The mining process:
- Fetch the full Kubernetes git history (`git fetch --unshallow`) — 127 126 commits at v1.32.0.
- Filter for fix commits via `git log --grep "^fix:" --grep "^bug:" --grep "^Fix bug" -i`.
- For each commit, take the first `.go` file changed.
- Extract a Go function or type identifier from the current version of that file.
- Synthesise a Go-style compiler error pointing at that symbol (`undefined: X`, `X redeclared in this block`, etc.).
- Ground truth: the symbol is the exact one the fix commit touched. The commit SHA is recorded in each scenario's `context.rationale_commit` field, so any reviewer can `git show <sha>` to verify.
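The commit-filtering step above can be sketched like this. Only the `git log` flags come from the text; the `mine_fix_commits` helper and its shape are illustrative, not the actual generator code.

```python
import subprocess

def mine_fix_commits(repo: str, limit: int = 100) -> list[str]:
    """List commit SHAs whose subject matches the fix patterns
    (case-insensitive, OR-combined, as multiple --grep flags are)."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H",
         "-i", "--grep", "^fix:", "--grep", "^bug:", "--grep", "^Fix bug"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[:limit]
```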
Result: ArgosBrain mapped 100 of 100 compiler errors to the exact symbol the engineer touched in the real fix. Every commit is verifiable on GitHub. Generator code lives at `tools/generate_v2_kubernetes_scenarios.py`.
Latency and cost
| Metric | Value | Note |
|---|---|---|
| p50 latency | 0.008 ms | 8 microseconds median |
| p95 latency | 0.331 ms | 95th percentile under a third of a millisecond |
| p99 latency | 0.404 ms | worst case under half a millisecond |
| Total wall-clock | 87 ms | all 953 scenarios end-to-end |
| Cost / 1k queries | $0.0000 | no LLM call on the read path, ever |
The eight misses (we publish them)
Out of 953 scenarios, ArgosBrain missed 8. All eight trace to one root cause across four bare names: the v0.1 oracle picked an obscure variant of a common bare name, while ArgosBrain's structural ranker preferred the production-code variant.
| Bare name | What ArgosBrain returned | What v0.1 expected |
|---|---|---|
| Manager | kubelet/token/Manager, kubelet/runtimeclass/Manager, … | kubelet/secret/Manager |
| Service | pkg/apis/core/Service, k8s.io/api/core/v1/Service | test/e2e/storage/drivers/csi-test/mock/service/Service |
| Graph | plugin/pkg/auth/authorizer/node/Graph | third_party/forked/gonum/graph/Graph |
| testAction | (no result) | test/integration/apiserver/discovery/testAction |
ArgosBrain's ranker prefers production code over vendored, mock, and private-test variants. For a developer asking "find Manager", returning `kubelet/runtimeclass/Manager` is more useful than `csi-test/mock/Service`. The benchmark scenarios pin one specific variant as ground truth, so we lose those scenarios. We publish the failures because they illustrate a deliberate trade-off, not a hidden bug.
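The production-first preference can be illustrated with a simple path heuristic. ArgosBrain's actual ranker is structural and not public; the function below is only a sketch of the bias the misses expose, and the marker list is an assumption.

```python
def rank_variants(candidates: list[str]) -> list[str]:
    """Order symbol paths so production code sorts ahead of vendored,
    mock, and test-only variants (stable sort preserves input order
    among equally-penalised paths)."""
    def penalty(path: str) -> int:
        demoted = ("third_party/", "vendor/", "/mock/", "test/", "/testing/")
        return sum(1 for marker in demoted if marker in path)
    return sorted(candidates, key=penalty)
```

Under this heuristic the Service miss in the table is exactly what you would expect: `pkg/apis/core/Service` outranks the `csi-test` mock variant the v0.1 oracle pinned.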
Reproduce it yourself
```sh
# 1. Clone the benchmark
git clone https://github.com/CataDef/LongMemCode.git
cd LongMemCode

# 2. Build the Kubernetes corpus (clones K8s v1.32.0, runs scip-go, builds bundle)
./corpora/kubernetes.sh

# 3. Build the runner + adapter
cargo build --release --bin lmc-runner --bin lmc-adapter-argosbrain

# 4. Run on v0.3 — should match the published 99.16%
./target/release/lmc-runner \
  --adapter ./target/release/lmc-adapter-argosbrain \
  --adapter-args "--corpus corpora/_work/kubernetes/kubernetes.argosbundle" \
  --scenarios scenarios/kubernetes-v3.json \
  --out my-result.json

# 5. Diff against published
diff <(jq .summary my-result.json) \
     <(jq .summary results/argosbrain-kubernetes-v3-2026-04-24.json)
```
What this does and does not measure
v0.3 measures sustained-session retrieval quality at scale on a real production corpus. It does not measure end-to-end agent task success (that is SWE-bench's job), and it does not measure code-generation quality (that is HumanEval / MBPP territory). It measures the memory layer in isolation, which is what an MCP-style memory server is responsible for.
It also does not yet measure:
- Edit-cascade invalidation (RereadCascade is repeated reads, not reads with intermediate edits — runner support for an `op: "edit"` between queries is v0.4 work).
- Multi-process consistency under concurrent load (Brain Manager IPC).
- Multimodal context (images, PDFs, audio) — out of scope for v0.3.
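To make the first gap concrete, an edit-cascade scenario would need to interleave reads with edits. The shape below is purely hypothetical: only the `op: "edit"` operation name appears in this post, and every other field name is invented for illustration.

```python
# Hypothetical v0.4 edit-cascade scenario: read, edit, re-read.
# The second query must reflect the intermediate edit rather than
# return the stale pre-edit symbol.
edit_cascade = [
    {"op": "query", "q": "find Manager"},
    {"op": "edit", "file": "pkg/kubelet/token/token_manager.go",
     "change": "rename Manager -> TokenManager"},
    {"op": "query", "q": "find Manager"},  # stale result here = fail
]
```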
The honest framing on Mem0, Zep, Letta
We do not benchmark Mem0, Zep, or Letta as competitors here, because they are not competitors. They solve a different problem (conversational memory between user and agent) than ArgosBrain does (structural memory of source code). A production coding agent benefits from running both classes of memory; they are complementary, not substitutable. Earlier versions of LongMemCode did benchmark Mem0 on the Kubernetes corpus (4.93% accuracy at 1 677 ms p99, ~$0.20 per 1 000 queries); those numbers reflect the scope mismatch, not Mem0's quality on its core workload.
Try it
One command, sixty seconds:
curl -fsSL https://argosbrain.com/install | sh
Free tier is genuinely free, forever — one project, all retrieval features, no credit card. Pro and Team unlock unlimited projects + the full sink scanning, reachability, and architectural-drift toolchain. Pricing on the homepage.
Authors: ArgosBrain Team · Date: 2026-04-24 · License: CC BY 4.0 · Benchmark: github.com/CataDef/LongMemCode