Engineering writeup · v0.3 · 2026-04-24

Stress-Testing Code Memory at Kubernetes Scale

We ran the new LongMemCode v0.3 suite — 953 scenarios mined from real Kubernetes v1.32.0 git history and production coding-agent traces — through ArgosBrain. Result: 99.16% accuracy, p99 latency 0.404 ms, $0 per query, 87 ms total wall-clock for the full suite. Eight honest misses, all on the same four ambiguous bare-name lookups. Methodology, every query, and every returned symbol are public on GitHub for byte-for-byte reproduction.

Verifiable · public

Every claim in this writeup links back to the public LongMemCode repo. Open the JSON, run the generator, replay the adapter — the numbers regenerate byte-for-byte from the same Kubernetes commit (v1.32.0) and the same bundle id (a38adae6…506008).

Live scoreboard →  ·  Result JSON →  ·  Per-scenario audit (12 MB) →  ·  All 953 scenarios →

Why we built this

LongMemCode v0.1 measured one-shot retrieval — "given a query, did the memory system return the right symbol?" — across 16 corpora and 35 scenario sub-types. Useful, but we kept seeing a gap. When we sampled 933 user prompts from three production developer sessions on three different stacks (Remix, Next.js, Rust), about half of all prompts were not fresh investigation tasks at all. They were follow-ups, approvals, status checks, redirects, repeated lookups across a session.

A memory system can ace ApiDiscovery and still bleed token budget if every "ok" or "and how is it cancelled?" reloads the world. v0.3 is the first benchmark we know of that scores those workflow shapes alongside the structural ones, deterministically, without an LLM judge.

What's new in v0.3

v0.2 added four scenario categories that capture sustained-session behaviour.

v0.3 adds oracle pinning. Three of those four categories shipped in v0.2 with a deliberately weak oracle — "any non-empty response counts as a pass." That caught crashes but did not distinguish "returned the right symbols" from "returned three random unrelated ones." v0.3 runs the reference adapter offline, captures the exact top-3 returned stable identifiers, and pins them as the canonical expected set. After pinning, the suite is adversarially scoreable: an adapter that returns different non-empty results fails rather than passes.

Of 361 weak-oracle scenarios in v0.2, 272 received pinned oracles in v0.3. The remaining 89 were skipped — the reference adapter itself returned empty for those queries (typically test-private symbols not surfaced by the canonical scip output) — and are flagged via expected.pin_skipped_reason in the JSON so reviewers know not to weight their score signal heavily. Full rationale, including why pinning against a single reference is not as circular as it sounds, is in the v0.3 methodology document.
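A minimal sketch of what pinning does, under assumed field names (the real schema lives in the LongMemCode repo; `StubAdapter` and the oracle values below are illustrative, not the actual code):

```python
class StubAdapter:
    """Illustrative stand-in for the reference adapter's query interface."""
    def __init__(self, answers):
        self.answers = answers

    def query(self, text):
        return self.answers.get(text, [])


def pin_oracles(scenarios, reference_adapter):
    """Replace weak 'any non-empty response passes' oracles with the
    reference adapter's exact top-3 stable identifiers."""
    for scenario in scenarios:
        if scenario["expected"].get("oracle") != "non_empty":
            continue  # already strongly scored
        results = reference_adapter.query(scenario["query"])
        top3 = [r["stable_id"] for r in results[:3]]
        if top3:
            scenario["expected"] = {"oracle": "pinned", "stable_ids": top3}
        else:
            # reference returned empty (e.g. a test-private symbol):
            # skip pinning and flag the scenario for reviewers
            scenario["expected"]["pin_skipped_reason"] = "reference_adapter_empty"
    return scenarios
```

After this pass, any adapter that returns a different non-empty top-3 fails the pinned scenarios instead of passing them.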

The corpus

Kubernetes v1.32.0 is the largest corpus we ship. All corpus numbers are reproducible by running ./corpora/kubernetes.sh.

The 953 scenarios

Distribution across categories, all stratified deterministically:

Category                     n    passed  avg score
ConversationalContinuation   261  261     100.00%
Completion                   161  157     97.52%
RereadCascade                109  105     96.33%
BashFeedbackLoop             100  100     100.00%
BugFix                       100  100     100.00%
FeatureAdd                   52   52      100.00%
ApiDiscovery                 50   50      100.00%
SubAgentContextHandoff       50   50      100.00%
Refactor                     42   42      100.00%
TestGen                      28   28      100.00%
TOTAL                        953  945     99.16%

Eight of the ten categories score 100%. The two with misses (Completion, RereadCascade) inherit query shapes from the v0.1 suite, where ArgosBrain previously scored 99.25%, so the result is fully consistent with that baseline.
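The aggregate row follows directly from the per-category counts, which any reader can re-check:

```python
# (n, passed) per category, copied from the results table
categories = {
    "ConversationalContinuation": (261, 261),
    "Completion": (161, 157),
    "RereadCascade": (109, 105),
    "BashFeedbackLoop": (100, 100),
    "BugFix": (100, 100),
    "FeatureAdd": (52, 52),
    "ApiDiscovery": (50, 50),
    "SubAgentContextHandoff": (50, 50),
    "Refactor": (42, 42),
    "TestGen": (28, 28),
}
total_n = sum(n for n, _ in categories.values())       # 953
total_passed = sum(p for _, p in categories.values())  # 945
accuracy = round(100 * total_passed / total_n, 2)      # 99.16
```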

The 100 real bug-fix commits

The most rigorous category in the suite is BashFeedbackLoop, where every single scenario is grounded in a real Kubernetes commit. The mining process:

  1. Fetch the full Kubernetes git history (git fetch --unshallow) — 127 126 commits at v1.32.0.
  2. Filter for fix commits via git log --grep "^fix:" --grep "^bug:" --grep "^Fix bug" -i (multiple --grep patterns OR together; -i makes the match case-insensitive).
  3. For each commit, take the first .go file changed.
  4. Extract a Go function or type identifier from the current version of that file.
  5. Synthesise a Go-style compiler error pointing at that symbol (undefined: X, X redeclared in this block, etc).
  6. Ground truth: the symbol is the exact one the fix commit touched. The commit SHA is recorded in each scenario's context.rationale_commit field, so any reviewer can git show <sha> to verify.
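The mining steps above reduce to plain git plumbing plus a regex. A hypothetical condensation of steps 2-4 (the real generator is tools/generate_v2_kubernetes_scenarios.py, and its helper names differ):

```python
import re
import subprocess

FIX_PATTERNS = ["^fix:", "^bug:", "^Fix bug"]

def list_fix_commits(repo):
    """Step 2: SHAs whose subject matches a fix pattern.
    Multiple --grep flags OR together; -i ignores case."""
    cmd = ["git", "-C", repo, "log", "--format=%H", "-i"]
    for pattern in FIX_PATTERNS:
        cmd += ["--grep", pattern]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.split()

def first_go_file(repo, sha):
    """Step 3: the first .go file changed by the commit, if any."""
    out = subprocess.run(
        ["git", "-C", repo, "show", "--name-only", "--format=", sha],
        capture_output=True, text=True, check=True).stdout
    return next((f for f in out.split() if f.endswith(".go")), None)

# Step 4: a top-level Go function or type declaration name
GO_DECL = re.compile(r"^func (?:\([^)]*\) )?(\w+)|^type (\w+)", re.M)

def extract_identifier(source):
    m = GO_DECL.search(source)
    return (m.group(1) or m.group(2)) if m else None
```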

Result: ArgosBrain mapped 100 of 100 compiler errors to the exact symbol the commit author touched in the real fix. Every commit is verifiable on GitHub. Generator code at tools/generate_v2_kubernetes_scenarios.py.

Latency and cost

p50 latency         0.008 ms   8 microseconds median
p95 latency         0.331 ms   95th percentile under a third of a millisecond
p99 latency         0.404 ms   99th percentile under half a millisecond
Total wall-clock    87 ms      all 953 scenarios end-to-end
Cost / 1k queries   $0.0000    no LLM call on the read path, ever
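The percentile rows can be recomputed from the per-scenario latencies in the audit JSON. A nearest-rank percentile (one common convention, assumed here rather than taken from the runner's source) is only a few lines:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that at
    least p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# illustrative latency distribution, not the published data
latencies_ms = [0.008] * 900 + [0.300] * 45 + [0.404] * 8
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```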

The eight misses (we publish them)

Out of 953 scenarios, ArgosBrain missed 8. All eight reduce to the same four root causes — common bare names where the v0.1 oracle picked an obscure variant and ArgosBrain's structural ranker preferred the production-code variant.

Bare name   What ArgosBrain returned                                What v0.1 expected
Manager     kubelet/token/Manager, kubelet/runtimeclass/Manager, …  kubelet/secret/Manager
Service     pkg/apis/core/Service, k8s.io/api/core/v1/Service       test/e2e/storage/drivers/csi-test/mock/service/Service
Graph       plugin/pkg/auth/authorizer/node/Graph                   third_party/forked/gonum/graph/Graph
testAction  (no result)                                             test/integration/apiserver/discovery/testAction

ArgosBrain's ranker prefers production code over vendored, mock, and private-test variants. For a developer asking "find Service", returning pkg/apis/core/Service is more useful than test/e2e/storage/drivers/csi-test/mock/service/Service. The benchmark scenarios pin one specific variant as ground truth, so ArgosBrain loses those scenarios. We publish the failures because they illustrate a deliberate ranking trade-off, not a bug.
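ArgosBrain's actual ranker is not public; as a hypothetical illustration of the trade-off, a path-marker tiebreak that sinks vendored, mock, and test variants could look like this (the marker list is invented for the example):

```python
# markers that demote a candidate path (illustrative, not ArgosBrain's list)
DEPRIORITIZED = ("vendor/", "third_party/", "/mock/", "test/")

def production_rank(candidate_paths):
    """Order bare-name matches so production-code paths sort first;
    each marker found in a path adds one penalty point."""
    def penalty(path):
        return sum(marker in path for marker in DEPRIORITIZED)
    return sorted(candidate_paths, key=lambda path: (penalty(path), path))
```

Under such a heuristic, plugin/pkg/auth/authorizer/node/Graph outranks third_party/forked/gonum/graph/Graph, matching the behaviour shown in the misses table.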

Reproduce it yourself

# 1. Clone the benchmark
git clone https://github.com/CataDef/LongMemCode.git
cd LongMemCode

# 2. Build the Kubernetes corpus (clones K8s v1.32.0, runs scip-go, builds bundle)
./corpora/kubernetes.sh

# 3. Build the runner + adapter
cargo build --release --bin lmc-runner --bin lmc-adapter-argosbrain

# 4. Run on v0.3 — should match the published 99.16%
./target/release/lmc-runner \
    --adapter ./target/release/lmc-adapter-argosbrain \
    --adapter-args "--corpus corpora/_work/kubernetes/kubernetes.argosbundle" \
    --scenarios scenarios/kubernetes-v3.json \
    --out my-result.json

# 5. Diff against published
diff <(jq .summary my-result.json) \
     <(jq .summary results/argosbrain-kubernetes-v3-2026-04-24.json)

What this does and does not measure

v0.3 measures sustained-session retrieval quality at scale on a real production corpus. It does not measure end-to-end agent task success (that is SWE-bench's job), and it does not measure code-generation quality (that is HumanEval / MBPP territory). It measures the memory layer in isolation, which is what an MCP-style memory server is responsible for.

It also does not yet measure:

The honest framing on Mem0, Zep, Letta

We do not benchmark Mem0, Zep, or Letta as competitors here, because they are not competitors. They solve a different problem (conversational memory between user and agent) than ArgosBrain solves (structural memory of source code). A production coding agent benefits from running both classes of memory — they are complementary, not substitutable. Earlier versions of LongMemCode did benchmark Mem0 on the Kubernetes corpus (4.93% accuracy at 1 677 ms p99 with ~$0.20 per 1 000 queries) for honest framing; those numbers reflect scope mismatch, not Mem0 quality on its core workload.

Try it

One command, sixty seconds:

curl -fsSL https://argosbrain.com/install | sh

Free tier is genuinely free, forever — one project, all retrieval features, no credit card. Pro and Team unlock unlimited projects + the full sink scanning, reachability, and architectural-drift toolchain. Pricing on the homepage.


Authors: ArgosBrain Team · Date: 2026-04-24 · License: CC BY 4.0 · Benchmark: github.com/CataDef/LongMemCode