Engineering writeup · v0.3 · 2026-04-24

Stress-Testing Code Memory at Kubernetes Scale

We ran the new LongMemCode v0.3 suite — 953 scenarios mined from real Kubernetes v1.32.0 git history and production coding-agent traces — through ArgosBrain. Result: 99.16% accuracy, p99 latency 0.404 ms, $0 per query, 87 ms total wall-clock for the full suite. Eight honest misses, all on the same four ambiguous bare-name lookups. Methodology, every query, and every returned symbol are public on GitHub for byte-for-byte reproduction.

Verifiable · public

Every claim in this writeup links back to the public LongMemCode repo. Open the JSON, run the generator, replay the adapter — the numbers regenerate byte-for-byte from the same Kubernetes commit (v1.32.0) and the same bundle id (a38adae6…506008).

Live scoreboard →  ·  Result JSON →  ·  Per-scenario audit (12 MB) →  ·  All 953 scenarios →

Why we built this

LongMemCode v0.1 measured one-shot retrieval — "given a query, did the memory system return the right symbol?" — across 16 corpora and 35 scenario sub-types. Useful, but we kept seeing a gap. When we sampled 933 user prompts from three production developer sessions on three different stacks (Remix, Next.js, Rust), about half of all prompts were not fresh investigation tasks at all. They were follow-ups, approvals, status checks, redirects, repeated lookups across a session.

A memory system can ace ApiDiscovery and still bleed token budget if every "ok" or "and how is it cancelled?" reloads the world. v0.3 is the first benchmark we know of that scores those workflow shapes alongside the structural ones, deterministically, without an LLM judge.

What's new in v0.3

v0.2 added four scenario categories that capture sustained-session behaviour.

v0.3 adds oracle pinning. Three of those four categories shipped in v0.2 with a deliberately weak oracle — "any non-empty response counts as a pass." That caught crashes but did not distinguish "returned the right symbols" from "returned three random unrelated ones." v0.3 runs the reference adapter offline, captures the exact top-3 returned stable identifiers, and pins them as the canonical expected set. After pinning, the suite is adversarially scoreable: an adapter that returns different non-empty results fails rather than passes.

Of 361 weak-oracle scenarios in v0.2, 272 received pinned oracles in v0.3. The remaining 89 were skipped — the reference adapter itself returned empty for those queries (typically test-private symbols not surfaced by the canonical scip output) — and are flagged via expected.pin_skipped_reason in the JSON so reviewers know not to weight their score signal heavily. Full rationale, including why pinning against a single reference is not as circular as it sounds, is in the v0.3 methodology document.
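A minimal sketch of what pinning does, under assumed field names (the real schema lives in the LongMemCode repo; `StubAdapter` and the oracle values below are illustrative, not the actual code):

```python
class StubAdapter:
    """Illustrative stand-in for the reference adapter's query interface."""
    def __init__(self, answers):
        self.answers = answers

    def query(self, text):
        return self.answers.get(text, [])


def pin_oracles(scenarios, reference_adapter):
    """Replace weak 'any non-empty response passes' oracles with the
    reference adapter's exact top-3 stable identifiers."""
    for scenario in scenarios:
        if scenario["expected"].get("oracle") != "non_empty":
            continue  # already strongly scored
        results = reference_adapter.query(scenario["query"])
        top3 = [r["stable_id"] for r in results[:3]]
        if top3:
            scenario["expected"] = {"oracle": "pinned", "stable_ids": top3}
        else:
            # reference returned empty (e.g. a test-private symbol):
            # skip pinning and flag the scenario for reviewers
            scenario["expected"]["pin_skipped_reason"] = "reference_adapter_empty"
    return scenarios
```

After this pass, any adapter that returns a different non-empty top-3 fails the pinned scenarios instead of passing them.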

The corpus

Kubernetes v1.32.0 is the largest corpus we ship. All corpus numbers are reproducible by running ./corpora/kubernetes.sh.

The 953 scenarios

Distribution across categories, all stratified deterministically:

Category                     n    passed  avg score
ConversationalContinuation   261  261     100.00%
Completion                   161  157     97.52%
RereadCascade                109  105     96.33%
BashFeedbackLoop             100  100     100.00%
BugFix                       100  100     100.00%
FeatureAdd                   52   52      100.00%
ApiDiscovery                 50   50      100.00%
SubAgentContextHandoff       50   50      100.00%
Refactor                     42   42      100.00%
TestGen                      28   28      100.00%
TOTAL                        953  945     99.16%

Eight of the ten categories score 100%. The two with misses (Completion, RereadCascade) inherit query shapes from the v0.1 suite, where ArgosBrain previously scored 99.25%, so the result is fully consistent with that baseline.
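The aggregate row follows directly from the per-category counts, which any reader can re-check:

```python
# (n, passed) per category, copied from the results table
categories = {
    "ConversationalContinuation": (261, 261),
    "Completion": (161, 157),
    "RereadCascade": (109, 105),
    "BashFeedbackLoop": (100, 100),
    "BugFix": (100, 100),
    "FeatureAdd": (52, 52),
    "ApiDiscovery": (50, 50),
    "SubAgentContextHandoff": (50, 50),
    "Refactor": (42, 42),
    "TestGen": (28, 28),
}
total_n = sum(n for n, _ in categories.values())       # 953
total_passed = sum(p for _, p in categories.values())  # 945
accuracy = round(100 * total_passed / total_n, 2)      # 99.16
```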

The 100 real bug-fix commits

The most rigorous category in the suite is BashFeedbackLoop, where every single scenario is grounded in a real Kubernetes commit. The mining process:

  1. Fetch the full Kubernetes git history (git fetch --unshallow) — 127 126 commits at v1.32.0.
  2. Filter for fix commits via git log --grep "^fix:" --grep "^bug:" --grep "^Fix bug" -i (multiple --grep patterns OR together; -i makes the match case-insensitive).
  3. For each commit, take the first .go file changed.
  4. Extract a Go function or type identifier from the current version of that file.
  5. Synthesise a Go-style compiler error pointing at that symbol (undefined: X, X redeclared in this block, etc).
  6. Ground truth: the symbol is the exact one the fix commit touched. The commit SHA is recorded in each scenario's context.rationale_commit field, so any reviewer can git show <sha> to verify.
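The mining steps above reduce to plain git plumbing plus a regex. A hypothetical condensation of steps 2-4 (the real generator is tools/generate_v2_kubernetes_scenarios.py, and its helper names differ):

```python
import re
import subprocess

FIX_PATTERNS = ["^fix:", "^bug:", "^Fix bug"]

def list_fix_commits(repo):
    """Step 2: SHAs whose subject matches a fix pattern.
    Multiple --grep flags OR together; -i ignores case."""
    cmd = ["git", "-C", repo, "log", "--format=%H", "-i"]
    for pattern in FIX_PATTERNS:
        cmd += ["--grep", pattern]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.split()

def first_go_file(repo, sha):
    """Step 3: the first .go file changed by the commit, if any."""
    out = subprocess.run(
        ["git", "-C", repo, "show", "--name-only", "--format=", sha],
        capture_output=True, text=True, check=True).stdout
    return next((f for f in out.split() if f.endswith(".go")), None)

# Step 4: a top-level Go function or type declaration name
GO_DECL = re.compile(r"^func (?:\([^)]*\) )?(\w+)|^type (\w+)", re.M)

def extract_identifier(source):
    m = GO_DECL.search(source)
    return (m.group(1) or m.group(2)) if m else None
```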

Result: ArgosBrain mapped 100 of 100 compiler errors to the exact symbol the commit author touched in the real fix. Every commit is verifiable on GitHub. Generator code at tools/generate_v2_kubernetes_scenarios.py.

Latency and cost

p50 latency         0.008 ms   8 microseconds median
p95 latency         0.331 ms   95th percentile under a third of a millisecond
p99 latency         0.404 ms   99th percentile under half a millisecond
Total wall-clock    87 ms      all 953 scenarios end-to-end
Cost / 1k queries   $0.0000    no LLM call on the read path, ever
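The percentile rows can be recomputed from the per-scenario latencies in the audit JSON. A nearest-rank percentile (one common convention, assumed here rather than taken from the runner's source) is only a few lines:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that at
    least p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# illustrative latency distribution, not the published data
latencies_ms = [0.008] * 900 + [0.300] * 45 + [0.404] * 8
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```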

The eight misses (we publish them)

Out of 953 scenarios, ArgosBrain missed 8. All eight reduce to the same four root causes — common bare names where the v0.1 oracle picked an obscure variant and ArgosBrain's structural ranker preferred the production-code variant.

Bare name   What ArgosBrain returned                                What v0.1 expected
Manager     kubelet/token/Manager, kubelet/runtimeclass/Manager, …  kubelet/secret/Manager
Service     pkg/apis/core/Service, k8s.io/api/core/v1/Service       test/e2e/storage/drivers/csi-test/mock/service/Service
Graph       plugin/pkg/auth/authorizer/node/Graph                   third_party/forked/gonum/graph/Graph
testAction  (no result)                                             test/integration/apiserver/discovery/testAction

ArgosBrain's ranker prefers production code over vendored, mock, and private-test variants. For a developer asking "find Service", returning pkg/apis/core/Service is more useful than test/e2e/storage/drivers/csi-test/mock/service/Service. The benchmark scenarios pin one specific variant as ground truth, so ArgosBrain loses those scenarios. We publish the failures because they illustrate a deliberate ranking trade-off, not a bug.
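ArgosBrain's actual ranker is not public; as a hypothetical illustration of the trade-off, a path-marker tiebreak that sinks vendored, mock, and test variants could look like this (the marker list is invented for the example):

```python
# markers that demote a candidate path (illustrative, not ArgosBrain's list)
DEPRIORITIZED = ("vendor/", "third_party/", "/mock/", "test/")

def production_rank(candidate_paths):
    """Order bare-name matches so production-code paths sort first;
    each marker found in a path adds one penalty point."""
    def penalty(path):
        return sum(marker in path for marker in DEPRIORITIZED)
    return sorted(candidate_paths, key=lambda path: (penalty(path), path))
```

Under such a heuristic, plugin/pkg/auth/authorizer/node/Graph outranks third_party/forked/gonum/graph/Graph, matching the behaviour shown in the misses table.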

Reproduce it yourself

# 1. Clone the benchmark
git clone https://github.com/CataDef/LongMemCode.git
cd LongMemCode

# 2. Build the Kubernetes corpus (clones K8s v1.32.0, runs scip-go, builds bundle)
./corpora/kubernetes.sh

# 3. Build the runner + adapter
cargo build --release --bin lmc-runner --bin lmc-adapter-argosbrain

# 4. Run on v0.3 — should match the published 99.16%
./target/release/lmc-runner \
    --adapter ./target/release/lmc-adapter-argosbrain \
    --adapter-args "--corpus corpora/_work/kubernetes/kubernetes.argosbundle" \
    --scenarios scenarios/kubernetes-v3.json \
    --out my-result.json

# 5. Diff against published
diff <(jq .summary my-result.json) \
     <(jq .summary results/argosbrain-kubernetes-v3-2026-04-24.json)

What this does and does not measure

v0.3 measures sustained-session retrieval quality at scale on a real production corpus. It does not measure end-to-end agent task success (that is SWE-bench's job), and it does not measure code-generation quality (that is HumanEval / MBPP territory). It measures the memory layer in isolation, which is what an MCP-style memory server is responsible for.

It also does not yet measure:

The honest framing on Mem0, Zep, Letta

We do not benchmark Mem0, Zep, or Letta as competitors here, because they are not competitors. They solve a different problem (conversational memory between user and agent) than ArgosBrain solves (structural memory of source code). A production coding agent benefits from running both classes of memory — they are complementary, not substitutable. Earlier versions of LongMemCode did benchmark Mem0 on the Kubernetes corpus (4.93% accuracy at 1 677 ms p99 with ~$0.20 per 1 000 queries) for honest framing; those numbers reflect scope mismatch, not Mem0 quality on its core workload.

Try it

One command, sixty seconds:

curl -fsSL https://argosbrain.com/install | sh

Free tier is genuinely free, forever — one project, all retrieval features, no credit card. Pro and Team unlock unlimited projects + the full sink scanning, reachability, and architectural-drift toolchain. Pricing on the homepage.


Authors: ArgosBrain Team · Date: 2026-04-24 · License: CC BY 4.0 · Benchmark: github.com/CataDef/LongMemCode