Paper 4 · ArgosBrain field report · 2026-04-25

ArgosBrain on Kubernetes 1.32.0: Two Live Runs of an MCP-Served Code-Memory Engine

We report two end-to-end runs of an MCP-served code-memory engine on Kubernetes v1.32.0 (17,171 source files, 303,722 symbol nodes, 2,245,124 call-graph edges). Run A applies a 22-category security sink library and scores exploitability per hit. Run B asks the engine for an architectural code tour. Across both runs the experiments consumed 29,500 tokens and $0.44 in API spend; wall-clock was 70 seconds for the security pass and 6 seconds for the code tour. The agent's own analysis surfaced four limitations in our default pattern library, each of which is fixed in the v0.8.9 release accompanying this paper.

Author
Aurelian Jibleanu
Neurogenesis · April 2026
Reproduces
v0.8.9 release on macOS / Linux
Anthropic Claude Opus 4.7 (driver) · Kubernetes v1.32.0 (subject) · Wall-clock measured on Apple Silicon (M-series)
License
CC BY 4.0 (paper) · BUSL-1.1 (engine)
Adapter stubs and methodology open. Engine commercial.
01 · Motivation

Why test against Kubernetes specifically.

An MCP-served code-memory engine that helps an AI agent on a 200-file SaaS prototype proves nothing about enterprise readiness. Kubernetes v1.32.0 is the opposite extreme: a Go-dominant monorepo of 17,171 source files, audited continuously by Google, Red Hat, and the CNCF Security Technical Advisory Group, with Snyk and Semgrep running upstream as commercial scanners. Any code-memory engine that is going to host the structural retrieval layer of an enterprise AI workflow has to scale to this corpus and produce sensible output on it. Anything less is marketing.

This paper records, in detail, what happened when we ran two distinct workflows over the same fresh ingest of v1.32.0: a 22-category security audit (Run A), and a from-scratch architectural code tour (Run B). The runs were driven by Anthropic's Claude Opus 4.7 over the standard Model Context Protocol stdio transport; ArgosBrain ran in-process as a Rust binary. We report headline economics, structural findings, false-positive analysis, and the library limitations the experiment exposed — every one of which is now fixed in the v0.8.9 release that accompanies this paper.

02 · Setup

Subject and engine.

Subject corpus. Kubernetes v1.32.0 release source archive. Counts after ingest: 17,171 files, 303,722 symbol nodes (functions, methods, types, constants), 2,245,124 inbound-call edges in the call-graph layer, 30,599 distinct function definitions. Tree-sitter handled the parse; Go is the dominant language, with peripheral Bash, YAML, Dockerfile, and protobuf chunks left untouched by structural queries.

Engine. ArgosBrain v0.8.9 (the release that accompanies this paper). The engine is in-process Rust (petgraph + custom indexing layers) reading from a local brain.bin snapshot. The snapshot read is byte-identical between runs because the graph state sits behind the file-lock guarantee added in v0.8.9 (Paper 3 describes the retrieval architecture; the lock is new in v0.8.9 and prevents two MCP processes from concurrently mutating the brain).

Driver. Anthropic Claude Opus 4.7 issuing tool calls over MCP stdio. Token usage was sampled from Anthropic's billing telemetry; pricing applied is Opus's published $15 per million input tokens.

Methodology. Run A executed against a fresh brain (cold cache); Run B started from the warm brain Run A produced. Each run was a single agent session driven by one top-level user prompt; we did not steer the agent mid-run.

03 · Run A · Security audit

22 sink categories. 33 MCP tool calls. Zero reachable critical findings.

The agent applied the full default sink library (cross-site scripting, SQL injection, SSRF, RCE, command injection, path traversal, deserialisation, hardcoded secrets, weak crypto, insecure random, XXE, LDAP injection, prototype pollution, open redirect, buffer overflow, unsafe Rust, regex DoS, timing attack, crypto IV reuse, CORS wildcard, insecure cookie flags, JWT alg=none, TLS verification disabled, cloud API key, PEM private-key block — 22 categories after merging the cmd_injection split). For each category the agent ran find_sinks(kind=...), then triage_sinks(...) for exploitability scoring, then check_reachability(...) to determine whether any source-to-sink path of depth ≤ 8 existed.
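
As a minimal sketch, the per-category loop looks like the following. The three tool names and the depth-8 bound are taken from the run itself; the SinkTools trait, the Finding shape, and the exploitability field are illustrative stand-ins for the real MCP schema, not the engine's actual types.

#[derive(Clone)]
struct Finding {
    id: u64,
    exploitability: f64, // triage score; no Run A hit exceeded 0.55
}

// Stand-in for the MCP tool surface the agent called 33 times in Run A.
trait SinkTools {
    fn find_sinks(&self, kind: &str) -> Vec<Finding>;
    fn triage_sinks(&self, hits: Vec<Finding>) -> Vec<Finding>;
    /// True if any tracked source reaches this sink within `max_depth` call edges.
    fn check_reachability(&self, id: u64, max_depth: usize) -> bool;
}

fn audit_category(tools: &dyn SinkTools, kind: &str) -> Vec<Finding> {
    let hits = tools.find_sinks(kind);     // raw pattern matches
    let scored = tools.triage_sinks(hits); // exploitability scoring
    scored
        .into_iter()
        .filter(|f| tools.check_reachability(f.id, 8)) // keep reachable hits only
        .collect()
}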

Results

  • 🔴 Critical — 0. Triage scored every hit at exploitability ≤ 0.55. Reachability scan from any tracked source (HTTP entry, CLI handler, env-var read, file-read syscall) found 0 paths to any rce or cmd_injection sink within depth 8 (see the reachability sketch after this list).
  • 🟠 High — 0 reachable. The pattern library produced 18 high-score raw matches: 10 eval( RCE candidates and 8 sh -c command-injection candidates. All 18 were structurally unreachable. The 10 eval( hits were Go method names — EvalUser, EvalClaimMapping, a sync-track get lazy-loader — i.e. CEL evaluation methods, not Python eval. The 8 sh -c hits were inside test/e2e/ orchestrating subprocess pods, not server-side request paths.
  • 🟡 Medium — 0 reachable production findings. Ten InsecureSkipVerify sites surfaced, every one documented inline by upstream Kubernetes engineers: kubelet liveness probes ("Prober that will skip TLS verification while probing"); container-lifecycle clients carrying the explicit comment "must not be modified to include credentials"; loopback-only componentstatus checks against 127.0.0.1; the peer proxy, which authenticates with service-account tokens at a separate layer; and kubeadm bootstrap, which runs before the TLS trust chain exists. Each is architecturally constrained to a bootstrap or loopback context. No reachable attack path.
  • 🪦 Dead-code candidates — 0. Every symbol in the call-graph has at least one inbound caller — expected for a mature monorepo with end-to-end test coverage.
  • ✅ Verified clean. 0 SSRF candidates after substring + regex passes. 0 SQLi (etcd-backed; no SQL drivers imported). 0 XSS (no HTML rendering in server code). 0 path traversal patterns. 0 deserialisation gadgets (no pickle / Java ObjectInputStream equivalents). 0 hardcoded secrets in production paths (all matches were under test/). 0 timing-attack comparisons. 0 insecure cookie flags. 0 JWT alg=none patterns. 0 unsafe Rust (Go codebase). 0 catastrophic-backtracking (ReDoS) regexes.
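
The reachability step reduces to a depth-bounded breadth-first search over the call graph. A minimal sketch with petgraph, assuming edges point caller → callee; the engine's actual traversal and indexing are more involved.

use std::collections::{HashSet, VecDeque};

use petgraph::graph::{DiGraph, NodeIndex};
use petgraph::Direction;

// Is any sink reachable from `source` within `max_depth` call edges?
// Run A used max_depth = 8 from HTTP entries, CLI handlers, env-var
// reads, and file-read syscalls.
fn reachable_within(
    g: &DiGraph<&str, ()>, // node = symbol, edge = caller -> callee
    source: NodeIndex,
    sinks: &HashSet<NodeIndex>,
    max_depth: usize,
) -> bool {
    let mut seen = HashSet::from([source]);
    let mut queue = VecDeque::from([(source, 0usize)]);
    while let Some((node, depth)) = queue.pop_front() {
        if sinks.contains(&node) {
            return true;
        }
        if depth == max_depth {
            continue; // depth budget exhausted along this path
        }
        for next in g.neighbors_directed(node, Direction::Outgoing) {
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    false
}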

Token cost & wall-clock

Run A consumed approximately 22,000 input tokens across 33 MCP tool calls plus 8 regex-fallback patterns. Wall-clock was approximately 70 seconds: 60 seconds of one-time initial ingest (tree-sitter parse, symbol-node construction, inbound-edge tabulation, brain.bin snapshot) and 10 seconds for the agent's actual scan. Subsequent runs against the warm brain skip the 60-second ingest entirely; we measured 9.4–11.2 seconds for repeat scans on the same brain.

At Anthropic Opus 4.7 input pricing of $15 per million tokens, total API cost for Run A was approximately $0.33.
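
The arithmetic is a straight tokens-times-rate product. A two-line check against the figures above:

// Input-token cost at Opus's published $15 per million input tokens.
fn api_cost_usd(input_tokens: u64) -> f64 {
    input_tokens as f64 * 15.0 / 1_000_000.0
}

fn main() {
    println!("Run A: ${:.2}", api_cost_usd(22_000)); // $0.33
    println!("Run B: ${:.2}", api_cost_usd(7_500));  // $0.11
}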

04 · Run B · Architectural code tour

11 MCP queries. 6 seconds. The engineering culture, deduced from edges.

For Run B we asked the agent to produce a from-scratch architectural overview of Kubernetes v1.32.0 — the kind of document a senior engineer writes for a new hire after spending a month with the codebase. The agent issued 11 queries against the warm brain and produced its tour in approximately 6 seconds wall-clock for ~7,500 tokens, total cost $0.11.

What structural memory revealed

The architectural spine. Inbound-edge analysis on the call-graph identified the highest-fan-in subtrees as staging/src/k8s.io/apimachinery and staging/src/k8s.io/client-go. The cmd/* binaries (kube-apiserver, kube-controller-manager, kubelet, kubectl) were correctly identified as leaf nodes — they hold main entry points and link against the libraries, not the other way around. No human supplied this structure; the call-graph proved it.
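
A minimal sketch of the fan-in query behind that finding, again over a petgraph call graph. The engine serves this from its index; recomputing it per query, as below, is the naive form.

use petgraph::graph::DiGraph;
use petgraph::Direction;

// Rank symbols by inbound-call degree. On the v1.32.0 graph the high-fan-in
// mass sits under staging/src/k8s.io/apimachinery and client-go, while the
// cmd/* binaries rank near zero, i.e. leaves.
fn top_fan_in<'a>(g: &DiGraph<&'a str, ()>, k: usize) -> Vec<(&'a str, usize)> {
    let mut ranked: Vec<_> = g
        .node_indices()
        .map(|n| (g[n], g.neighbors_directed(n, Direction::Incoming).count()))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // descending by fan-in
    ranked.truncate(k);
    ranked
}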

The reactive heartbeat. Of the 30,599 distinct function definitions, the agent identified NewSharedIndexInformer as the central reactive primitive — the function whose downstream call closure dominates the controller plane. Its summary verdict to the imagined new hire was "Once you grok this, you grok 90% of the controllers." A flat-RAG approach would have to summarise hundreds of controller files to reach the same observation; the call-graph reaches it from a single inbound-degree query.

Naming convention modulo machine-generated noise. The agent ran naming_convention across the corpus and observed that 58% of function names violated typical Go convention (e.g. PascalCase exports for unexported helpers). It then queried the file paths of the offending names and observed they were exclusively under zz_generated_*.go — Kubernetes' machine-emitted deepcopy and conversion code. After excluding the generated subset, the human convention surfaced cleanly: camelCase for private helpers, PascalCase for exported APIs, idiomatic. This is the convention an AI agent should mimic when adding code; mimicking the machine-generated style would produce non-idiomatic patches.
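
A minimal sketch of the generated-versus-human discrimination: exclude the machine-emitted files by path, then classify what remains. The zz_generated prefix is from the run; the classifier is a simplification of what the naming_convention tool checks.

// Kubernetes' machine-emitted deepcopy/conversion files, identified by path.
fn is_generated(path: &str) -> bool {
    path.rsplit('/')
        .next()
        .is_some_and(|file| file.starts_with("zz_generated") && file.ends_with(".go"))
}

// The human convention that surfaces once generated files are excluded:
// PascalCase for exported APIs, camelCase for private helpers.
fn follows_go_convention(name: &str, exported: bool) -> bool {
    match name.chars().next() {
        Some(c) if exported => c.is_ascii_uppercase(),
        Some(c) => c.is_ascii_lowercase(),
        None => false,
    }
}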

Graceful degradation. When the agent invoked community detection over the 303,722-node graph, the engine's internal latency monitor aborted the call before it could exceed the sub-millisecond P99 budget. The fallback path — directory-shape heuristics — returned an answer within budget, with the hit explicitly tagged as a heuristic. This is not transparent fallback (which would silently degrade trust); it is labelled fallback (the consumer knows which algorithm produced the result).
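
In code, labelled fallback is just a type distinction. A minimal sketch, with hypothetical method names and elided bodies; the point is that the heuristic answer arrives as a different variant, not a silently substituted one.

use std::time::{Duration, Instant};

struct CallGraph; // stand-in for the 303,722-node graph

enum Communities {
    Exact(Vec<Vec<String>>),     // true community detection, within budget
    Heuristic(Vec<Vec<String>>), // directory-shape grouping, tagged as such
}

impl CallGraph {
    // Runs community detection, returning None once `deadline` passes.
    fn detect_communities_until(&self, _deadline: Instant) -> Option<Vec<Vec<String>>> {
        None // elided
    }
    fn directory_shape_groups(&self) -> Vec<Vec<String>> {
        Vec::new() // elided
    }
}

fn communities(g: &CallGraph, budget: Duration) -> Communities {
    match g.detect_communities_until(Instant::now() + budget) {
        Some(groups) => Communities::Exact(groups),
        // The monitor aborted the exact pass; answer from directory shape,
        // and say so in the type the consumer receives.
        None => Communities::Heuristic(g.directory_shape_groups()),
    }
}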

What this means in practice

Run B is a working demonstration of the structural-versus-semantic split argued in Paper 2. The architectural spine and the heartbeat were both identified by graph traversal, not by similarity search. The convention-versus-generated discrimination is similarly structural — it relied on file-path joins and edge-source filtering, not on text embeddings. Where a flat-RAG approach would have ingested megabytes of code into LLM context to reach approximate versions of these conclusions, the engine returned them in a single-digit number of MCP calls.

05 · Economics

The two-baseline disclosure.

The most credible way to report cost reduction is to publish both the worst-case and the realistic baseline, so that any reader who wants to challenge the claim has the full picture. We do.

Baseline                                        Tokens (Run A)   Cost @ Opus 4.7   Reduction
Naive read-every-file
  (5K tok/file × 17,171 files)                  ~85,800,000      $1,287            99.97%
Realistic agentic baseline
  (grep + selective reads, ~150 files × 5K tok) ~750,000         $11.25            97.1%
ArgosBrain Run A (security)                     ~22,000          $0.33
ArgosBrain Run B (code tour)                    ~7,500           $0.11

The 99.97% figure is technically defensible only against the naive baseline. The 97.1% figure survives a sceptical reading because it compares against what an agent that actually uses grep would consume. We claim both. Either figure places ArgosBrain in a different cost regime entirely — at $0.44 for both runs combined, an enterprise can wire this into per-PR CI without budget approval.

The other axis worth disclosing is the warm-cache cost. The 70-second wall-clock for Run A is dominated by 60 seconds of one-time ingest. Subsequent re-scans on the same brain finished in 9.4–11.2 seconds per run in our measurements. This is what makes per-PR scanning economically viable: you pay the 60-second ingest tax once when a project is first added, and only delta-ingest after that.

06 · Library gaps observed

What the agent told us about our own engine.

One of the more useful properties of running an agent against your own tool is that the agent is not invested in your tool's reputation. When the patterns are wrong it says so, in writing, in the report. We treat this section as a bug report we couldn't have written ourselves.

Gap 1 · buffer_overflow: substring gets( matched 158 Go methods.

Go's interface convention names getters with a trailing s: PodDisruptionBudgets, GetTargets, *Getter. Our buffer-overflow rule fired on the substring without a leading word-boundary check, producing 158 false positives on the K8s corpus (Go is not a memory-unsafe language, and it has no gets function in its standard library).

v0.8.9 fix. The default sink-pattern table now carries an explicit SinkRefinement table layered above it. Patterns flagged in this table are matched with (a) a leading word boundary and (b) an applicable-language allowlist. gets(, strcpy(, and sprintf( are now restricted to c / cpp chunks. Tests in security_sinks::tests::gets_pattern_does_not_match_go_methods guard against regression.
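
A minimal sketch of the refinement layer, assuming the regex crate; the SinkRefinement field names are illustrative, while the two checks (leading word boundary, language allowlist) are the shipped behaviour.

use regex::Regex;

struct SinkRefinement {
    raw: &'static str,                  // original substring pattern
    languages: &'static [&'static str], // empty = applicable to any language
}

impl SinkRefinement {
    fn fires(&self, line: &str, lang: &str) -> bool {
        // (b) applicable-language allowlist: gets( never fires on Go chunks.
        if !self.languages.is_empty() && !self.languages.contains(&lang) {
            return false;
        }
        // (a) leading word boundary: PodDisruptionBudgets( and GetTargets( no
        // longer match, because the character before the hit is a word character.
        let re = Regex::new(&format!(r"\b{}", regex::escape(self.raw))).unwrap();
        re.is_match(line)
    }
}

// gets(, strcpy(, sprintf( are now restricted to C / C++ chunks.
const GETS: SinkRefinement = SinkRefinement {
    raw: "gets(",
    languages: &["c", "cpp"],
};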

Gap 2 · rce: substring eval( matched Go CEL evaluators.

Common-expression-language methods EvalUser and EvalClaimMapping triggered the Python-style eval( RCE pattern, as did a sync-track lazy getter. None of these are Python's eval dynamic-evaluation primitive. The match was a substring artefact, not semantic.

v0.8.9 fix. Word-boundary on the leading edge so EvalUser( no longer matches the substring eval(. Real Python eval(user_input) still matches because the preceding character is whitespace or punctuation. The applicable-language allowlist for eval( remains empty (the pattern can fire on any language); only the boundary changed.
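
A regression-test-style check of the boundary fix, in the spirit of the gets( test named above. The exact pattern spelling in the shipped table is an assumption, and lazyEval is a hypothetical compound name; the asserted behaviour is what the fix specifies.

use regex::Regex;

#[test]
fn eval_boundary_rejects_go_cel_call_sites() {
    let re = Regex::new(r"(?i)\beval\(").unwrap();
    assert!(re.is_match("result = eval(user_input)")); // real Python eval
    assert!(!re.is_match("m.EvalUser(claims)"));       // CEL method, not eval(
    assert!(!re.is_match("lazyEval(x)"));              // hypothetical compound:
                                                       // 'y' is a word char, so no \b
}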

Gap 3 · cloud_api_key: substring key- matched 128 Go field names.

The Mailgun API-key prefix is genuinely key-<32 hex chars>. We ship both a substring pattern key- and a regex form key-[a-f0-9]{32}; the regex was correct, but the substring trip-wire fired on every key-id, signing-key-version, and --key-file CLI flag in the K8s codebase. The regex pass alone is sufficient to catch real Mailgun keys.

v0.8.9 fix. Word-boundary on the substring keeps it from firing on hyphenated field-name compounds. A future v0.9.x extension will add Shannon-entropy filtering on the value half, lifting precision further.
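
For reference, the planned entropy filter is only a few lines; this is a sketch of the idea, not shipped code. A random 32-hex-char key approaches the 4 bits/char hex maximum, while field-name compounds like signing-key-version score far lower.

use std::collections::HashMap;

// Shannon entropy in bits per character of the candidate value half.
fn entropy_bits_per_char(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = s.chars().count() as f64;
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2() // contribution of one symbol
        })
        .sum()
}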

Gap 4 · weak_crypto: engine returned 0 hits, regex fallback found 7.

Seven production imports of crypto/md5 and crypto/sha1 existed in the corpus. After read-confirmation the agent classified all seven as appropriate — they were used for cache keying / endpoint hashing, not for security primitives. But silent zero is worse than flagged-and-reviewed zero: the engine should have surfaced them so a human reviewer could decide.

v0.8.9 fix. Imports of weakly-collision-resistant hash functions are now surfaced as findings tagged with the new SinkConfidence::PatternOnly bucket. Agents that want to skip these during triage can fast-path on the confidence label; humans reviewing the report still see them.
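
A minimal sketch of that fast-path, using the four variant names documented in the release notes below; the filter itself is illustrative, not the engine's triage code.

#[derive(Clone, Copy, PartialEq, Eq)]
enum SinkConfidence {
    HighConfidence,        // regex precision + reachable callers
    StructurallyReachable, // substring precision + reachable callers
    PatternOnly,           // pattern hit, no inbound callers
    LibraryNoise,          // known-noisy pattern on this language
}

struct SinkMatch {
    file: String,
    confidence: SinkConfidence,
}

// Agents spend triage tokens only where the label warrants it; the full
// report still lists every match, so human reviewers see PatternOnly hits.
fn triage_queue(matches: Vec<SinkMatch>, skip_pattern_only: bool) -> Vec<SinkMatch> {
    matches
        .into_iter()
        .filter(|m| m.confidence != SinkConfidence::LibraryNoise)
        .filter(|m| !(skip_pattern_only && m.confidence == SinkConfidence::PatternOnly))
        .collect()
}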

v0.8.9 also ships

Beyond the four pattern-library fixes, v0.8.9 introduces:

  • OS-level file lock on brain.bin. A previously unsurfaced race window allowed two argosbrain-mcp processes (e.g. the previous and the freshly-spawned MCP child during an editor restart) to concurrently mutate the same brain. The new lock detects the conflict at startup and refuses to load the second copy, with an actionable error message identifying the holder PID and start time. Cross-platform via fs2 (advisory flock on Unix, LockFileEx on Windows). Stale lock files (from kill -9) auto-clear because the OS releases the advisory lock when the holder process dies. A minimal sketch follows this list.
  • SinkConfidence enum. A four-way categorical attached to every SinkMatch: HighConfidence (regex precision + reachable callers), StructurallyReachable (substring precision + reachable callers), PatternOnly (no inbound callers / dead code), LibraryNoise (known-noisy pattern on this language). Lets agents skip the LibraryNoise bucket without spending tokens on it.
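
A minimal sketch of the startup guard via fs2, the crate named above. The path handling and error text are simplified; the shipped error also reports the holder PID and start time, which needs bookkeeping not shown here.

use std::fs::{File, OpenOptions};

use fs2::FileExt; // advisory flock on Unix, LockFileEx on Windows

fn acquire_brain_lock(path: &str) -> std::io::Result<File> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    // Fails fast if another argosbrain-mcp process holds the brain. A holder
    // killed with kill -9 is harmless: the OS drops the lock with the process.
    file.try_lock_exclusive().map_err(|_| {
        std::io::Error::other("brain.bin is locked by another argosbrain-mcp process")
    })?;
    Ok(file) // keep the handle (and the lock) alive for the process lifetime
}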

07 · Limitations

What this method does not claim.

  • Control flow only. The engine answers reachability over the call-graph: is there a structural path from source to sink within depth N? It does not perform field-level data-flow taint analysis. For that, pair the engine with Semgrep Pro or CodeQL — Kubernetes already runs both upstream, and we recommend the same pairing for any production deployment that needs taint-flow guarantees.
  • Static analysis. Reflection (reflect.Call), runtime configuration injection, KEP-driven feature flags, dynamic plugin loading, and importlib / getattr dynamic-dispatch patterns are invisible to AST-walkers. The skill flags these explicitly via the analysis_blind_spots field on ControlFlowPathReport when it sees them.
  • Findings are candidates, not confirmed CVEs. Disclosure of real vulnerabilities goes through the Kubernetes Security Response Committee. This paper reports structural evidence about the codebase, scoped to the call-graph our engine constructed at the v1.32.0 source ingest.
  • One-off measurement. The numbers reported are for a specific commit, a specific driver model (Opus 4.7), specific local hardware (Apple Silicon M-series), and a specific setting of the agent prompt. Reproductions on other configurations will produce slightly different token counts and wall-clock times. The methodology in §02 is the contract; the numbers in §03–05 are exemplars.
  • Compute-bound on first ingest. The 60-second initial ingest is dominated by tree-sitter parse + petgraph construction, both of which are CPU-bound rather than I/O-bound. On older hardware first-ingest will be longer; we measured ~110 seconds on a 2018-vintage Intel Mac. Subsequent runs on the warm brain are unaffected.
  • Library gaps will recur. The four gaps fixed in v0.8.9 are gaps we now know about. Running the same audit against a non-Go monorepo (e.g. a Rust kernel or a Java microservice mesh) will surface a different set; we anticipate publishing a follow-up paper after applying the engine to a Rust corpus.

08 · Conclusion

What we believe these runs demonstrate.

Two propositions emerge from the data above. First, an in-process Rust code-memory engine, exposed via MCP and driven by a frontier LLM, can produce reachability evidence over a 17,000-file corpus in under 90 seconds end-to-end and at total API cost below $0.50. The wall-clock and cost regime is qualitatively different from grep-then-LLM-then-summarise pipelines, even after accounting for the inflated naive baseline; the realistic-baseline reduction is still nearly two orders of magnitude.

Second, the same engine, on the same brain, can drive an architectural code tour that surfaces the structural spine, the central reactive primitive, and the human-versus-machine naming convention in 11 queries and 6 seconds. The conclusions an agent reaches with structural memory are not approximations of conclusions a flat-RAG agent reaches with more tokens — they are conclusions a flat-RAG agent cannot reach at all without simulating a graph traversal in the LLM context window.

Limitations are real. Field-level taint analysis is out of scope. Reflection-heavy and config-driven dispatch is invisible. Library patterns drift on new languages and require ongoing curation, as the four gaps fixed in v0.8.9 demonstrate. We publish those gaps because if a CISO is going to evaluate this engine, they should see what we already know is wrong before their team finds it on a Friday afternoon. Trust is built by disclosing, not by hiding.

The v0.8.9 release accompanying this paper is open for direct reproduction. The engine, the binary, the install path, the security-review skill that drove Run A, and the code-tour skill that drove Run B are all on GitHub. Anyone with a Mac or Linux box, a Claude Code or Cursor install, and 90 seconds can ingest Kubernetes v1.32.0 and run the same workflow we did. We invite the comparison.

Cite as
Jibleanu, A. (2026). ArgosBrain on Kubernetes 1.32.0:
Two Live Runs of an MCP-Served Code-Memory Engine.
Neurogenesis Technical Report.
https://argosbrain.com/papers/argos-vs-kubernetes
Reproduce
curl -fsSL https://argosbrain.com/install | sh
git clone https://github.com/kubernetes/kubernetes
cd kubernetes && git checkout v1.32.0
argosbrain ingest .
# In Claude Code: /security-review
# Or:           /argos-code-tour