Two case studies · 2026-04-25

We pointed an AI at Kubernetes 1.32.0
twice. Total cost: 44¢.

Same codebase — 17,171 files, 303,722 symbols, 2,245,124 call-graph edges. Two different jobs. Security audit: 22 sink categories triaged, zero reachable critical findings, library gaps publicly disclosed. 70 seconds, $0.33. Architectural onboarding: the AI deduced the engineering culture of a 6-month-onboarding monolith. 6 seconds, $0.11. Both runs reproduce on your laptop.

17,171
source files in scope
303,722
graph nodes (symbols)
2,245,124
call-graph edges
30,599
function definitions
22
sink categories triaged
0
reachable critical findings
$0.44
total cost (both runs)
~10 s
re-scan on warm cache

01 · Why we ran this

To stress-test ArgosBrain on the largest open-source codebase that matters.

Most security tools demo on a 200-file Express app and call it done. Kubernetes is the opposite end of the spectrum — 17,171 files, 303,722 symbols, audited continuously by Google, Red Hat, the CNCF Security TAG, plus Snyk and Semgrep upstream. If our engine produces noise here, no enterprise will trust it. If it produces clean reachability evidence, we have something real.

This page reports what happened, including the patterns where our default sink library was too coarse (and how v0.8.9 ships a fix). We publish the gaps because if you're a CISO evaluating this, you should see what we already know is wrong before your team finds it on a Friday afternoon.

02 · Case study 01 · Security audit

Zero critical findings reachable from any source.

The triage layer scored every potential sink against four signals — pattern precision, inbound caller count, source-to-sink path within depth 8, and library-noise heuristics. All 18 high-scoring hits were structurally unreachable from any tracked HTTP entry, CLI handler, env-var reader, or file-read syscall.
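The reachability claim reduces to a depth-bounded graph search. A minimal Python sketch of the idea (not the production engine — the function, node names, and graph shape here are illustrative):

```python
from collections import deque

def reachable_sinks(call_graph, sources, sinks, max_depth=8):
    """Depth-bounded BFS: which sinks have a structural path from any source?

    call_graph: dict mapping caller -> list of callees.
    Returns {sink: depth} for every sink reached within max_depth hops.
    """
    hits = {}
    seen = set(sources)
    queue = deque((s, 0) for s in sources)
    while queue:
        node, depth = queue.popleft()
        if node in sinks and node not in hits:
            hits[node] = depth
        if depth == max_depth:
            continue  # do not expand past the depth budget
        for callee in call_graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return hits

# Toy graph: an HTTP handler reaches execCommand in 2 hops; evalCEL is only
# reachable from a node we never treat as a source, so it never appears.
graph = {
    "httpHandler": ["parseRequest"],
    "parseRequest": ["execCommand"],
    "cronJob": ["evalCEL"],
}
print(reachable_sinks(graph, {"httpHandler"}, {"execCommand", "evalCEL"}))
# -> {'execCommand': 2}
```

A "CRITICAL — 0" verdict corresponds to this function returning an empty dict for the rce and cmd_injection sink sets.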

The full bucket breakdown:

  • 🔴 CRITICAL — 0. Triage scored every hit ≤ 0.55. Reachability scan found 0 paths from any tracked source to any rce or cmd_injection sink within depth 8.
  • 🟠 HIGH — 0 reachable. 10 eval( RCE hits + 8 sh -c command-injection hits triaged at score 0.55. All 18 structurally unreachable. Every eval( match was a Go method name (EvalUser, EvalClaimMapping) — CEL evaluation, not Python eval. Every sh -c hit was inside test/e2e/.
  • 🟡 MEDIUM — 0 reachable production findings. 10 InsecureSkipVerify sites — all liveness/readiness probes against pod-local self-signed certs, kubelet container lifecycle clients with explicit "must not include credentials" comments, kubeadm bootstrap before TLS trust is established. Each documented inline by upstream.
  • 🪦 DEAD CODE — 0 candidates. Every symbol has at least one caller in the ingested graph. Expected for a mature monorepo with full e2e coverage.
  • ✅ VERIFIED CLEAN. 0 SSRF candidates. 0 SQLi (etcd-backed, no SQL). 0 XSS. 0 path traversal. 0 deserialisation gadgets. 0 hardcoded secrets in production paths. 0 timing-attack comparisons. 0 JWT alg=none. 0 cookie insecure flags. 0 unsafe Rust (Go codebase). 0 ReDoS catastrophic regex.

This is not a clean bill of health for Kubernetes — the Kubernetes Security Response Committee handles real disclosures. It is structural evidence that within the call-graph our engine built, no attacker-controlled input reaches a dangerous sink in eight or fewer hops. That's the claim, that's the scope.

03 · Case study 02 · Architectural onboarding

Onboard an AI to a 2.2-million-edge monolith in 6 seconds.

Understanding the Kubernetes monolith takes a human engineer six months. Forcing a standard AI agent to read it costs ~85 million tokens and endless grep timeouts, and guarantees context rot.

We connected Claude Code to ArgosBrain and asked for a complete architectural tour of Kubernetes v1.32.0. Because the agent could query the deterministic graph instead of reading text, it didn't just list files — it deduced the system's entire engineering culture.

┌── KUBERNETES 1.32.0 GRAPH ─────────────────────┐
│ Nodes (Symbols):     303,722                   │
│ Edges (Call-graph):  2,245,124                 │
│ Source Files:         17,171                   │
│ Function defs:        30,599                   │
│                                                │
│ Argos queries:        11                       │
│ Wall-clock time:      ~6 seconds               │
└────────────────────────────────────────────────┘

What structural memory actually reveals

  • The architectural spine. By tracing inbound and outbound edges, the AI established from edge counts alone that staging/.../apimachinery and client-go are the core libraries with the highest fan-in, while the cmd/* binaries are strictly leaf nodes. No human told it; the call-graph did.
  • The heartbeat. Out of 30,599 functions, the agent pinpointed NewSharedIndexInformer as the central reactive primitive. Its verdict to a future developer: "Once you grok this, you grok 90% of the controllers." A typical RAG would have needed to summarise 200 controller files to reach the same conclusion. The graph reaches it in one query.
  • The human convention behind the noise. Naming-convention analysis surfaced 58% non-conforming function names. ArgosBrain didn't blindly report drift — it observed that the offending names were exclusively in zz_generated_*.go files (kube's machine-generated deepcopy / conversion code). It correctly excluded those and surfaced the actual human convention: camelCase for private helpers. So the AI writes code that matches what humans wrote, not what code-generators emit.
  • Graceful degradation under latency pressure. When the community-detection algorithm threatened the sub-millisecond P99 budget on a 300K-node graph, the engine safely fell back to directory-shape heuristics — keeping the agent unblocked instead of timing out. Determinism is preserved; the result is labelled with the heuristic-source so an auditor can tell.
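The spine-vs-leaf deduction is just fan-in/fan-out counting over the edge list. A hedged sketch with a toy graph (package names abbreviated; the real analysis runs over 2.2M edges):

```python
from collections import Counter

def fan_metrics(edges):
    """Count inbound (fan-in) and outbound (fan-out) call edges per package."""
    fan_in, fan_out = Counter(), Counter()
    for caller, callee in edges:
        fan_out[caller] += 1
        fan_in[callee] += 1
    return fan_in, fan_out

# Toy edge list shaped like the finding: libraries absorb calls, cmd/* only emits.
edges = [
    ("cmd/kube-apiserver", "staging/apimachinery"),
    ("cmd/kubelet", "client-go"),
    ("cmd/kubectl", "staging/apimachinery"),
    ("client-go", "staging/apimachinery"),
]
fan_in, fan_out = fan_metrics(edges)
core = max(fan_in, key=fan_in.get)               # highest fan-in -> the spine
leaves = sorted(n for n in fan_out if fan_in[n] == 0)  # never called -> leaf binaries
print(core)    # -> staging/apimachinery
print(leaves)  # -> ['cmd/kube-apiserver', 'cmd/kubectl', 'cmd/kubelet']
```

One pass over the edges, no file reads — which is why the query answers in milliseconds instead of megatokens.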
┌── THE TOKEN LEDGER (Code Tour) ────────────────┐
│ Naive RAG / glob baseline:  ~85,800,000 tok    │
│ ArgosBrain graph traversal:      ~7,500 tok    │
│                                                │
│ Token reduction:                  99.99 %      │
│ Total API cost:                    $0.11       │
└────────────────────────────────────────────────┘

The gap between "list every file" and "deduce the engineering culture" is the gap between a search engine and a reasoning substrate. ArgosBrain is the substrate; the LLM is the reasoner; together they cover ground neither can cover alone.

04 · The economics

$0.44 for both runs. Every PR, not every quarter.

The agent driving the security audit (Claude Opus 4.7) consumed approximately 22,000 tokens across 33 MCP tool calls and 8 regex fallback patterns — $0.33. The Code Tour run consumed approximately 7,500 tokens across 11 MCP queries — $0.11. Both at Opus input pricing of $15 per million tokens.

Two baselines for comparison:

Approach                                    Tokens        Cost (Opus 4.7)   Reduction
Naive: read every file
  (5K tok/file × 17,171 files)              ~85,800,000   $1,287            99.97%
Realistic agent: grep + selective reads
  (~150 files × 5K tok/file)                ~750,000      $11.25            97.1%
ArgosBrain (this run)
  (graph + reachability + selective reads)  22,000        $0.33             —

We publish both baselines on purpose. The 99.97% headline number compares against the worst-case "read every file" approach — useful as a ceiling, but no real agent does that. The 97.1% comparison is against a competent agent using grep first; that's the number that survives technical scrutiny when somebody on Twitter wants to demolish the claim. Both still place ArgosBrain in a different cost regime entirely.

The unlock is not "we made AI cheaper". The unlock is this becomes a per-PR check, not a quarterly audit. At $0.33 a run, you can wire it into CI on every push and the bill is still under the cost of a single coffee per developer per day.

05 · The second scan

First scan: 70 seconds. Every scan after: ~10 seconds.

The 70-second wall-clock is dominated by the initial ingest — tree-sitter parses 17,171 files, builds the symbol graph, computes inbound call counts, and stores the brain.bin snapshot. About 60 seconds of that is one-time work.

On a warm brain (a re-scan after a code change), only files whose content hash changed are re-parsed. A typical PR touches 5–50 files. Re-ingest of that delta is sub-second; the security pass on the updated graph is roughly the remaining 10 seconds.
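The warm-cache behaviour comes down to comparing content hashes against the stored snapshot. A minimal sketch of the delta step (hash choice and data shapes are illustrative, not the brain.bin format):

```python
import hashlib

def changed_files(old_hashes, files):
    """Return files whose content hash differs from the stored snapshot.

    old_hashes: {path: sha256 hex} from the previous ingest.
    files: {path: bytes} current working-tree contents.
    """
    delta = []
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if old_hashes.get(path) != digest:
            delta.append(path)  # only these get re-parsed
    return delta

snapshot = {"a.go": hashlib.sha256(b"package a").hexdigest(),
            "b.go": hashlib.sha256(b"package b").hexdigest()}
tree = {"a.go": b"package a",             # unchanged -> skipped
        "b.go": b"package b // edited"}   # changed   -> re-parsed
print(changed_files(snapshot, tree))  # -> ['b.go']
```

On a typical PR the delta is 5-50 paths, so re-parse cost scales with the change, not the repo.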

This is what makes the per-PR economics work. You don't pay 70 seconds on every push. You pay 70 seconds once, then ~10 seconds per change after that, until the codebase fundamentally restructures itself.

06 · Library gaps · published on purpose

Where our default sink library was too coarse — and what v0.8.9 ships to fix it.

The agent driving the audit ran our 22 categories and called out four places where our shipping pattern set was either noisier than acceptable or silent when it should have spoken. We treat these as bug reports, not embarrassments. The agent that audits your code should be able to audit our engine — that's how trust gets built.

Gap 1 · buffer_overflow / gets(

Substring gets( matched 158 Go methods.

Go convention produces many method names ending in gets( — PodDisruptionBudgets(, GetTargets( — plus *Getter helper types. Our buffer-overflow rule fired on the bare substring without a word-boundary check, producing 158 false positives on the K8s corpus.

v0.8.9 fix. The pattern now requires a leading word boundary (\bgets\() and a per-language allowlist (C / C++ only — there is no Go gets). The same fix applies to strcpy( and sprintf(. A test in security_sinks::tests::gets_pattern_does_not_match_go_methods guards the regression.
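The before/after behaviour is easy to reproduce. A hedged recreation in Python's regex dialect (the shipped patterns may differ in detail):

```python
import re

old = re.compile(r"gets\(")    # bare substring: fires on Go getters
new = re.compile(r"\bgets\(")  # v0.8.9 shape: word boundary on the left

samples = ["obj.PodDisruptionBudgets(ns)",  # Go getter -- false positive before
           "GetTargets(ctx)",               # same
           "gets(buf);"]                    # real C gets() call

print([s for s in samples if old.search(s)])  # all three match
print([s for s in samples if new.search(s)])  # -> ['gets(buf);']
```

The \b fails inside Budgets( and Targets( because the character before the g is a letter, but matches the stand-alone C call.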

Gap 2 · rce / eval(

Substring eval( matched Go method EvalUser and CEL evaluators.

Common-expression-language evaluation methods (EvalUser, EvalClaimMapping) and a sync-track lazy getter all triggered the RCE pattern. None of them are Python's eval — they're typed method calls.

v0.8.9 fix. Word-boundary on the leading edge so EvalUser( no longer matches the substring eval(. Real Python eval(user_input) still matches because the preceding character is whitespace or punctuation.
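Same shape of fix, demonstrated on the eval( case — assuming (this is our reconstruction, not the shipped pattern set) the pre-fix trip-wire was a loose case-insensitive substring:

```python
import re

old = re.compile(r"eval", re.I)  # fires on any *Eval* method name
new = re.compile(r"\beval\(")    # boundary + call paren: dynamic eval only

samples = ["u := authz.EvalUser(claims)",  # CEL evaluation, not RCE
           "m.EvalClaimMapping(ctx, c)",   # same
           "result = eval(user_input)"]    # genuine dynamic eval

print([s for s in samples if old.search(s)])  # all three
print([s for s in samples if new.search(s)])  # -> ['result = eval(user_input)']
```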

Gap 3 · cloud_api_key / key- prefix

Substring key- matched 128 CLI flags and field names.

The Mailgun-key prefix is genuinely key-<32 hex>, but our substring trip-wire fired on every --key-file, signing-key-id, and key-version in the K8s codebase. The companion regex (key-[a-f0-9]{32}) catches real keys; the substring was just adding noise above it.

v0.8.9 fix. Word-boundary on the substring (\bkey-) so the regex pass remains the source of truth for the strict shape, and the substring catches only the stand-alone prefix. A future v0.9.x will add Shannon-entropy filtering on the value to push precision higher.
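The strict companion regex already has the precision the substring lacked, and a Shannon-entropy gate is a natural next filter. A sketch of both (the regex shape is from the text above; the entropy function and threshold are our illustration of the planned v0.9.x idea):

```python
import math
import re

STRICT = re.compile(r"\bkey-[a-f0-9]{32}\b")  # Mailgun-shaped key

def shannon_entropy(s):
    """Bits per character; random hex sits near 4.0, English prose nearer 3.0."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

samples = ["--key-file=/etc/pki/tls.key",                    # CLI flag
           "signing-key-id: v2",                             # field name
           "mg = 'key-0123456789abcdef0123456789abcdef'"]    # key-shaped value

hits = [s for s in samples if STRICT.search(s)]
print(hits)  # only the Mailgun-shaped value survives
print(round(shannon_entropy("0123456789abcdef0123456789abcdef"), 2))  # -> 4.0
```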

Gap 4 · weak_crypto · crypto/md5 imports

Engine returned 0 hits for weak_crypto. Regex fallback found 7.

The regex fallback in the audit picked up 7 production imports of crypto/md5 and crypto/sha1. All seven were used for cache keying / endpoint hashing, not for security primitives — appropriate uses, but the engine should have surfaced them so a human reviewer could decide. Silent zero is worse than a flagged-and-reviewed zero.

v0.8.9 fix. Surface every weak-crypto import as a SinkConfidence::PatternOnly finding (the new confidence enum lets agents skip these in fast-path triage but they still show up in the report). v0.9.x will add an "is the hash used in an authentication context" follow-up classifier.

What v0.8.9 also ships. Beyond the four pattern-library fixes, this release adds an OS-level file lock on brain.bin — preventing two argosbrain-mcp processes from concurrently writing to the same brain (a corruption window observed during a customer's Cursor restart). And a new SinkConfidence enum (HighConfidence / StructurallyReachable / PatternOnly / LibraryNoise) so agents can skip known-noisy categories without spending tokens reading them.
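The agent-side payoff of the confidence enum is cheap filtering before any tokens are spent. A Python sketch of that triage step (the enum ships in the Rust engine; tier names are from the release notes above, the findings and filtering policy here are illustrative):

```python
from enum import Enum

class SinkConfidence(Enum):
    HIGH_CONFIDENCE = "HighConfidence"
    STRUCTURALLY_REACHABLE = "StructurallyReachable"
    PATTERN_ONLY = "PatternOnly"
    LIBRARY_NOISE = "LibraryNoise"

# Fast-path triage: only spend reasoning tokens on findings that matter now.
FAST_PATH = {SinkConfidence.HIGH_CONFIDENCE,
             SinkConfidence.STRUCTURALLY_REACHABLE}

findings = [("crypto/md5 import", SinkConfidence.PATTERN_ONLY),
            ("sh -c in handler", SinkConfidence.STRUCTURALLY_REACHABLE),
            ("gets( in Go getter", SinkConfidence.LIBRARY_NOISE)]

triage = [name for name, conf in findings if conf in FAST_PATH]
print(triage)  # -> ['sh -c in handler']
```

PatternOnly and LibraryNoise findings still land in the report; they just skip the expensive reasoning loop.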

07 · Caveats

What this audit does not claim.

  • Control flow only. ArgosBrain answers "is there a structural path from source to sink within depth N". For field-level data-flow analysis (does req.body.userId reach the sink?) pair with Semgrep Pro or CodeQL — Kubernetes runs both upstream.
  • Static analysis. Reflection (reflect.Call), runtime configuration injection, KEP-driven feature flags, and dynamic plugin loading are invisible to AST-walkers. The skill flags these explicitly in its analysis_blind_spots field when it sees them.
  • Findings are candidates, not confirmed CVEs. The Kubernetes Security Response Committee handles disclosed vulnerabilities. We are reporting structural evidence about the codebase as it exists at v1.32.0.
  • One-off measurement. This was Opus 4.7 driving the agent, on one specific commit. Different agents, different settings, different commits will produce slightly different numbers. The methodology is in the paper so anyone can reproduce.

08 · Try it

Run the same scan on your own codebase.

One-line install, then point ArgosBrain at any repo on your disk:

curl -fsSL https://argosbrain.com/install | sh
cd ~/your-project
argosbrain ingest .
# In Claude Code / Cursor: invoke the security-review skill

Free tier covers unlimited projects up to 20k symbols each. Pro at $19/seat/month removes the cap and adds the Pro web dashboard. Team Defender ($79/seat) unlocks the GitHub Action for CI integration — coming v0.9.0.