Same codebase — 17,171 files, 303,722 symbols, 2,245,124 call-graph edges. Two different jobs. Security audit: 22 sink categories triaged, zero reachable critical findings, library gaps publicly disclosed. 70 seconds, $0.33. Architectural onboarding: the AI deduced the engineering culture of a 6-month-onboarding monolith. 6 seconds, $0.11. Both runs reproduce on your laptop.
Most security tools demo on a 200-file Express app and call it done. Kubernetes is the opposite end of the spectrum — 17,171 files, 303,722 symbols, audited continuously by Google, Red Hat, the CNCF Security TAG, plus Snyk and Semgrep upstream. If our engine produces noise here, no enterprise will trust it. If it produces clean reachability evidence, we have something real.
This page reports what happened, including the patterns where our default sink library was too coarse (and how v0.8.9 ships a fix). We publish the gaps because if you're a CISO evaluating this, you should see what we already know is wrong before your team finds it on a Friday afternoon.
The triage layer scored every potential sink against four signals — pattern precision, inbound caller count, source-to-sink path within depth 8, and library-noise heuristics. Of the 18 high-score hits found, every one was structurally unreachable from any tracked HTTP entry, CLI handler, env-var reader, or file-read syscall.
The full bucket breakdown:
- **RCE / command injection (sink within depth 8).** 10 eval( RCE hits + 8 sh -c command-injection hits triaged at score 0.55. All 18 structurally unreachable. Every eval( match was a Go method name (EvalUser, EvalClaimMapping) — CEL evaluation, not Python eval. Every sh -c hit was inside test/e2e/.
- **InsecureSkipVerify.** Every flagged site was a liveness/readiness probe against pod-local self-signed certs, a kubelet container-lifecycle client with an explicit "must not include credentials" comment, or kubeadm bootstrap before TLS trust is established. Each is documented inline by upstream.

This is not a clean bill of health for Kubernetes — the Kubernetes Security Response Committee handles real disclosures. It is structural evidence that, within the call-graph our engine built, no attacker-controlled input reaches a dangerous sink in eight or fewer hops. That's the claim; that's the scope.
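The depth-8 reachability claim can be sketched as a bounded breadth-first search over the call graph. This is a minimal toy illustration with invented symbol names, not the engine's implementation:

```python
from collections import deque

def reachable_within(graph, sources, sinks, max_depth=8):
    """Return the sinks reachable from any source in <= max_depth call-graph hops.
    graph maps a symbol to the list of symbols it calls."""
    sinks = set(sinks)
    seen = set(sources)
    frontier = deque((s, 0) for s in sources)
    hits = set()
    while frontier:
        node, depth = frontier.popleft()
        if node in sinks:
            hits.add(node)
        if depth == max_depth:
            continue                      # bound the search at the depth budget
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                frontier.append((callee, depth + 1))
    return hits

# Toy graph: an HTTP handler reaches exec_sink, but the test-only sh_sink
# hangs off a utility no tracked entry point ever calls.
g = {"handler": ["helper"], "helper": ["exec_sink"], "test_util": ["sh_sink"]}
print(reachable_within(g, ["handler"], ["exec_sink", "sh_sink"]))  # {'exec_sink'}
```

A sink that never shows up in this set — like sh_sink above — is what the report calls "structurally unreachable."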
Understanding the Kubernetes monolith takes a human engineer six months. Forcing a standard AI agent to read it costs roughly 85 million tokens, endless grep timeouts, and guaranteed context rot.
We connected Claude Code to ArgosBrain and asked for a complete architectural tour of Kubernetes v1.32.0. Because the agent could query the deterministic graph instead of reading text, it didn't just list files — it deduced the system's entire engineering culture.
```
┌── KUBERNETES 1.32.0 GRAPH ─────────────────────┐
│ Nodes (Symbols):       303,722                 │
│ Edges (Call-graph):    2,245,124               │
│ Source Files:          17,171                  │
│ Function defs:         30,599                  │
│                                                │
│ Argos queries:         11                      │
│ Wall-clock time:       ~6 seconds              │
└────────────────────────────────────────────────┘
```
- **Core vs. leaf.** staging/.../apimachinery and client-go are the core libraries with the highest fan-in, while the cmd/* binaries are strictly leaf nodes. No human told it; the call-graph did.
- **The central primitive.** It identified NewSharedIndexInformer as the central reactive primitive. Its verdict to a future developer: "Once you grok this, you grok 90% of the controllers." A typical RAG would have needed to summarise 200 controller files to reach the same conclusion. The graph reaches it in one query.
- **Human conventions, not generator conventions.** The naming survey correctly excluded zz_generated_*.go files (kube's machine-generated deepcopy / conversion code) and surfaced the actual human convention: camelCase for private helpers. So the AI writes code that matches what humans wrote, not what code-generators emit.

```
┌── THE TOKEN LEDGER (Code Tour) ────────────────┐
│ Naive RAG / glob baseline:   ~85,800,000 tok   │
│ ArgosBrain graph traversal:      ~7,500 tok    │
│                                                │
│ Token reduction:                  99.99 %      │
│ Total API cost:                   $0.11        │
└────────────────────────────────────────────────┘
```
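The core-vs-leaf verdict falls out of a plain inbound-edge count over the call graph. A toy sketch with invented edges (not the real Kubernetes data):

```python
from collections import Counter

def fan_in(edges):
    """Count inbound call-graph edges per symbol. High fan-in marks core
    libraries; zero fan-in marks leaf binaries nothing else depends on."""
    return Counter(callee for _caller, callee in edges)

# Invented miniature of the pattern the agent saw: two cmd/* binaries
# both call into client-go, which calls into apimachinery.
edges = [
    ("cmd/kubelet.main", "client-go.NewForConfig"),
    ("cmd/kube-apiserver.main", "client-go.NewForConfig"),
    ("client-go.NewForConfig", "apimachinery.NewScheme"),
]
counts = fan_in(edges)
print(counts["client-go.NewForConfig"])  # 2 inbound callers: core library
print(counts["cmd/kubelet.main"])        # 0 inbound callers: leaf binary
```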
The gap between "list every file" and "deduce the engineering culture" is the gap between a search engine and a reasoning substrate. ArgosBrain is the substrate; the LLM is the reasoner; together they cover ground neither can cover alone.
The agent driving the security audit (Claude Opus 4.7) consumed approximately 22,000 tokens across 33 MCP tool calls and 8 regex fallback patterns — $0.33. The Code Tour run consumed approximately 7,500 tokens across 11 MCP queries — $0.11. Both at Opus input pricing of $15 per million tokens.
Two baselines for comparison:
| Approach | Tokens | Cost (Opus 4.7) | Reduction |
|---|---|---|---|
| Naive: read every file (5K tok/file × 17,171 files) | ~85,800,000 | $1,287 | 99.97% |
| Realistic agent: grep + selective reads (~150 files × 5K tok/file) | ~750,000 | $11.25 | 97.1% |
| ArgosBrain (this run): graph + reachability + selective reads | 22,000 | $0.33 | — |
We publish both baselines on purpose. The 99.97% headline number compares against the worst-case "read every file" approach — useful as a ceiling, but no real agent does that. The 97.1% comparison is against a competent agent using grep first; that's the number that survives technical scrutiny when somebody on Twitter wants to demolish the claim. Both still place ArgosBrain in a different cost regime entirely.
The unlock is not "we made AI cheaper". The unlock is this becomes a per-PR check, not a quarterly audit. At $0.33 a run, you can wire it into CI on every push and the bill is still under the cost of a single coffee per developer per day.
The 70-second wall-clock is dominated by the initial ingest — tree-sitter parses 17,171 files, builds the symbol graph, computes inbound call counts, and stores the brain.bin snapshot. About 60 seconds of that is one-time work.
On a warm brain (a re-scan after a code change), only files whose content hash changed are re-parsed. A typical PR touches 5–50 files. Re-ingest of that delta is sub-second; the security pass on the updated graph is roughly the remaining 10 seconds.
This is what makes the per-PR economics work. You don't pay 70 seconds on every push. You pay 70 seconds once, then ~10 seconds per change after that, until the codebase fundamentally restructures itself.
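The warm-brain delta detection amounts to comparing content hashes against the last ingest. A minimal sketch (the real engine's storage format and hashing details differ):

```python
import hashlib

def changed_files(files, stored_hashes):
    """Return only the files whose content hash differs from the last ingest,
    updating the stored hashes as we go. files maps path -> source text."""
    delta = []
    for path, content in files.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if stored_hashes.get(path) != digest:
            delta.append(path)
            stored_hashes[path] = digest
    return delta

store = {}
v1 = {"a.go": "package a", "b.go": "package b"}
print(changed_files(v1, store))   # first ingest: every file is new
v2 = {"a.go": "package a  // edited", "b.go": "package b"}
print(changed_files(v2, store))   # warm re-scan: only a.go re-parses
```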
The agent driving the audit ran our 22 categories and called out four places where our shipping pattern set was either noisier than acceptable or silent when it should have spoken. We treat these as bug reports, not embarrassments. The agent that audits your code should be able to audit our engine — that's how trust gets built.
**gets( matched 158 Go methods.** Go convention names interface getters with a trailing s — PodDisruptionBudgets, GetTargets, *Getter. Our buffer-overflow rule fired on the substring without a word-boundary check, producing 158 false positives on the K8s corpus.
**v0.8.9 fix.** The pattern now requires a leading word boundary (`\bgets\(`) and a per-language allowlist (C / C++ only — there is no Go gets). The same fix applies to strcpy( and sprintf(. A test in security_sinks::tests::gets_pattern_does_not_match_go_methods guards against regression.
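The effect of the word boundary is easy to demonstrate. The two patterns below are illustrative reconstructions of the coarse rule and the fixed rule, not the engine's exact source:

```python
import re

coarse = re.compile(r"gets\(")    # v0.8.x-style substring match
fixed = re.compile(r"\bgets\(")   # v0.8.9-style leading word boundary

go_line = "budgets := c.PodDisruptionBudgets(ns)"  # Go getter with a trailing 's'
c_line = "gets(buffer);"                           # a real C gets() call

print(bool(coarse.search(go_line)))  # True  — the old false positive
print(bool(fixed.search(go_line)))   # False — 'Budgets(' has no boundary before 'gets'
print(bool(fixed.search(c_line)))    # True  — the real sink is still caught
```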
**eval( matched Go methods and CEL evaluators.** Common-expression-language evaluation methods (EvalUser, EvalClaimMapping) and a sync-track lazy getter all triggered the RCE pattern. None of them is Python's eval — they're typed method calls.
**v0.8.9 fix.** A word boundary on the leading edge, so EvalUser( no longer matches the substring eval(. Real Python eval(user_input) still matches because the preceding character is whitespace or punctuation.
**key- matched 128 CLI flags and field names.** The Mailgun key prefix is genuinely key-<32 hex>, but our substring trip-wire fired on every --key-file, signing-key-id, and key-version in the K8s codebase. The companion regex (key-[a-f0-9]{32}) catches real keys; the substring was just adding noise on top of it.
**v0.8.9 fix.** A word boundary on the substring (`\bkey-`), so the regex pass remains the source of truth for the strict shape and the substring catches only the stand-alone prefix. A future v0.9.x will add Shannon-entropy filtering on the value to push precision higher.
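The strict companion regex is the part doing the real work: it rejects flag-style noise while matching the Mailgun hex shape. An illustrative reconstruction (the key below is fabricated sample data):

```python
import re

# Reconstruction of the strict pattern described above: 'key-' + 32 hex chars.
strict = re.compile(r"key-[a-f0-9]{32}")

noise = ["--key-file=/etc/pki/tls.key", "signing-key-id: rotate", "key-version: 2"]
real = 'api_key = "key-0123456789abcdef0123456789abcdef"'  # fake 32-hex sample

print([bool(strict.search(s)) for s in noise])  # [False, False, False]
print(bool(strict.search(real)))                # True
```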
**crypto/md5 imports went unsurfaced.** The regex fallback in the audit picked up 7 production imports of crypto/md5 and crypto/sha1. All seven were used for cache keying / endpoint hashing, not as security primitives — appropriate uses, but the engine should have surfaced them so a human reviewer could decide. A silent zero is worse than a flagged-and-reviewed zero.
**v0.8.9 fix.** Every weak-crypto import is now surfaced as a SinkConfidence::PatternOnly finding (the new confidence enum lets agents skip these in fast-path triage, but they still show up in the report). v0.9.x will add an "is this hash used in an authentication context" follow-up classifier.
**What v0.8.9 also ships.** Beyond the four pattern-library fixes, this release adds an OS-level file lock on brain.bin — preventing two argosbrain-mcp processes from concurrently writing the same brain (a corruption window observed during a customer's Cursor restart) — and a new SinkConfidence enum (HighConfidence / StructurallyReachable / PatternOnly / LibraryNoise) so agents can skip known-noisy categories without spending tokens reading them.
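An OS-level exclusive lock around a snapshot write can be sketched in a few lines. This is a POSIX-only flock illustration of the general technique, not the shipping implementation (whose mechanism and file layout are not documented here):

```python
import fcntl
import os
import tempfile

def locked_write(path, data):
    """Write a snapshot under an exclusive advisory lock so a second writer
    blocks instead of interleaving bytes into the same file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # a concurrent writer waits here
        os.ftruncate(fd, 0)             # replace the old snapshot atomically-ish
        os.write(fd, data)
        os.fsync(fd)                    # data reaches disk before the lock drops
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "brain.bin")
locked_write(path, b"snapshot-v1")
print(open(path, "rb").read())  # b'snapshot-v1'
```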
For taint-level questions (does req.body.userId reach the sink?), pair ArgosBrain with Semgrep Pro or CodeQL — Kubernetes runs both upstream. Reflection (reflect.Call), runtime configuration injection, KEP-driven feature flags, and dynamic plugin loading are invisible to AST-walkers; the skill flags these explicitly in its analysis_blind_spots field when it sees them.

One-line install, then point ArgosBrain at any repo on your disk:
```sh
curl -fsSL https://argosbrain.com/install | sh
cd ~/your-project
argosbrain ingest .
# In Claude Code / Cursor: invoke the security-review skill
```
Free tier covers unlimited projects up to 20k symbols each. Pro at $19/seat/month removes the cap and adds the Pro web dashboard. Team Defender ($79/seat) unlocks the GitHub Action for CI integration — coming in v0.9.0.