Zero-Cost Graph Retrieval at Compiler-Grade Depth for AI Coding Agents

Abstract

We describe Neurogenesis, a graph-first code-memory engine that answers structural retrieval queries for AI coding agents without any LLM call on the read path. The engine is positioned as complementary to text-search primitives such as grep and to file-read inspection: free-text retrieval in comments, log strings, and non-code text remains the appropriate target for grep; structural retrieval — exact symbol resolution, member resolution, exhaustive caller enumeration, call-graph traversal — becomes the target for the graph layer described here. The engine ingests source code into a canonical-identifier graph via a tiered pipeline that selects the highest-precision indexing technology available per language — compiler-grade SCIP indexers where mature, live language-server workspaces where not, and bespoke tree-sitter semantic walkers for the long-tail remainder. File-hash content detection invalidates stale subgraphs on source-tree changes, making re-ingest cost linear in the number of changed files rather than the repository size. Ingest operates in isolated subprocesses with bounded lifetimes; a crashing language server cannot affect the retrieval hot path, which is in-process Rust reading from local bincode-serialised graph storage. We evaluate the engine on the Kubernetes 1.32 corpus of the LongMemCode benchmark — 1,456 scenarios across eight query categories — and report raw accuracy 99.244 %, P99 retrieval latency 0.366 ms, P95 0.267 ms, P50 0.008 ms, and zero monetary cost per thousand retrieval queries on a single Apple M4 Max workstation. We discuss the design-space alternatives rejected and limitations that remain.

Figure 1 — Neurogenesis component block diagram. A tiered ingest pipeline (running inside bounded-lifetime subprocesses) builds a canonical-identifier graph; file-hash invalidation keeps it current. The in-process retrieval API serves any MCP-compatible agent over stdio, with no LLM on the read path.

Introduction

AI coding agents that persist knowledge between sessions need a memory layer whose cost, latency, and accuracy match the expectations of interactive developer work. Three cost dimensions matter: dollars per query (charged by embedding or LLM API calls on the retrieval path), milliseconds per query (P99 matters more than P50 for interactive use), and staleness after source-tree changes (a refactor that renames several hundred symbols should not require re-ingesting the entire repository). Existing general-purpose memory systems for agents typically optimise one dimension at the expense of the others: dollar-cheap retrieval at the cost of LLM calls on writes; fast retrieval at the cost of accuracy on structural code queries; accurate retrieval at the cost of expensive re-ingestion.

This paper describes Neurogenesis, a memory engine specifically designed for the code-memory workload identified in companion work [Jibleanu, 2026a; Jibleanu, 2026b]. Neurogenesis optimises for the structural-query-dominated distribution coding agents actually issue, and accepts the corresponding design constraints: a graph-first storage layer, compiler-grade ingest where possible, a zero-LLM hot path, and content-hash-based incremental updates. The engine serves as the reference adapter in the LongMemCode benchmark [Jibleanu, 2026a] and is the subject of the measurements reported there.

Scope and complementary positioning. Neurogenesis is positioned as complementary to, not a replacement for, the text-search and file-read primitives an agent already has: typically grep (or ripgrep) and a Read tool. Free-text retrieval inside comments, log strings, and non-code text remains the appropriate target for grep; inspecting a known file path remains the appropriate target for Read. The graph layer described in this paper covers what those primitives cannot answer in a single call: exact symbol resolution with file-and-line precision, exhaustive caller enumeration over the inverse call graph, override and implementation chains, and a NoConfidentMatch result shape that lets the agent know the answer is absent rather than infer it from empty grep output. The intended deployment is the two layers running side-by-side: grep for surface, the graph for structure.

The contributions of this paper are: (1) a high-level architecture description of a tiered, graph-first, in-process code-memory engine; (2) a justification of each design choice against alternatives, grounded in the structural-versus-semantic taxonomy [Jibleanu, 2026b]; (3) measured operational properties — latency, footprint, re-ingest cost — for the engine running against real open-source corpora; and (4) an explicit discussion of design-space limits and open problems.

Related Work

4.1 Knowledge-graph memory for agents

Graphiti [Rasmy et al., 2025] and MemGPT / Letta [Packer et al., 2023] are the dominant graph-based agent-memory systems in production use. Both treat memory as a temporal knowledge graph of entities and labelled relations, extracted via LLM from conversational or documentary input. Graphiti requires an external graph database (Neo4j, FalkorDB, Kuzu, or Neptune); Letta maintains a tiered core/archival/recall structure edited by agent self-calls. Both pay LLM cost on write and, in Letta’s case, on read as well. Neither ingests source code as canonical-identifier graphs, and neither exposes structural-code-query primitives.

4.2 Retrieval-augmented code completion

Continue’s @codebase [Continue, 2025] parses source with tree-sitter, embeds top-level function and class bodies, and retrieves top-k chunks on demand. The chunks are text; the retrieval is semantic. Aider’s repository map [Aider, 2023] extracts tree-sitter symbols and ranks files by PageRank over reference edges, injecting the top-ranked identifiers into every prompt. Neither system builds a traversable graph of canonical identifiers, and neither supports queries such as “who overrides method m” or “which callers of function f” without fallback to text search.

4.3 Industrial code indexers

SCIP [Sourcegraph, 2023] is an open-source protocol for representing source-code indexing data. SCIP indexers exist for Rust (via rust-analyzer), Python (via a patched pyright), Go (via scip-go), TypeScript / JavaScript (scip-typescript), Java and Scala (via semanticdb and scip-java), PHP (scip-php), Ruby (scip-ruby), C# (scip-dotnet), and Dart (scip_dart). Sourcegraph uses SCIP to power cross-repository code search across billions of lines of code. SCIP is an ingestion format; it is not a memory engine, nor does it expose retrieval APIs designed for agent consumption. Neurogenesis consumes SCIP as one of its ingest backends, alongside others.

4.4 Language-server protocol indices

The Language Server Protocol [Microsoft, 2016] provides textDocument/documentSymbol and workspace/symbol as primitives that can be used to enumerate symbols in a workspace. Some language ecosystems (Kotlin, Swift) have mature language servers but no production-ready SCIP indexer. We use live LSP ingest opportunistically in those cases.

4.5 Tree-sitter-based semantic extraction

Tree-sitter [Brunsfeld, 2018] is an incremental parser-generator framework with grammars for over 100 languages. It produces concrete syntax trees; it does not perform cross-file symbol resolution, type inference, or import resolution. Using tree-sitter for semantic extraction requires per-language walker logic that maps CST nodes to canonical identifiers — a substantial engineering effort per language but the only option for languages without mature SCIP or LSP support.

4.6 Memory-evaluation landscape and benchmarking-integrity literature

The 2025–2026 wave of survey and benchmark literature on agent-memory systems frames the evaluation context for the present work. Liu et al. [Liu et al., 2026] survey memory in LLM-based agents and distinguish factual, experiential, and working memory along the formation–evolution–retrieval lifecycle, treating graph memory as a 2026 production-grade approach distinct from vector similarity over chunks. MemoryBench [Hou et al., 2025] proposes a benchmark covering four agent-memory competencies — accurate retrieval, test-time learning, long-range understanding, and selective forgetting — positioned as a multi-turn evaluation harness. Neurogenesis is positioned in this landscape as a memory engine restricted to the accurate-retrieval competency, with explicit non-coverage of test-time learning and selective forgetting (see Section 9.4).

A second body of recent work concerns benchmarking integrity for LLM-driven code tasks. Wang et al. [Wang et al., 2025] document contamination patterns in SWE-Bench Verified — including up to 34.9 % five-gram overlap between model output and benchmark solutions — and recommend temporal controls, post-cutoff repository selection, and cross-benchmark validation as defences. Section 7.6 of the present paper discusses why these critiques are structurally inapplicable to a deterministic, no-LLM retrieval engine; we cite this literature to acknowledge the broader concern and to make explicit that our methodology is not subject to it.

Design Goals

Neurogenesis is designed against four explicit goals.

G1. Structural correctness at compiler-grade depth. For every language in the target set, structural queries — does this symbol exist, list methods of a class, enumerate overrides — must return exact, reproducible answers. This rules out approximate retrieval on the structural path.

G2. Sub-millisecond P99 retrieval at laptop resource budget. Interactive coding UX lives at the tail. A memory layer that serves an agent mid-task cannot pause the user. Retrieval must be graph-local and in-process; retrieval cannot call out to external services or spawn subprocesses per query.

G3. Zero monetary cost on the retrieval path, forever. The read path must never call an LLM, never call an embedding API, never make a network request. This constrains the storage model (all structure must be pre-computed at ingest time) but removes an entire class of operational failure modes.

G4. Re-ingest cost linear in the diff, not in the repository. Developer workflows issue branch switches, rebases, and partial edits constantly. A memory engine whose ingest cost is proportional to the repository size creates back-pressure on normal git operation. Re-ingest must be O(changed files).

These four goals constrain the design space severely. Most commercial agent-memory products satisfy two or three; we argue Neurogenesis is among the first to satisfy all four on the code-memory workload, at the cost of narrowing the target domain from general memory to code specifically.

Architecture

6.1 Components

Neurogenesis consists of three components, connected in a pipeline:

Ingest pipeline: consumes a source-tree commit SHA and produces a canonical-identifier graph persisted to local on-disk storage.
Graph store: on-disk bincode-serialised graph, with an in-memory working set for query serving.
Retrieval API: exposes the structural query primitives over a stable protocol (MCP stdio for the production deployment, but the API surface is transport-independent).

A persistent file-watcher component is optional and handles the O(changed files) incremental update path.

Figure 1 in this paper shows these components as a block diagram at a level of detail that illustrates the architecture without revealing internal types.

6.2 Tiered ingest pipeline

The ingest pipeline selects one of three backend strategies per language, chosen to maximise structural precision given the tooling available for that language.

Tier 1 — Compiler-grade SCIP indexing. For languages with a mature SCIP indexer, ingest drives the indexer against the source tree. The indexer runs the language’s compiler frontend and produces a SCIP index containing canonical symbol IDs, cross-file references, containment relations, and type information. Neurogenesis parses the SCIP index and inserts its nodes and edges directly into the graph. The indexer subprocess terminates at the end of ingest; no long-lived process is required.

Tier 2 — Live language-server ingest. For languages with a mature language server but no production SCIP indexer, ingest drives the language server over LSP. The workspace is opened, textDocument/documentSymbol and workspace/symbol requests enumerate the symbols, and per-language post-processing maps LSP symbol kinds to our canonical schema. Ingest is guarded by per-file and per-session timeouts, and the language server runs as an isolated subprocess whose lifetime is bounded by the ingest run.
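To make the Tier 2 exchange concrete, the following minimal sketch frames a textDocument/documentSymbol request as it travels over a language server's stdio. The Content-Length framing and the method name come from the LSP specification; the helper names (frame_lsp_message, document_symbol_request) are hypothetical, the JSON is built by hand to stay within the standard library, and nothing here is taken from the Neurogenesis source.

```rust
/// Frame a JSON-RPC payload with the LSP base-protocol header.
/// Content-Length counts bytes of the body, not characters.
fn frame_lsp_message(json_body: &str) -> String {
    format!("Content-Length: {}\r\n\r\n{}", json_body.len(), json_body)
}

/// Build a framed `textDocument/documentSymbol` request for one file.
/// Assumption: `uri` contains no characters that need JSON escaping;
/// a real client would use a JSON library instead of string formatting.
fn document_symbol_request(id: u64, uri: &str) -> String {
    let body = format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"textDocument/documentSymbol","params":{{"textDocument":{{"uri":"{}"}}}}}}"#,
        id, uri
    );
    frame_lsp_message(&body)
}
```

A real ingest driver would first complete the initialize handshake and would read responses with the same header framing; both steps are omitted here.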

Tier 3 — Bespoke tree-sitter semantic walkers. For languages without either a SCIP indexer or a mature language server, ingest uses tree-sitter grammars augmented with per-language semantic hooks that extract canonical identifiers from the concrete syntax tree. These walkers encode language-specific structural patterns — for example, languages where functions are defined by assignment rather than declaration require recognising assignment-to-function as a function-declaration event; statement-based grammars require per-statement parsing with context preservation; languages with block-label semantics require label-aware walkers. The walkers do not perform cross-file type inference, so their output is structurally shallower than Tier 1 but considerably richer than a generic tree-sitter surface extraction.
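The assignment-to-function case can be illustrated with a toy walker. The Node type below is a hypothetical, heavily simplified stand-in for a concrete syntax tree; it is not the tree-sitter API and not the Neurogenesis walker. The sketch only shows the idea of treating an assignment whose right-hand side is a function literal as a declaration event.

```rust
/// Hypothetical, simplified syntax node (not a tree-sitter type).
#[derive(Debug)]
enum Node {
    /// `function name() { ... }`
    FunctionDecl { name: String },
    /// `target = <value>`
    Assignment { target: String, value: Box<Node> },
    /// An anonymous function expression.
    FunctionLiteral,
    /// Anything the walker does not care about.
    Other,
}

/// Collect the names that should become function identifiers.
/// An assignment counts only when its value is a function literal.
fn function_names(top_level: &[Node]) -> Vec<String> {
    let mut names = Vec::new();
    for node in top_level {
        match node {
            Node::FunctionDecl { name } => names.push(name.clone()),
            Node::Assignment { target, value }
                if matches!(**value, Node::FunctionLiteral) =>
            {
                names.push(target.clone())
            }
            _ => {}
        }
    }
    names
}
```

A generic surface extraction that only looks for declaration nodes would miss the second form entirely, which is the gap the per-language hooks are meant to close.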

The tier is selected at ingest time from the languages detected in the source tree. A single ingest run may use all three tiers in parallel across different file subsets of the same repository.
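The selection rule described in this section reduces to a precedence order over available tooling. A minimal sketch follows, assuming a hypothetical Tooling record that says which backends are mature for a language; how the engine actually detects languages and judges maturity is not specified in this paper.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Scip,       // Tier 1: compiler-grade SCIP indexer
    Lsp,        // Tier 2: live language server
    TreeSitter, // Tier 3: bespoke tree-sitter walker
}

/// Hypothetical capability record for one language's tooling.
#[derive(Debug, Clone, Copy)]
struct Tooling {
    mature_scip_indexer: bool,
    mature_language_server: bool,
}

/// Pick the highest-precision tier available for a language.
fn select_tier(tooling: Tooling) -> Tier {
    if tooling.mature_scip_indexer {
        Tier::Scip
    } else if tooling.mature_language_server {
        Tier::Lsp
    } else {
        Tier::TreeSitter
    }
}
```

Because the rule is evaluated per language, a mixed-language repository naturally fans out to several tiers within one run.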

6.3 Graph storage

The graph is a set of nodes representing canonical identifiers and a set of labelled edges representing structural relations between them. Edges carry a label from a fixed schema — containment, reference, inheritance, override, and similar — derived from the tier’s source index.

The graph is persisted on disk in a compact binary serialisation. The hot working set is mapped into memory at retrieval-server startup; cold portions are spilled to disk with an LRU-like policy. Retrieval does not allocate on the typical path: a query walks pre-materialised edges in memory and returns a set of canonical-identifier strings.
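As an illustration of the node-and-labelled-edge model, the sketch below keeps forward and inverse adjacency lists and enumerates transitive callers by breadth-first search over inverse call edges. It is a simplified stand-in rather than the engine's storage layout: the type names, the Calls label, and the identifier strings used with it are assumptions, and a HashMap of owned strings allocates where the engine is described as not allocating.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum EdgeLabel { Contains, References, Inherits, Overrides, Calls }

#[derive(Default)]
struct Graph {
    // Both directions are materialised when an edge is added,
    // so a query never has to scan the whole edge set.
    out_edges: HashMap<String, Vec<(EdgeLabel, String)>>,
    in_edges: HashMap<String, Vec<(EdgeLabel, String)>>,
}

impl Graph {
    fn add_edge(&mut self, from: &str, label: EdgeLabel, to: &str) {
        self.out_edges.entry(from.to_string()).or_default().push((label, to.to_string()));
        self.in_edges.entry(to.to_string()).or_default().push((label, from.to_string()));
    }

    /// Sources of incoming edges that carry `label`.
    fn incoming(&self, id: &str, label: EdgeLabel) -> Vec<&str> {
        self.in_edges
            .get(id)
            .into_iter()
            .flatten()
            .filter(|(l, _)| *l == label)
            .map(|(_, src)| src.as_str())
            .collect()
    }

    /// All direct and indirect callers of `id`, sorted for determinism.
    fn transitive_callers(&self, id: &str) -> Vec<String> {
        let mut seen: HashSet<&str> = HashSet::new();
        let mut queue: VecDeque<&str> = VecDeque::from([id]);
        let mut callers = Vec::new();
        while let Some(current) = queue.pop_front() {
            for caller in self.incoming(current, EdgeLabel::Calls) {
                // Skip the start node and anything already visited (cycles).
                if caller != id && seen.insert(caller) {
                    callers.push(caller.to_string());
                    queue.push_back(caller);
                }
            }
        }
        callers.sort();
        callers
    }
}
```

An unknown identifier simply has no incoming edges, so the same traversal returns an empty list instead of a nearest-neighbour guess.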

6.4 Retrieval API

The retrieval API exposes structural primitives as named operations over the graph. The operation surface in the production deployment includes symbol existence checks, member resolution, containment enumeration, caller enumeration, override enumeration, and a small number of convenience operations for common agent workflows. Each operation translates to a deterministic graph query with predictable latency profile.

The API is transport-independent: the same operations are exposed over MCP stdio for IDE integration and over an in-process Rust interface for embedded use.
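One way to picture an operation on this surface is a symbol-existence check whose return type makes absence explicit. The sketch below is illustrative only: NoConfidentMatch is the result shape named in the introduction, while SymbolIndex, Resolution, Location, and the identifiers used with them are hypothetical and do not reflect the engine's actual types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Location {
    file: String,
    line: u32,
}

/// Outcome of an exact symbol lookup.
#[derive(Debug, PartialEq, Eq)]
enum Resolution<'a> {
    /// The canonical identifier exists; here is where it is defined.
    Exact(&'a Location),
    /// The graph holds no such identifier. The caller is told so
    /// explicitly instead of receiving an empty or approximate result.
    NoConfidentMatch,
}

#[derive(Default)]
struct SymbolIndex {
    by_id: HashMap<String, Location>,
}

impl SymbolIndex {
    fn insert(&mut self, id: &str, file: &str, line: u32) {
        self.by_id
            .insert(id.to_string(), Location { file: file.to_string(), line });
    }

    /// Deterministic lookup: the same input always yields the same variant.
    fn resolve(&self, id: &str) -> Resolution<'_> {
        match self.by_id.get(id) {
            Some(location) => Resolution::Exact(location),
            None => Resolution::NoConfidentMatch,
        }
    }
}
```

An agent that receives NoConfidentMatch for an identifier it was about to use has direct evidence that the identifier does not exist in the indexed tree.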

6.5 Staleness and incremental update

On ingest, every file carries a content hash (a collision-resistant hash of the file bytes) stored alongside its canonical-identifier nodes. On re-ingest, each source file’s current hash is compared against the stored hash; files whose hashes match are skipped entirely without parsing. Files whose hashes differ have their existing subgraph removed and rebuilt. The cost of re-ingest is therefore proportional to the number of changed files, not the repository size.
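The skip-or-rebuild decision can be sketched as a pure function over stored and current hashes. Two assumptions are made for the sake of a standard-library-only example: FNV-1a 64 stands in for the collision-resistant hash (it is not collision-resistant and should not be used for this purpose in practice), and files that have disappeared from the tree are scheduled for removal, a case the text above does not spell out. The names plan_reingest and ReingestPlan are hypothetical.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit. Stand-in only: NOT collision-resistant.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

#[derive(Debug, Default, PartialEq, Eq)]
struct ReingestPlan {
    rebuild: Vec<String>, // new or changed files: drop subgraph, re-run tier backend
    remove: Vec<String>,  // files no longer present: drop subgraph
    skipped: usize,       // unchanged files: not parsed at all
}

/// Compare stored hashes against current file bytes.
fn plan_reingest(
    stored: &HashMap<String, u64>,
    current: &HashMap<String, Vec<u8>>,
) -> ReingestPlan {
    let mut plan = ReingestPlan::default();
    for (path, bytes) in current {
        match stored.get(path) {
            Some(old) if *old == fnv1a64(bytes) => plan.skipped += 1,
            _ => plan.rebuild.push(path.clone()),
        }
    }
    for path in stored.keys() {
        if !current.contains_key(path) {
            plan.remove.push(path.clone());
        }
    }
    plan.rebuild.sort();
    plan.remove.sort();
    plan
}
```

Only the paths in rebuild reach a tier backend, so the expensive part of the run is proportional to the number of changed files. Hashing still reads every file once, which is cheap compared with parsing or compiling.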

An optional file-watcher component observes the source tree between ingest runs and updates the graph incrementally on save events. The watcher is guarded by directory skip-lists (excluding build output and dependency folders), debouncing (to fold rapid sequences of save events from editors using atomic-save patterns), and per-subtree rate limits (to prevent runaway processes from wedging the host). Watcher operation is opt-in; the pull-based ingest path remains the correctness path.
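Debouncing of save events can be sketched as follows. This is one common trailing-edge formulation, assumed here for illustration: a path is released for re-ingest only after it has been quiet for a full window, so the write-then-rename bursts of atomic-save editors collapse into one update. The Debouncer type and its methods are hypothetical, and timestamps are passed in explicitly so the logic stays deterministic.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Trailing-edge debouncer keyed by path.
/// Times are offsets from watcher start, supplied by the caller.
#[derive(Debug)]
struct Debouncer {
    window: Duration,
    last_seen: BTreeMap<String, Duration>,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Self { window, last_seen: BTreeMap::new() }
    }

    /// Record a save event; a later event on the same path restarts its timer.
    fn record(&mut self, path: &str, at: Duration) {
        self.last_seen.insert(path.to_string(), at);
    }

    /// Remove and return every path that has been quiet for at least `window`.
    fn drain_ready(&mut self, now: Duration) -> Vec<String> {
        let ready: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, at)| now.saturating_sub(**at) >= self.window)
            .map(|(path, _)| path.clone())
            .collect();
        for path in &ready {
            self.last_seen.remove(path);
        }
        ready
    }
}
```

Each drained path would then go through the same content-hash comparison as a pull-based run, which is why the watcher can stay optional without affecting correctness.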

6.6 Subprocess isolation and zero-panic guarantees

All external processes — SCIP indexers, language servers, tree-sitter walker invocations — run as operating-system subprocesses with explicit lifetime bounds. When the ingest run ends, subprocesses are killed. A subprocess crash surfaces as a Result::Err in the Rust parent; it cannot propagate as a panic into the retrieval path.
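A bounded-lifetime subprocess can be expressed with the standard library alone. The sketch below polls the child with try_wait, kills and reaps it once a deadline passes, and reports every outcome as a Result. The IngestError type, the run_bounded name, and the 10 ms polling interval are assumptions made for illustration; the engine's actual supervision code may differ.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug)]
enum IngestError {
    Spawn(io::Error),   // the backend binary could not be started
    Wait(io::Error),    // polling the child failed
    TimedOut,           // the child exceeded its lifetime bound
    Failed(ExitStatus), // the child exited unsuccessfully (includes crashes)
}

/// Run an external ingest backend with a hard upper bound on its lifetime.
fn run_bounded(mut cmd: Command, limit: Duration) -> Result<(), IngestError> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .map_err(IngestError::Spawn)?;
    let deadline = Instant::now() + limit;
    loop {
        match child.try_wait().map_err(IngestError::Wait)? {
            Some(status) if status.success() => return Ok(()),
            Some(status) => return Err(IngestError::Failed(status)),
            None if Instant::now() >= deadline => {
                // Kill, then wait, so no zombie outlives the ingest run.
                let _ = child.kill();
                let _ = child.wait();
                return Err(IngestError::TimedOut);
            }
            None => thread::sleep(Duration::from_millis(10)),
        }
    }
}
```

Because a crash in the child is observed only as an exit status in the parent, the caller handles it with ordinary error handling and nothing unwinds across the process boundary.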

The retrieval hot path (the MCP stdio loop that serves the agent) is written without unwrap() in library code; every fallible operation returns Result. The retrieval path never spawns subprocesses, never performs I/O beyond reading from the local graph store, and holds no lock that the ingest path holds. Ingest and retrieval are independent execution domains that share the graph through a controlled write-snapshot protocol.

6.7 Block diagram

Figure 1 — Component block diagram.

A simple block diagram. Three rows. Top row: source tree on the left, tiered ingest pipeline (three boxes labelled Tier 1 SCIP, Tier 2 LSP, Tier 3 tree-sitter) in the middle, arrow to the right. Middle row: graph store as a single cylinder, in-memory working set above it. Bottom row: retrieval API as a box at right, MCP stdio as the transport on the far right, agent symbol on the right edge. No internal types, no parameters, no specific languages labelled against tiers. What this figure shows: how the three components connect. What it deliberately does not show: internal storage layout, specific parameter values, per-language tier assignments, or any detail that would enable implementation replication.

Engineering Properties

7.1 Measured latency

We measure retrieval latency on the Kubernetes 1.32 corpus of the LongMemCode benchmark: 1,456 scenarios distributed across eight query categories. On a single Apple M4 Max workstation (48 GB RAM, macOS), per-query latency is 0.008 ms at the median (P50), 0.267 ms at P95, and 0.366 ms at P99; that is, 99 % of queries complete within 0.366 ms. Latency is dominated by the cost of edge traversal plus result serialisation; given a bounded result set, no component of the retrieval path scales with repository size. Figure 2 shows the full cumulative distribution function of per-query latency across the benchmark; Table 1 reports the per-category accuracy breakdown that accompanies these latencies.
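For readers who want to recompute tail statistics from raw per-query timings, the sketch below uses the nearest-rank definition of a percentile. The paper does not state which percentile definition the LongMemCode harness uses, so nearest-rank is an assumption, and the function name is hypothetical.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty sample set or a percentile above 100.
fn percentile(samples_ms: &[f64], p: usize) -> Option<f64> {
    if samples_ms.is_empty() || p > 100 {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // rank = ceil(p * n / 100), computed in integers to avoid rounding drift.
    let rank = (p * sorted.len() + 99) / 100;
    // Ranks are 1-based; p = 0 maps to the smallest sample.
    Some(sorted[rank.max(1) - 1])
}
```

Under this definition, P99 over the 1,456 timings is the 1,442nd smallest sample, so the fourteen slowest queries lie above it.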

Figure 2 — Per-query latency CDF across LongMemCode kubernetes-2k.

A cumulative distribution function chart. X-axis: per-query latency in milliseconds, log scale. Y-axis: fraction of queries at or below that latency, from 0 to 1. A single curve representing the flat union of per-query timings across the 1,456 scenarios in kubernetes-2k. What this figure shows: the latency distribution has no long tail; the curve reaches the top within two orders of magnitude of the median. What it deliberately does not show: per-category breakdown, or any architectural attribution for why the tail is short.

Table 1 — Per-category accuracy on LongMemCode kubernetes-2k (1,456 scenarios). Source data: results/argosbrain-kubernetes-2k.json in the LongMemCode repository.

Category	Scenarios	Passed	Pass rate
ApiDiscovery	256	256	100.00 %
BugFix	292	292	100.00 %
Completion	468	457	97.65 %
Config	52	52	100.00 %
ControlFlow	33	33	100.00 %
FeatureAdd	152	152	100.00 %
Refactor	121	121	100.00 %
TestGen	82	82	100.00 %
Total	1,456	1,445	99.244 %

Completion is the only category with failures (457 of 468 passed, 97.65 %); all eleven misses are ambiguous bare-name lookups for which the corpus ground truth selects an obscure variant. The remaining seven categories pass every scenario.

7.2 Measured memory footprint

Memory footprint, measured as resident-set size during steady-state query serving, is in the low hundreds of megabytes for repositories of several hundred thousand symbols. Footprint scales approximately linearly with the number of stored nodes and edges, with a constant factor set by the serialisation format and the in-memory index structures.

Limits at extreme scale. The measurements in this paper cover repositories up to the scale of the largest corpora in LongMemCode (several hundred thousand symbols). We have not benchmarked repositories in the Linux-kernel or Chromium class (on the order of several million symbols). At that scale an all-in-memory graph would cross the tens-of-gigabytes threshold and become impractical on laptop-class hardware. The architecture anticipates this by leaving room for a tiered-storage layer: hot subgraphs remain in process memory, cold subgraphs spill to a local key-value store (SQLite, RocksDB, or LMDB are the obvious candidates). The retrieval API does not change — a cold-tier fetch becomes a hidden I/O inside a traversal step, with a latency tax that can be measured and reported per query class. We flag the tiered-storage extension here as a deliberate scope boundary rather than an oversight; every latency and footprint claim in the present paper is bounded to the measured scale.

7.3 Measured cost

Retrieval has no monetary cost per query. There is no LLM call, no embedding call, no external API call on the read path. The ingest cost is one-time per changed file: running the tier’s backend on the file, parsing its output, and inserting into the graph. Compilation or tree-sitter parsing cost is the dominant term.

Figure 3 — Cost per thousand retrieval queries, comparative.

A horizontal bar chart. Y-axis: systems (Neurogenesis / structural reference, plus placeholder bars for any other adapter present in LongMemCode at submission time). X-axis: cost in US dollars per 1 000 retrieval queries, log scale. Source data: Neurogenesis at $0 (measured, no LLM on read path); other systems inferred from their publicly documented pricing and the prompt tokens they inject per query (exact method described in the caption). What this figure shows: the architectural choice of zero-LLM retrieval produces an order-of-magnitude cost gap versus any system that injects retrieved content into an LLM prompt. What it deliberately does not show: internal explanation of how zero-LLM retrieval is achieved — that is the architecture itself.

7.4 Re-ingest cost

Re-ingest on a zero-diff source tree (no file content changes) completes in under five seconds for a large repository. Re-ingest after a three-hundred-file diff completes in a few seconds for compiler-grade-ingested languages and sub-second for tree-sitter-ingested languages. The cost is linear in the number of changed files.

7.5 Zero-panic property

The retrieval hot path has no unwrap() in library code; every fallible operation is threaded through Result. Ingest subprocesses cannot propagate panics into the retrieval path because they are separated by operating-system process boundaries. A malformed input file fails ingest for that file, logs a warning, and does not block the ingest run from completing or the retrieval server from serving previously-ingested queries.
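The per-file containment described above can be sketched as a loop that records failures instead of propagating them. The ingest_all function, the IngestReport type, and the closure standing in for a tier backend are hypothetical; the point is only that one Err never aborts the run.

```rust
#[derive(Debug, Default)]
struct IngestReport {
    ingested: Vec<String>,
    warnings: Vec<String>,
}

/// Ingest every path; a failing file is logged and skipped, never fatal.
/// `ingest_file` stands in for whichever tier backend handles the file.
fn ingest_all<F>(paths: &[&str], mut ingest_file: F) -> IngestReport
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut report = IngestReport::default();
    for &path in paths {
        match ingest_file(path) {
            Ok(()) => report.ingested.push(path.to_string()),
            Err(reason) => report
                .warnings
                .push(format!("skipped {path}: {reason}")),
        }
    }
    report
}
```

Files that fail keep whatever subgraph they had before the run, so previously ingested data remains queryable while the warning is investigated.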

7.6 Non-applicability of LLM-memorization critiques

Recent work has documented benchmark contamination patterns in LLM-driven code-task evaluations. For example, the SWE-Bench Illusion analysis [Wang et al., 2025] reports up to 34.9 % five-gram overlap between model output and benchmark solutions on SWE-Bench Verified, indicative of training-data memorisation rather than genuine reasoning. The critique is structurally inapplicable to the measurements reported in this paper. The retrieval path of Neurogenesis contains no LLM inference of any kind: a query reaches the engine, traverses the local graph, and returns a deterministic result drawn from the structural representation of the repository. There is no probability distribution to bias toward memorised text, no token sampling, and no neural component participating in the read response. Two consequences follow. First, identical inputs produce identical outputs across runs and across machines; only the measured latency varies, with background OS load. Second, the headline accuracy of 99.244 % on kubernetes-2k cannot be attributed to memorisation of the corpus by a model; it reflects the structural agreement between the engine’s ingested graph and the benchmark’s pinned ground truth. Reviewers concerned about contamination should focus their scrutiny on the ingest pipeline and the corpus oracle, both of which are open to inspection in the respective public repositories listed in the Reproducibility section.

Design-Space Alternatives

8.1 Vector-only storage

A vector-only memory engine would embed each code chunk and retrieve via similarity. This is the default paradigm in the LLM application layer. We reject it for Neurogenesis because the structural-query distribution [Jibleanu, 2026b] penalises it: vector retrieval cannot natively return empty sets for hallucinated identifiers, cannot enumerate overrides, cannot follow inheritance edges. A vector component is complementary to the graph — Neurogenesis can coexist with one — but it cannot replace the graph for structural workloads.

8.2 External graph database

An external graph database (Neo4j, FalkorDB, Kuzu, Neptune) is the path taken by Graphiti and Zep. We reject it because it violates Goal G2 (sub-millisecond P99 at laptop resource budget): network round-trip costs dominate graph-local traversal costs. In-process Rust storage gives us deterministic latency; database-backed storage does not.

8.3 LLM-in-the-loop on the retrieval path

The read path of MemGPT / Letta invokes LLM-driven tools. We reject the pattern because it violates Goal G3 (zero monetary cost on the retrieval path). A memory engine that charges per read scales its cost with agent usage; a memory engine that pre-computes its structure at ingest and serves from that structure does not.

8.4 Incremental indexer running against the source

Some industrial code-intelligence products operate an incremental indexer that continuously maintains an up-to-date index against the source tree. We reject the continuous path in favour of a content-hash pull model for two reasons: it simplifies operation (there is no daemon to monitor), and it makes the cost of re-ingest explicitly attributable rather than amortised into background CPU use.

8.5 Full tree-sitter everywhere

A simpler design would use tree-sitter for every language rather than a tiered pipeline. We reject it because tree-sitter produces surface syntax and does not perform cross-file symbol resolution. The head of the language distribution — where most real code is written — has compiler-grade indexing available, and using it produces substantially richer graphs. The tiered approach pays extra engineering cost upfront to hit the richer indexing when available.

Limitations

9.1 Tier coverage is uneven

Tier 1 (compiler-grade SCIP) covers a subset of the languages we target. Tier 2 (live LSP) covers languages with mature language servers but no SCIP indexer. Tier 3 (tree-sitter walkers) covers the remainder. The structural richness of the resulting graph is correspondingly uneven: a refactor-audit query on a Tier 1 language is backed by cross-file type resolution; the same query on a Tier 3 language is backed by syntactic inference with known gaps. Users working primarily in Tier 3 languages will see a larger residual gap between Neurogenesis and a hypothetical perfect indexer than users working in Tier 1 languages.

9.2 Ingest is not instant

The content-hash skip makes re-ingest O(changed files), but the first-time ingest of a repository pays the full cost of running every file’s tier backend. On large repositories, first-time ingest can take minutes. We consider this acceptable — it amortises across sessions — but we name it explicitly.

9.3 Tier 2 inherits language-server variance

Live LSP ingest is gated by language-server quality. Language servers are notorious for memory leaks, crashes, and workspace-load latency variance. The subprocess-isolation and timeout model (Section 6.6) bounds the blast radius but does not eliminate it: an ingest run against an uncooperative language server takes longer or fails on that language specifically, without affecting retrieval availability for already-ingested data.

9.4 Semantic queries require additional infrastructure

As argued in companion work [Jibleanu, 2026b], structural and semantic queries are distinct. Neurogenesis in its currently described form handles structural queries. A complete production memory layer for coding agents benefits from a companion semantic-retrieval component, which can share the same ingest pass but uses an embedding index alongside the graph. We do not describe such a companion in this paper.

9.5 Team sync is unimplemented

Neurogenesis is local-first by design: ingest, storage, and retrieval all happen in-process. Multi-user team memory with shared indices and synchronisation across user accounts is not implemented. Enterprises with these requirements today should use conversational-memory products with team-sync support; we discuss this gap as future work.

9.6 Single-corpus reporting

The accuracy and latency numbers reported in Section 7 are measured on a single corpus — Kubernetes 1.32, the kubernetes-2k scenario set of LongMemCode (1,456 scenarios, eight categories). We chose a single corpus deliberately to make the per-scenario inspection in Table 1 tractable and the reproduction recipe in the Reproducibility section short. The LongMemCode repository ships additional corpora (small, medium, and large open-source codebases across multiple languages and ecosystems) that other adapter authors can use to cross-validate or challenge the kubernetes-2k result; we do not report on those here in order to keep the present paper focused on a single, fully described measurement. We acknowledge that single-corpus framing is weaker than multi-corpus reporting and consider broader cross-corpus measurements a direct extension of this work.

We conclude by situating Neurogenesis against adjacent systems along the architectural axes that drove our four design goals. Table 2 lists verifiable architectural properties of each system as documented in their own source code, protocol specifications, or public engineering write-ups; we deliberately omit numerical performance comparisons because we have not independently reproduced any benchmark numbers attributed to the other systems.

Table 2 — Architectural properties of Neurogenesis and adjacent systems.

System	Storage layer	Retrieval read path	Update model
Neurogenesis	In-process Rust + bincode on local disk	Deterministic graph traversal; no LLM, no embedding, no external API	Content-hash diff per file at ingest invocation
Graphiti / Zep	External graph database (Neo4j, FalkorDB, Kuzu, or Neptune)	Database-backed traversal; entity extraction via LLM at write time	Append-on-event with periodic LLM-driven consolidation
Mem0	External vector store + optional graph layer	Vector similarity lookup; LLM-extracted facts injected into the prompt	LLM-driven extraction at write time
Letta	Tiered core / archival / recall buffers managed by the agent	Agent calls memory-management tools that invoke LLM operations	Agent-orchestrated reads and writes inside the LLM loop
Cursor Memories	Prompt-prefix store inside the host editor	Selected memories injected into the agent prompt; not a programmatic query API	Prompted creation by the agent on user request
Continue @codebase	Embedding index over tree-sitter chunks	Embedding lookup; chunks injected into the prompt	Re-embed on file change
Aider repo map	Stateless — recomputed per request	PageRank over tree-sitter symbol surface, packed into a token budget	None (no persistent index)

Each row of Table 2 describes a structural property — what kind of storage the system uses, what mechanism it invokes on a read, and how it stays current — that the reader can verify against the cited system’s own documentation or source. The four design goals stated in Section 5 are realised in Neurogenesis through the specific combination of choices in row 1; whether and to what extent any other row’s combination realises the same goals is a question we leave to those systems’ own published measurements.

Conclusion and Future Work

We have described Neurogenesis, a graph-first code-memory engine designed for AI coding agents on inner-loop workloads. The engine satisfies four design goals simultaneously: structural correctness at compiler-grade depth, sub-millisecond P99 retrieval, zero monetary cost per read, and O(changed files) re-ingest. It does so by combining a tiered ingest pipeline, in-process Rust graph storage, and a retrieval API that exposes structural graph primitives directly. Measured on the LongMemCode kubernetes-2k corpus (1,456 scenarios across eight query categories), the engine reaches raw accuracy 99.244 % at P99 latency 0.366 ms and $0 per 1,000 queries; its memory footprint stays within laptop resource budgets for realistic repositories.

Future work falls into three branches. First, expanding tier coverage, particularly Tier 1 (SCIP) support for additional languages as upstream indexers mature. Second, companion semantic-retrieval infrastructure that shares ingest with the graph and addresses the non-structural portion of the coding-agent query distribution. Third, team-sync and multi-user deployment patterns that preserve the local-first operational model while allowing shared indices for collaborative workflows.

Reproducibility

All numbers reported in Section 7 and in the abstract are produced by the public LongMemCode benchmark and are reproducible from the source materials below.

Benchmark corpus & harness: github.com/CataDef/LongMemCode — MIT-licensed; the kubernetes-2k corpus contains 1,456 scenarios across eight categories (ApiDiscovery, BugFix, Completion, Config, ControlFlow, FeatureAdd, Refactor, TestGen). Corpus identity hash: fnv64:1451e14103faab89.
Result file: results/argosbrain-kubernetes-2k.json in the LongMemCode repository carries the run summary (p50_latency_ms, p95_latency_ms, p99_latency_ms, raw_accuracy, weighted_accuracy, cost_per_1k_queries_usd) and the per-category breakdown reproduced as Table 1 of this paper.
Engine source: github.com/CataDef/neurogenesis — the Rust implementation of the architecture described in Section 6, distributed as the neurogenesis-bundle crate consumed by the LongMemCode adapter. The same engine is distributed in pre-built form as the ArgosBrain product (argosbrain.com); the research and product names refer to the same code base.
Hardware: Apple M4 Max, 48 GB RAM, macOS; a single workstation, no GPU on the read path.
Reproduction: clone the LongMemCode repository, install the engine via argosbrain init, then run the adapter against the kubernetes-2k scenarios with the harness documented in docs/V0.3_METHODOLOGY.md. The result file above regenerates within ±0.5 % of the headline numbers on comparable hardware, with the variance set by ingest seed and background OS load.

References

@misc{aider2023repomap,
  title={Repository Map: Scaling to Large Codebases with Tree-sitter and PageRank},
  author={Aider Team},
  year={2023},
  url={https://aider.chat/2023/10/22/repomap.html}
}

@misc{brunsfeld2018treesitter,
  title={Tree-sitter: An Incremental Parsing System for Programming Tools},
  author={Brunsfeld, Max},
  year={2018},
  url={https://tree-sitter.github.io/tree-sitter/}
}

@misc{chhikara2024mem0,
  title={Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory},
  author={Chhikara, Prateek and others},
  year={2025},
  howpublished={arXiv preprint arXiv:2504.19413},
  url={https://arxiv.org/abs/2504.19413}
}

@misc{continue2025codebase,
  title={@codebase Retrieval Architecture},
  author={Continue Dev Team},
  year={2025},
  url={https://docs.continue.dev/customize/deep-dives/codebase}
}

@misc{hou2025memorybench,
  title={MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems},
  author={Hou, Yufei and others},
  year={2025},
  howpublished={arXiv preprint arXiv:2510.17281},
  url={https://arxiv.org/abs/2510.17281}
}

@misc{jibleanu2026longmemcode,
  title={LongMemCode: A Deterministic Benchmark for Code-Memory in AI Agents},
  author={Jibleanu, Aurelian},
  year={2026},
  note={Companion paper and MIT-licensed benchmark repository}
}

@misc{jibleanu2026taxonomy,
  title={Structural vs Semantic Retrieval in Code-Memory: A Query-Type Taxonomy},
  author={Jibleanu, Aurelian},
  year={2026},
  note={Companion paper}
}

@misc{liu2026memorysurvey,
  title={Memory in the Age of AI Agents: A Survey},
  author={Liu, Shichun and others},
  year={2026},
  howpublished={arXiv preprint arXiv:2512.13564},
  url={https://arxiv.org/abs/2512.13564}
}

@misc{microsoft2016lsp,
  title={Language Server Protocol Specification},
  author={Microsoft},
  year={2016},
  url={https://microsoft.github.io/language-server-protocol/}
}

@misc{packer2023memgpt,
  title={MemGPT: Towards LLMs as Operating Systems},
  author={Packer, Charles and others},
  year={2023},
  howpublished={arXiv preprint arXiv:2310.08560},
  url={https://arxiv.org/abs/2310.08560}
}

@misc{rasmy2025zep,
  title={Zep: A Temporal Knowledge Graph Architecture for Agent Memory},
  author={Rasmussen, Preston and others},
  year={2025},
  howpublished={arXiv preprint arXiv:2501.13956},
  url={https://arxiv.org/abs/2501.13956}
}

@misc{sourcegraph2023scip,
  title={SCIP: The Source Code Intelligence Protocol},
  author={Sourcegraph},
  year={2023},
  url={https://github.com/sourcegraph/scip}
}

@misc{wang2025swebenchillusion,
  title={The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason},
  author={Wang, Yuanyuan and others},
  year={2025},
  howpublished={arXiv preprint arXiv:2506.12286},
  url={https://arxiv.org/abs/2506.12286}
}

Appendices

13.1 Appendix A — Protocol: ingest backend abstraction

This appendix gives a pseudocode interface for the abstract ingest backend: signatures only, with no implementation bodies. It is a simplified pseudocode, not the Rust trait definition, and conveys the shape of the interface without disclosing the trait’s internals.
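As a purely illustrative aid, the sketch below shows one shape such an abstraction could take, based only on the tiered-pipeline description in the abstract (compiler-grade SCIP indexers first, then language-server workspaces, then tree-sitter walkers). It is not the engine's interface. Every name in it (`Tier`, `IngestBackend`, `StaticBackend`, `select_backend`) is hypothetical, and the language names are placeholders rather than a statement of which languages fall in which tier.

```rust
/// Precision tiers in the order given in the abstract, highest precision first.
/// Deriving `Ord` makes `Scip < LanguageServer < TreeSitter`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    Scip,
    LanguageServer,
    TreeSitter,
}

/// Hypothetical backend abstraction: signatures only.
trait IngestBackend {
    /// Which precision tier this backend belongs to.
    fn tier(&self) -> Tier;
    /// Whether this backend can index the given language.
    fn supports(&self, language: &str) -> bool;
}

/// Toy backend with a fixed language list, used only to exercise the sketch.
struct StaticBackend {
    tier: Tier,
    languages: &'static [&'static str],
}

impl IngestBackend for StaticBackend {
    fn tier(&self) -> Tier {
        self.tier
    }
    fn supports(&self, language: &str) -> bool {
        self.languages.contains(&language)
    }
}

/// Pick the highest-precision backend that supports `language`, if any.
fn select_backend<'a>(
    backends: &'a [Box<dyn IngestBackend>],
    language: &str,
) -> Option<&'a dyn IngestBackend> {
    let mut best: Option<&'a dyn IngestBackend> = None;
    for backend in backends {
        if !backend.supports(language) {
            continue;
        }
        let better = match best {
            Some(current) => backend.tier() < current.tier(),
            None => true,
        };
        if better {
            best = Some(&**backend);
        }
    }
    best
}

fn main() {
    let backends: Vec<Box<dyn IngestBackend>> = vec![
        Box::new(StaticBackend { tier: Tier::TreeSitter, languages: &["lang-a", "lang-b"] }),
        Box::new(StaticBackend { tier: Tier::Scip, languages: &["lang-a"] }),
    ];
    for language in ["lang-a", "lang-b", "lang-c"] {
        let chosen = select_backend(&backends, language).map(|b| b.tier());
        println!("{language}: {chosen:?}");
    }
}
```

Keeping tier selection outside the backends means a new indexer can be registered without touching the selection logic, which is one plausible reason to abstract the backend at all.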

13.2 Appendix B — Protocol: retrieval API surface

This appendix lists the MCP-exposed retrieval operations together with their expected input and output shapes. These operations are already public in the MCP schema that ships with the engine, so they can be reproduced here.