DeepInit — Grounded, verified code-truth for your agent and you

How it works

Seven waves. Parallel. Adversarial. Thorough.

It comes down to three moves — parse, understand, verify. Most tools hand your whole repo to an LLM and hope; DeepInit parses your code first (real AST parsing via Graphify, 25 languages), reasons about meaning on top, then verifies every finding against the code before it's written. Under the hood that's seven waves: parallel subagents analyze each component and then the patterns across them, before findings are cross-checked and adversarially reviewed.

Preflight · no tokens

Auto-detect tools, estimate the token cost up front, check the database, get permissions. Nothing runs until you approve it.

Discovery · no tokens

Scan the tree, detect the stack, build the structural graph, order components by dependency, read git history — all deterministic, before any model runs.

Vertical analysis · parallel · heaviest

Deep into each component, leaves first. Parallel subagents pull business rules, workflows and integration points — grounded to file:line.

Horizontal analysis · parallel

Across all components — the patterns no single-component pass can see: shared tables, end-to-end workflows, bounded contexts.

Cross-references

Unify it — entity ↔ component ↔ table maps, rule-to-workflow links, coverage-gap detection across the whole set.

Filter, verify & review · the gates

Drop anything inferable; every surviving claim must resolve to real code; a critic agent challenges the findings (0–3 cycles).

Generation

Emit the two tiers — lean root + deep .ai/docs/ — plus a Claude Code skill package. Backup first; re-check every claim against the code.
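
The "every claim must resolve to real code" gate can be pictured with a minimal sketch. It assumes a citation is written as `relative/path:LINE`; the function names `citation_resolves` and `keep_grounded` are hypothetical and are not DeepInit's actual API.

```python
import re
from pathlib import Path

# Assumed citation shape: "relative/path.ext:LINE" (illustrative only).
CITATION = re.compile(r"^(?P<path>[^:\s]+):(?P<line>\d+)$")

def citation_resolves(repo_root: Path, citation: str) -> bool:
    """Return True only if the cited file exists and has at least that many lines."""
    match = CITATION.match(citation)
    if match is None:
        return False
    root = repo_root.resolve()
    target = (root / match["path"]).resolve()
    # Refuse paths that escape the repository root, and paths that are not files.
    if not target.is_relative_to(root) or not target.is_file():
        return False
    with target.open(encoding="utf-8", errors="replace") as handle:
        line_count = sum(1 for _ in handle)
    return 1 <= int(match["line"]) <= line_count

def keep_grounded(repo_root: Path, claims: list[dict]) -> list[dict]:
    """Drop every claim whose citation does not resolve to a real line."""
    return [claim for claim in claims if citation_resolves(repo_root, claim["cite"])]
```

A claim that cites a missing file, a line past the end of the file, or a path outside the repository is simply dropped rather than written.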

Inside waves 3 & 4 — go deep, then go wide

Wave 3, Vertical: deep into each component. Each of auth/, billing/ and orders/ is analyzed for its business rules, DB schema, workflows and the why.

Wave 4, Horizontal: then go wide, across auth/, billing/ and orders/ together, looking for shared tables, end-to-end workflows, bounded contexts and domain rules.
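
The leaves-first order used by the vertical wave can be sketched with the standard library's topological sorter. The dependency map below is hypothetical (it reuses the auth/billing/orders example), and this is one plausible way to compute the order, not necessarily how DeepInit does it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "orders": {"billing", "auth"},
    "billing": {"auth"},
    "auth": set(),
}

def leaves_first(depends_on):
    """Order components so each one comes after everything it depends on."""
    # TopologicalSorter treats the mapped values as predecessors, so
    # dependency-free components (the leaves) are emitted first.
    # It raises graphlib.CycleError when the graph contains a cycle.
    return list(TopologicalSorter(depends_on).static_order())

order = leaves_first(DEPENDS_ON)
print(order)  # ['auth', 'billing', 'orders']
```

Analyzing in this order means that by the time a component is read, the components it calls into have already been described.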

Your database, and what stays private

DeepInit reads your live database schema read-only — SQL this release (Postgres, MySQL, SQLite), where it compares the live schema against your code and flags the drift. NoSQL stores (Mongo, Redis and others) are stub-level for now — honestly labeled, not implied as done. A secret/PII redaction gate scrubs anything sensitive before a single file is written to disk. (A cross-model verification pass — a second model double-checking findings — is on the roadmap, not shipped yet.)

It tells you the cost before it spends a token. A preflight shows the estimate; nothing runs until you approve it, under a ceiling you set. After the first run, /deep-init:refresh re-reads only what changed and /deep-init:check costs zero tokens — so you're never surprised by the bill.

When it has to ask, it asks in plain language — once. If a run needs a decision — the cost of a big repo, a database it found, an existing CLAUDE.md — DeepInit asks one plain question, shown once, with only the choices that apply, and it states the outcome, not an internal setting:

Cost is framed as scale, not dollars. “This is a large codebase — the full deep analysis will take a while and use a noticeable chunk of your Claude usage” → Full deep analysis · Faster, lighter pass · Just the main app code · Cancel. The $25 guard is a spend cap for pay-per-use API billing, not a plan limit — on a Claude subscription you aren’t billed per run, so it’s just a “this is a big repo — proceed?” check.
The database, in one read-only question. “I found a database — read it live to check the real schema? (Read-only; I never touch production.)” with a Dev / Staging / Prod picker — and a production database is automatically declined to code-only.
Your existing file, your call. “I can update it (your exact file is saved to a restorable backup), preview my version beside it, or just write the docs.”

This is enforced, not cosmetic: a rule (R10) bans internal codes and jargon — an IF-2, “the R7 gate”, “review cycles”, “SARIF” — from anything you see, with an automated check on every spec’d prompt so it can’t regress.

The philosophy: tokens are fuel, not a constraint.

DeepInit burns tokens generously on the first pass — parsing, inspecting your database, cross-checking and verifying every finding — because one thorough analysis is worth far more than a cheap guess your agent will trust for months. Quality is the default, not an upsell.

And counterintuitively, it costs you less.

Two ways. At runtime: your agent reads a lean, ready-made brief instead of burning tokens re-exploring the codebase from scratch every session — and a tight file is cheaper per task than a bloated one (the ETH Zurich / LogicStar study cited under Evidence & limits measured 20%+ more cost per task from oversized context).

Over time: you pay for the deep analysis once, then /deep-init:refresh re-reads only what changed and /deep-init:check is free — instead of re-deriving the whole picture every session, on every machine, for every teammate. Spend on depth once; coast for months.

Stays current

Run it once — then it stays current, proportional to the diff.

After the first run you never re-pay for the whole repo. An edit re-documents only its blast radius — the components you touched, plus anything whose public interface actually moved. Here is exactly how /deep-init:refresh stays proportional to the change:

Detect · 0 tokens · deterministic

A content_hash per component, compared against the stored manifest by an authoritative symmetric set-diff (stored + current keys). git diff and the commit breadcrumb are only accelerators — so a deleted module, or a repo with no git history, is still caught.
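
A minimal sketch of that symmetric set-diff, assuming the manifest is a plain mapping from component name to content hash. The helper names `content_hash` and `diff_manifest` and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib

def content_hash(files: dict[str, bytes]) -> str:
    """Hash a component's files in sorted path order so the result is stable."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()

def diff_manifest(stored: dict[str, str], current: dict[str, str]) -> dict[str, set[str]]:
    """Compare stored and current hashes over the union of both key sets."""
    stored_keys, current_keys = set(stored), set(current)
    return {
        "added": current_keys - stored_keys,
        # Present in the manifest but gone from the tree: found without git.
        "removed": stored_keys - current_keys,
        "changed": {k for k in stored_keys & current_keys if stored[k] != current[k]},
    }
```

Because both key sets are walked, a component that exists only in the stored manifest shows up as removed, which a one-directional scan of the working tree would miss.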

Mark the dirty set

Just the components whose content actually changed — nothing else is touched.

Skip safely — the interface-hash test · the cost saver

Recompute each dirty component’s public-surface hash. A body-only refactor re-analyzes that one component but skips its dependents (their view is provably unchanged); only a changed export marks the transitive dependents dirty.
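
The skip rule can be sketched as a small graph walk. The inputs are assumptions made for illustration: a set of components whose content changed, the subset whose public-surface hash also changed, and a map from each component to the components that import it directly.

```python
from collections import deque

def dirty_set(changed, interface_moved, dependents):
    """Return every component that must be re-analyzed.

    changed:         components whose content hash differs
    interface_moved: the subset of `changed` whose public-surface hash also differs
    dependents:      component -> components that import it directly
    """
    dirty = set(changed)
    seen = set(interface_moved)
    queue = deque(seen)
    while queue:
        component = queue.popleft()
        for dependent in dependents.get(component, ()):
            dirty.add(dependent)
            # Follow each dependent once so the mark reaches transitive
            # dependents and the walk terminates on cyclic graphs.
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return dirty
```

A body-only refactor leaves `interface_moved` empty, so the walk never starts and only the edited component is re-analyzed.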

Always re-run the whole-system docs · the safety net

The five cross-cutting docs re-run regardless of what changed — because a cross-component effect (a new circular dependency, a shifted end-to-end workflow) is invisible from any single component’s diff. This is what makes the interface-hash skip above a reversible optimization, not a correctness risk.

Re-emit only the affected files

Filter, redact, re-verify every citation, then write only the changed files — inside the owned-region markers, with a dated reversible backup. Issues are diffed against a symbol-keyed baseline (new / persisting / resolved / regressed), so a line-shift never re-churns a finding.
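
The four baseline states can be sketched as below. The key shape (`IssueKey`) is hypothetical, and the reading of "regressed" as "a finding that was resolved in the baseline and has come back" is an assumption, since the page does not define it.

```python
from typing import NamedTuple

class IssueKey(NamedTuple):
    """Identity of a finding that survives line shifts (hypothetical shape)."""
    rule: str
    symbol: str  # e.g. "billing.invoice.apply_discount"; no line number on purpose

def classify(baseline: dict[IssueKey, str], current: set[IssueKey]) -> dict[IssueKey, str]:
    """Label each finding relative to the previous run.

    `baseline` maps a key to "open" or "resolved" as of that run.
    """
    labels = {}
    for key in current:
        previous = baseline.get(key)
        if previous is None:
            labels[key] = "new"
        elif previous == "resolved":
            labels[key] = "regressed"  # assumption: a fixed finding that came back
        else:
            labels[key] = "persisting"
    for key, previous in baseline.items():
        if previous == "open" and key not in current:
            labels[key] = "resolved"
    return labels
```

Because the key holds a rule and a symbol rather than a line number, moving a function down the file leaves its finding labeled persisting instead of producing one resolved and one new entry.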

Two things it therefore guarantees — the two ways docs usually rot, closed:

The guarantee

A real interface change can’t silently skip a dependent

Even on the grep path (no precise parser present), DeepInit reconciles the public surface it captured against export-indicator tokens — an unusual form (an export *, a CommonJS module.exports, a dynamic __all__) marks the surface “incomplete” and conservatively re-checks dependents. A breaking change never quietly skips the code that needed it.
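
A minimal sketch of that conservative fallback, assuming the grep path works on raw source text. The three patterns mirror the forms named above; they are illustrative and not exhaustive, and the function names are hypothetical.

```python
import re

# Export forms a plain text scan cannot enumerate reliably.
UNUSUAL_EXPORTS = [
    re.compile(r"^\s*export\s+\*", re.MULTILINE),          # ES re-export of everything
    re.compile(r"\bmodule\.exports\b"),                     # CommonJS
    re.compile(r"\b__all__\s*(\+=|\.(append|extend)\()"),   # __all__ built at runtime
]

def surface_status(source: str) -> str:
    """Return "incomplete" when the captured public surface cannot be trusted."""
    if any(pattern.search(source) for pattern in UNUSUAL_EXPORTS):
        return "incomplete"
    return "complete"

def must_recheck_dependents(source: str, interface_changed: bool) -> bool:
    # Conservative rule: an untrusted surface is treated like a changed one.
    return interface_changed or surface_status(source) == "incomplete"
```

The cost of a false "incomplete" is one extra re-check of the dependents; the cost of a false "complete" would be a skipped dependent, so the sketch errs toward re-checking.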

The guarantee

A removed file never leaves an orphaned doc

Because detection is a symmetric set-diff (stored vs. current), a deleted or moved component is caught and its docs archived — even with no git history, or a shallow / rewritten ref. No stale page outlives the code it described.

And it tells you — for free — the moment the docs fall behind.

Underneath is one 0-token, no-LLM staleness check (/deep-init:check): it hashes the files it documented and compares them to your working tree by that same symmetric set-diff — so a deleted file, or a repo with no git history, is caught too. Two plugin-shipped hooks surface it proactively, both calling that one script: when you open a session, and when you send the first prompt of a stale one — the second catches staleness that shows up mid-session, right after you commit. They share one once-per-session gate, so you’re offered a refresh at most once per session — never spammy, and silenced in one line (“Don’t ask in this repo”, or /deep-init:customize → Freshness).

When the docs are behind, DeepInit offers a one-click refresh first, before your task — a simple pick: Update now (it runs /deep-init:refresh for you) · Not now · Don’t ask in this repo. It never runs the costly update on its own — you decide. And the offer now shows what actually changed — the files that drifted and which components, not just a count — so you can judge whether a refresh is worth it. A real headless auto-refresh exists but is off by default (the only level that spends tokens), and nothing ever auto-commits — you always review the diff.

The honest part: a git hook can’t summon an AI session, so DeepInit doesn’t pretend your docs silently regenerate on every commit. What it guarantees is that staleness is made visible and a refresh is one click away — reliably, the moment you open a session or send your first prompt.

The problems layer

The problems fall out as a byproduct — grounded, ranked, report-only.

Because DeepInit already extracts your rules, your schema, and the why, the problems surface as a byproduct — now across about ten issue families plus a class-conformance census, every finding grounded to the line, framed as “likely” rather than asserted, and report-only — it never touches your source. Anything a linter already catches is suppressed, because false positives are what kill tools like this. Five of the kinds it looks for:

Your database has drifted from your code

Code that still reads a column the live database no longer has. Schema-diff tools check the DB against a declared schema; this checks it against your actual code — on legacy code with no clean schema.

The code contradicts a decision you made

Code that breaks a decision you recorded, or a "temporary" hack that's quietly load-bearing — only visible when the documented why is on hand.

A rule is enforced in some places, not others

A business rule applied on one path but missing on another that writes the same data — access-control gaps included (surfaced as rule violations, not a security claim).

Two components are secretly coupled

Parts quietly sharing a table with no interface between them — change one, break the other.

Where to look first

Everything ranked by what to fix first — how often it changes, how few people understand it, how critical it is, how thin the tests are.

The goal — a short, trustworthy list worth a human's attention, not a wall of warnings. Findings are framed as likely and grounded to a file:line rather than proven, and this isn't a security product.

One report you can read — and act on.

📖

One report — Docs · Insights · Map

A single self-contained, offline report.html: the browsable docs (search, a component tree, an architecture overview, a decisions timeline, jump-to-file:line), the issue/metrics dashboard, and an interactive Map view — three top-level views merged into one file with a ⌘K palette. Vanilla JS, no network. (It supersedes the old docs-viewer.html + dashboard.html, now redirect stubs.)

🌐

Read it in your language

One command emits report.<lang>.html — Spanish and Hebrew (full RTL) built in, any other language on demand, with an in-app switcher. English stays the canonical analysis; grounded tokens (file:line, code, record IDs) are masked and verified, so a translation can never corrupt a grounded claim — and any miss falls back to English, never a fabricated translation.
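
The mask, translate, verify and fall-back sequence can be sketched as follows. The token pattern, the placeholder format and the `translate` callable are all assumptions made for illustration; the real pipeline also masks code and record IDs, which this sketch leaves out.

```python
import re
from typing import Callable

# Hypothetical pattern for grounded tokens such as "src/app.py:42".
GROUNDED = re.compile(r"[\w./-]+\.\w+:\d+")

def translate_safely(text: str, translate: Callable[[str], str]) -> str:
    """Translate prose while keeping every file:line token byte-for-byte."""
    tokens: list[str] = []

    def mask(match: re.Match) -> str:
        tokens.append(match.group(0))
        return f"[[{len(tokens) - 1}]]"

    masked = GROUNDED.sub(mask, text)
    translated = translate(masked)
    # Verify: each placeholder must survive exactly once, or the result is rejected.
    for index in range(len(tokens)):
        if translated.count(f"[[{index}]]") != 1:
            return text  # fall back to the English original
    for index, token in enumerate(tokens):
        translated = translated.replace(f"[[{index}]]", token)
    return translated
```

The translator only ever sees placeholders, so it has no chance to alter a path or a line number, and a dropped or duplicated placeholder rejects the whole translation.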

📊

Risk, ranked — with the graph

The Insights view shows real composite risk (severity × criticality × churn × bus-factor × coverage) when the repo has the signals — and honestly says “unavailable” when it doesn’t, never a fake zero. And a first-class, interactive Map view turns the component-dependency graph DeepInit already computes into a navigable explorer — pan, zoom, filter by risk, and click any node to jump straight to that component’s docs (it degrades to a static rendering when there’s no graph). And it’s for you, the human — not a place your agent has to go and query: your agent already carries these facts in the context it loads, so the Map is just a window onto the same verified model, not where the answer lives.
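
As a rough illustration of the composite score and its "unavailable" behaviour, the sketch below multiplies the five factors and returns nothing when a signal is missing. The 0..1 scaling, the direction of each factor and the function names are assumptions; the page states only the product form.

```python
from math import prod
from typing import Optional

FACTORS = ("severity", "criticality", "churn", "bus_factor", "coverage")

def composite_risk(signals: dict[str, Optional[float]]) -> Optional[float]:
    """Multiply the five factors, or return None when any signal is missing.

    Assumption: each factor is already scaled to 0..1 with higher meaning
    riskier, so "coverage" here stands for how thin the tests are.
    """
    values = [signals.get(name) for name in FACTORS]
    if any(value is None for value in values):
        return None  # rendered as "unavailable", never as 0
    return prod(values)

def render(score: Optional[float]) -> str:
    """Format a score for display, keeping a real zero distinct from no data."""
    return "unavailable" if score is None else f"{score:.2f}"
```

Returning `None` instead of `0.0` keeps "no git history, so churn is unknown" visibly different from "measured and genuinely low".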

🔄

Shows up in your tools

Findings appear in GitHub code scanning and your IDE alongside everything else — via SARIF, the standard format those tools already read. No custom integration.
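
For orientation, a finding wrapped in a minimal SARIF 2.1.0 log looks roughly like the output of this sketch. The input field names (`rule_id`, `message`, `path`, `line`), the `note` level and the driver name are assumptions; DeepInit's actual output may carry more fields.

```python
import json

def to_sarif(findings: list[dict]) -> dict:
    """Wrap findings in a minimal SARIF 2.1.0 log."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": "DeepInit"}},
            "results": [{
                "ruleId": finding["rule_id"],
                "level": "note",  # assumption: "likely" findings are reported as notes
                "message": {"text": finding["message"]},
                "locations": [{
                    "physicalLocation": {
                        "artifactLocation": {"uri": finding["path"]},
                        "region": {"startLine": finding["line"]},
                    }
                }],
            } for finding in findings],
        }],
    }

example = to_sarif([{
    "rule_id": "db-drift",  # hypothetical rule id
    "message": "Likely reads a column the live schema no longer has.",
    "path": "billing/jobs.py",
    "line": 12,
}])
print(json.dumps(example, indent=2))
```

Each result points at one file and one start line, which is what lets code-scanning tools and IDEs place the finding next to the code.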

How it compares

Wikis and graphs give you a place to ask. DeepInit puts the answer where your agent already looks.

Wikis, code graphs, and index tools give you a separate place to go ask questions — a system that's only as current as its last crawl. DeepInit writes verified markdown straight into the context files your agent already loads — CLAUDE.md for Claude Code, AGENTS.md for the rest — so the context is just there, in front of the model, on every run. And it carries a second axis nothing else does: it's measured. The same file:line grounding that makes the context trustworthy is what lets it flag problems without crying wolf — a precision discipline we test on every change, not a promise.

Two rows are the whole story — start here, then the full table substantiates the rest.

Live database vs. your code

It spots the drift nobody else looks for.

DeepInit compares your live database schema against what the code actually reads and writes, and flags where they've diverged — a column the schema dropped that a job still reads. Schema-diff tools check the DB against a declared schema; this checks it against your code.

No other tool here does this

Grounded + verified, kept current

Every claim is checked against your code — and stays true.

Each statement cites a real file:line, is verified to exist before it's written, and is re-checked as the code changes. The others snapshot prose and go stale at the next crawl; DeepInit writes verified truth and maintains it.

Others snapshot — DeepInit verifies & maintains

What matters	DeepInit	/init (Claude Code)	Starter-file generators	Understand-Anything	GitNexus	Google Code Wiki	DeepWiki
Approach	Analyze once → write	Quick scan	Scaffold a stub	Graph to explore	Graph to query	Wiki to ask	Wiki to ask
License & cost	Free · MIT	Built in	Free / open source	Free · MIT	Paid for commercial use	Free public · cloud	Paid for private · cloud
Runs	On your machine	In the agent	Local	Local	Local	In the cloud	In the cloud
What it produces	Context files your agent reads	One context file	Context files	A graph + dashboard	A code graph	A hosted wiki	A hosted wiki
How it reads your code	Real parsing + AI, checked	Quick AI read	File scan + AI	Parsing + AI	Code graph	AI over the repo	AI over the repo
Your business rules	✓ written & ranked	—	—	~ a domain view	—	~ in prose	~ in prose
The "why" behind the code	✓	—	—	—	—	~ inferred	~ inferred
Database vs. your code	✓ spots the drift	—	—	—	—	—	—
How features cross the code	✓ traced	—	—	~ dependencies	~ dependencies	~ in prose	~ in prose
Keeps only what helps	✓	—	— no filter	n/a	n/a	—	—
Traceable to the exact file & line	✓ every finding	—	—	~	~	~ links	~ links
Checked against your code	✓	—	—	—	—	—	—
Measured precision (false-positive rate)	✓ 0/22 on real bugfixes	—	—	—	—	—	—
Flags risky / single-owner code	✓	—	—	—	—	—	—
Small file + depth on demand	✓	— one file	~ tiered	n/a	n/a	n/a	n/a
Works with your agent	✓ standard files · Claude Code first-class	— its own agent	✓ standard files	✓ many	~ some editors	~ web / MCP	~ web / MCP
Cheap to keep updated	✓ only what changed	—	—	~	~	~ auto (cloud)	~ auto (cloud)
Stays on your machine	✓	via the agent	✓	✓	✓	— cloud	— cloud

✓ = does it · ~ = partial / adjacent · — = doesn't · "n/a" = not that kind of tool.

A closer look at the main alternatives

/init (Claude Code)

Claude Code · similar in Codex

Built into the agent

The zero-effort starting point: a quick structural overview — stack, layout, a few conventions. Shallow by design. No business rules, no database, no rationale, nothing verified — in a blind nine-repo head-to-head it grounded just 0.6% of its claims to a checkable file:line (DeepInit fast: 77.6%).

DeepInit: the depth upgrade for the same files — rules with criticality, live DB drift, the why — checked against the code.

Understand-Anything

multi-agent · knowledge-graph.json + dashboard

MIT · multi-platform

A good tool: a multi-agent pipeline turns a repo into a portable knowledge graph and a dashboard you explore. Built for exploring structure, not briefing an agent up front. A graph, not a context file.

DeepInit: writes compiled, verified meaning into the files the agent already loads — instead of a graph to query. Complementary, not competing.

Google Code Wiki

Gemini · scans after every commit

Cloud · public free / private waitlist

Google's self-updating wiki: structured docs, diagrams, and a chat that points to line numbers. Excellent for human onboarding. A hosted, human-facing wiki — agent context isn't the goal.

DeepInit: 100% local and agent-facing — files the agent loads, not a hosted wiki — with the deep semantic layer, verification, and MIT.

Karpathy's "LLM Wiki"

a pattern · markdown · community code forks

Open pattern · built for documents

Not a product — Andrej Karpathy's popular pattern for turning documents into a self-maintaining markdown wiki an agent builds and you then query; community projects adapt it to code. Like the tools above, the output is a wiki you ask — not context the agent auto-loads.

DeepInit: the same instinct — precompute, don't re-derive — but aimed at a codebase, verified against the code, and written into the files your agent already reads.

Comparisons reflect publicly available information about third-party products as of June 2026. These tools evolve fast — details may be out of date. "Starter-file generators" are lightweight tools (Apify's generator, hcc, Intent and similar) that scaffold a short AGENTS.md — useful, but shallow by design; several deliberately cap themselves at ~20–30 lines. Product names and trademarks belong to their respective owners and are used for identification only; mention does not imply endorsement or affiliation. Something inaccurate? Open an issue and we'll fix it.

That drift check and the verified-to-the-line grounding are the difference. Point it at your repo:

Get started

Evidence & limits

Grounded in research — and honest about the limits.

DeepInit's design is a response to a measured, counterintuitive result: handing a coding agent a big, auto-generated context file makes it perform worse — and costs more to run. So DeepInit does the opposite of "document everything."

20%+

more cost per task — the penalty for bloat. Pile everything into one auto-generated context file and the model burns extra tokens on context it didn't need, for output that comes out worse, not better. The fix isn't more context, it's less: trim to only what the agent can't infer, and the result flips to a measured gain.

ETH Zurich / LogicStar · arXiv:2602.11988 · CC BY 4.0

0.6% → 77.6%

of claims tied to a checkable file:line, head-to-head against Claude Code’s built-in /init. On the same nine repositories — eight languages, small to large, some obscure — independent blind verifiers opened the real code and checked every claim in the lean CLAUDE.md both tools write: /init grounded 0.6%, DeepInit’s quick fast mode 77.6%. Same files, same front-door file — the difference is whether your agent gets a line it can actually open and trust.

DeepInit fast vs /init · 9 repos · 8 langs · 36 blind-scored outputs · INDICATIVE (mostly well-known OSS · fast mode · no wall-clock timing)

The honest trade-off — it costs more. Measured on Claude Opus in that same head-to-head: /init averaged ~5,700 output tokens per run; DeepInit fast averaged ~46,000 — roughly 8× the output, because DeepInit also writes the deep .ai/docs/ tier, not just the one file. (Input volume is dominated by discounted prompt-cache reads, so the bill isn’t proportional to raw input — the uncached input is tiny.) We’re holding the per-run dollar figure blank on purpose — the tokens are measured, the published price waits on one clean end-to-end accounting run; and because DeepInit runs in your own Claude session, on a subscription those tokens come out of your existing plan, not a separate per-run bill. You pay once for the depth; /deep-init:refresh re-reads only what changed and the 0-token /deep-init:check is free after.

0 / 22

false positives on real, human-merged bug fixes — it never re-flagged a line a maintainer had already fixed. Across four more real repos, a naive rule-checker would have fired ~90 false alarms; DeepInit fired none.

DeepInit’s own measurement, on real code · How we tested it

0 wrong

Run blind on 8 real repos with their own architecture docs removed, DeepInit never once confidently stated a fact the code disproves. That's the Mirror Test — does it actually understand your architecture? We strip a project's own docs (.NET, Rust, Python, Go), run on the code alone, and grade what it re-derived. Zero confidently-wrong facts — the same hard bar as the precision result above, applied to understanding.

What it re-derived (indicative): ~66% of what the human docs state, at 98% faithfulness — strongest on structure (components, dependencies, data stores), honestly weaker on deep invariants. Is it just memorizing famous repos? No: on two obscure repos a model is very unlikely to have memorized (one Go, one Rust) faithfulness held at 100% — so what it states about your unfamiliar code is just as trustworthy. The coverage number moves with how deep one pass goes; the trustworthiness doesn't.

The honest per-fact-kind picture — where it’s strong, and where the frontier is. On the held-out set it re-derives the structural facts well: component-exists 82%, component-role 84%, entry-point 88%, dependency-edge 73%, technology-choice 73%, data-store 75%. The current frontier is the hardest semantic “why” facts — boundary-rule 32% and key-invariant 29% — which we surface openly rather than bury; a frozen per-kind coverage floor keeps them from regressing as the spec moves.

DeepInit’s own measurement, docs removed · INDICATIVE (8 held-out repos; 2 contamination-resistant shown) · the precision result above stays the headline

What the research says

Across multiple agents and models, auto-generated context files didn’t generally improve task success and cost 20%+ more per task — yet with a project’s existing docs stripped away, trimming to only the non-obvious flipped it to a measured gain. That inversion is the whole design.
Gloaguen et al., Evaluating AGENTS.md — ETH Zurich / LogicStar, Feb 2026 · arXiv:2602.11988 (CC BY 4.0)

Why bloat backfires: models lose what’s buried mid-context, and output quality degrades as input grows — well before the window is full.
Liu et al., Lost in the Middle — Stanford / TACL 2024 (accuracy drops 30%+ when key info sits mid-context) · Chroma Research, Context Rot, 2025 (18 frontier models)

The problems DeepInit flags target the costly, hard-to-find faults — functional and integration bugs, and the business-logic flaws pattern-matching scanners miss.
DeepInit design rationale — the full evidence trail, every decision traced to its source, lives in the repo

Our own precision is measured and exact — 0 of 22 false positives on real, human-merged bug fixes. The task-success upside is still being benchmarked; that number goes here when the runs finish, not before.
How we tested it → · every figure self-derives from committed records

The limits

It reads your code — it doesn't run it.

DeepInit analyzes your source and inspects your live database schema (read-only). It doesn't execute your app, so purely runtime behavior — load, race conditions, anything that only surfaces live — is out of scope.

It flags likely problems — it won't prove them.

Findings are framed as likely, grounded to a file:line, with linter-territory suppressed. It is not a security product and makes no safety guarantee — access-control gaps are surfaced as risk to review, not proof your code is safe.

The first run is thorough — not instant.

A full analysis does real work: parsing, database checks, and multiple verification passes. Expect minutes and real tokens the first time, not seconds — after that, updates are incremental and cheap.

Measured precision, an architecture it re-derived from code alone, the limits stated plainly. That's the proof — run it on your own repo:

Get started

How we tested it

Engineered to be right — measured not to cry wolf.

A finding earns trust two ways: it is grounded in your actual code, and it is not noise. DeepInit is built for the first and measured for the second — on its own fixtures, on 22 real human-merged bug fixes, and on real open-source repos.

Right — grounded, verified, demonstrated

Grounded to the line, verified before written — Every claim cites a real file:line and is checked to exist against your code first — the difference between analysis and a confident guess.
It reasons about your code, not just patterns — On real repos it read legitimately-open endpoints as valid exceptions, diagnosed a documented rule that had gone de-facto stale, and re-found a fixed restic bug blind. On a full-pipeline analysis of the kemal web framework it raised zero issues — all nine structural near-candidates suppressed by a named guard rather than raised as a guess. Comprehension, not keyword-matching.
Tested like production software — A 450-check regression harness runs on every change; blind multi-lens runs and an independent oracle (a held-out answer key the runs are graded against — nothing to do with databases) keep the numbers honest as the model evolves.

Then the harder half — the false-positive rate, because a false alarm is what gets a tool turned off:

False positives 0/22 on real fixed bugs — Replaying 22 real human-merged bug fixes (3 of them CVEs), it never re-flagged a line already fixed.
0 false alarms where a naive scanner fires ~90 — Across four real repos with documented rules the guards held: about 90 naive false defects avoided, none raised.
A record of detectors we refused to ship — When a check would cry wolf, we leave it off — and say so.

These are DeepInit’s own measurements on real code, kept exact. Recall (the false-negative side) on those real bug fixes was 14/22 — labeled indicative, below our own ship-gate, not a headline — and the speedup benefit is still being benchmarked, so this page leaves it blank rather than invent one.

The five layers — and the limit of each

Trust comes from stating what each test proves and what it can’t. In rough order of independence:

A 450-check regression harness across 99 oracle sections (deterministic, no model) — proves the engine’s mechanical logic stays correct on every change: dependency ordering, the cost math, the issue oracles, the census arithmetic, SARIF shape, and the scoring that grades the blind runs below. Doesn’t prove the model-driven findings are right — it grounds and scores fixtures, it doesn’t run the live analysis.
Blind runs on our own fixtures — multiple independent lenses, each blind to the answer key, agree: recall 9/9, false positives 0. Doesn’t prove real-world recall — these fixtures were designed alongside the spec, so 9/9 over-states what you’d see on unseen code. We say so.
An independent oracle on real fixed bugs — 22 human-merged bug fixes across 22 repos and 4 languages (3 of them CVEs). The hard, gated result: it never re-flagged a line a maintainer had already fixed (0/22). Doesn’t prove a headline recall number — recall here was 14/22, small-sample and below our own ship bar, so it stays an indicative sub-line, never a claim.
Naive-vs-guarded precision — on four real repos carrying a documented rule, a naive “mismatch = violation” checker would have fired ~90 false alarms; the guarded detector raised none, every census signal arithmetically correct. Doesn’t prove those repos had bugs — they’re clean; this measures precision and scope-honesty, not recall.
A record of the checks we refused to ship — where a check’s decidable rule couldn’t be told apart from its real-defect rule (it would cry wolf), we measured it, left it off, and wrote down why. Doesn’t prove those problem classes are undetectable — each deferral names exactly what would unblock it.

This is the harness — not the model. Together, those five layers are the harness: the engineered apparatus around the model that makes its understanding of your code trustworthy. It is the part a weekend prompt doesn’t have. A prompt hands you one ungrounded guess; the harness grounds every claim to a file:line, measures its own false-alarm rate, and is regression-tested on every change — so it doesn’t quietly drift as the model moves underneath it.

Run at the right depth — four tiers, not the whole suite every time

Maturity isn’t running every test on every change — it’s running the right depth for the change in front of you. The suite is tiered, so a typo fix doesn’t pay for a release-grade sweep:

L0 · smoke

Seconds. Runs on every edit — the fast tripwire.

L1 · gate

Only the mutations whose target files changed.

L2 · full

The whole mutation sweep, drift checks, and the public harness — every release.

L3 · deep

Metered real-engine runs — only when the engine’s output could actually move.

Seconds for a doc fix; the full mutation sweep for a release; token-spending real-engine runs only when they’d change the answer — a cost discipline a one-shot prompt has no way to make.

How we keep it honest — the techniques, not just the count

A big test count is easy to fake. What makes the 450 checks across 99 oracle sections trustworthy is how they’re built — six disciplines borrowed from how you’d test core software, not a prompt:

🔁

The answer key is the maintainer’s patch

Metamorphic bug-fix replay. We replay 22 real, human-merged bug fixes (3 of them CVEs): the detector must flag the broken commit and must not re-flag the fixed one. The ground truth predates our spec, so nothing can be over-fitted to it — and re-flagging an already-fixed line came up 0 times (0/22).
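
The replay idea can be sketched with a toy harness. `FixCase`, `replay` and the detector signature are hypothetical stand-ins; a real case would pin two commits of a repository, not two strings.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class FixCase:
    broken: str  # source at the commit before the maintainer's fix
    fixed: str   # source at the commit after the fix

def replay(cases: list[FixCase], detector: Callable[[str], bool]) -> dict[str, int]:
    """Count detections on broken commits and re-flags on fixed ones."""
    detected = sum(1 for case in cases if detector(case.broken))
    reflagged = sum(1 for case in cases if detector(case.fixed))
    return {"cases": len(cases), "detected": detected, "reflagged": reflagged}

def toy_detector(source: str) -> bool:
    """Stand-in detector: flags a comparison to None written with ==."""
    return "== None" in source

CASES = [
    FixCase(broken="if user == None: deny()", fixed="if user is None: deny()"),
    FixCase(broken="total = price * qty + tax", fixed="total = (price + tax) * qty"),
]
print(replay(CASES, toy_detector))
# {'cases': 2, 'detected': 1, 'reflagged': 0}
```

In these terms, `reflagged / cases` is the gated false-positive figure (0/22 on this page) and `detected / cases` is the reported recall (14/22).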

🙈

The grader never tells the engine the answer

Blind, separated duties. The engine emits its findings before anyone sees the key; an independent party pins the held-out answers; a third scores. In the Mirror Test the reference doc is provably removed from the inputs and the commit pinned by hash — so it reconstructs your architecture from code alone, never from a doc it quietly read.

🧬

We test that our tests would catch a regression

Mutation meta-testing. A meta-harness makes one known-bad edit at a time to a committed fixture and demands the suite go red. 166 of 166 mutations killed, 0 survived — proof the checks are load-bearing, not decorative. A check that can’t catch a planted bug proves nothing.
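
A toy version of that meta-harness, with a two-check suite and two planted mutations. The fixture, the suite and the mutation names are invented for illustration.

```python
def suite_passes(fixture: dict) -> bool:
    """Stand-in for the real suite: two checks over a tiny fixture."""
    ordered = fixture["order"] == sorted(fixture["order"])
    total_ok = fixture["total"] == sum(fixture["parts"])
    return ordered and total_ok

GOOD = {"order": [1, 2, 3], "parts": [2, 3], "total": 5}

# Each mutation makes exactly one known-bad edit to a copy of the fixture.
MUTATIONS = {
    "swap-order": lambda f: {**f, "order": [2, 1, 3]},
    "wrong-total": lambda f: {**f, "total": 6},
}

def run_mutations(good, mutations, suite):
    """Apply each mutation alone and record the ones the suite fails to catch."""
    assert suite(good), "the suite must be green on the unmutated fixture"
    survived = [name for name, mutate in mutations.items() if suite(mutate(good))]
    return {"killed": len(mutations) - len(survived), "survived": survived}

print(run_mutations(GOOD, MUTATIONS, suite_passes))
# {'killed': 2, 'survived': []}
```

A mutation that survives names a gap directly: some property of the fixture can be broken without any check going red.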

🔒

A model change can’t quietly rot a number

Frozen baselines + a drift guard. The measured zeros (no confidently-wrong HIGH facts on our test set, zero false defects, zero re-flagged fixes) are re-asserted on every change. Every figure on this page is derived from committed records by one aggregator — no number is hand-typed, and each derived figure must recompute from its own inputs, so a stale one fails the build. Not theoretical: a round-trip check recently caught a stale figure in our own cost model — our gates caught our own mistake.

🧪

We test the product, not just the Python

A two-tier integration framework. The deterministic harness checks the reference logic; on top of it an env-guarded metered runner drives the real engine on a pinned corpus, and a separate 0-token auditor re-derives that run’s own claimed numbers — coverage, citation-resolution, the zero-wrong-HIGH result — from its committed artifacts and demands they reproduce. The run can’t grade its own homework.

📉

A quality score that can’t silently slip

A feedback loop with frozen floors. Every run rolls up into one quality scorecard whose hard line is zero confidently-wrong HIGH-severity facts on our held-out test set; per-kind coverage carries frozen floors a spec edit can’t silently drop below; and a 0-token replay oracle re-scores past runs in CI — so a change that moved coverage shows up as a failing check, not a surprise.
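The gate described above might look roughly like this. The scorecard layout, the coverage kinds, and the floor values are assumptions made for illustration; only the two rules (zero wrong HIGH facts, no coverage below its frozen floor) come from the text.

```python
from typing import Dict, List

# Hypothetical frozen floors per coverage kind (values are illustrative).
FROZEN_FLOORS: Dict[str, float] = {"business_rule": 0.80, "workflow": 0.70}


def gate(scorecard: dict, floors: Dict[str, float]) -> List[str]:
    """Return one failure message per violated rule; an empty list means the gate passes."""
    failures: List[str] = []
    # Hard line: no confidently-wrong HIGH-severity facts at all.
    if scorecard["wrong_high_facts"] != 0:
        failures.append(
            f"wrong_high_facts={scorecard['wrong_high_facts']}, must be 0"
        )
    # Frozen floors: coverage for each kind may not drop below its floor.
    for kind, floor in sorted(floors.items()):
        got = scorecard["coverage"].get(kind, 0.0)
        if got < floor:
            failures.append(
                f"coverage[{kind}]={got:.2f} is below its frozen floor {floor:.2f}"
            )
    return failures


passing = gate(
    {"wrong_high_facts": 0, "coverage": {"business_rule": 0.84, "workflow": 0.75}},
    FROZEN_FLOORS,
)
slipped = gate(
    {"wrong_high_facts": 0, "coverage": {"business_rule": 0.84, "workflow": 0.66}},
    FROZEN_FLOORS,
)
```

Because the floors live in a committed constant rather than in the scorecard itself, lowering one is a visible diff, not a side effect of a spec edit.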

The 450 checks run with no model in the loop (deterministic) and grow only by addition — the original engine checks must never regress. Recall is reported, never gated, and kept below the headline because our own fixtures over-state it; precision (the false-alarm side) is what we gate on.

Field-validated across 15 language stacks (of the 25 its parser supports)

The same detectors, run over ~1.12M lines of real open-source code in 15 languages — Go, Rust, C, C++, Java, C#, Kotlin, PHP, Ruby/Rails, Elixir, OCaml, Swift, Python, TypeScript, Crystal — comprehend each language’s own structure rather than matching a surface pattern. The clearest example is circular dependencies: the identical check stays silent where the language forbids cycles (Go packages, Rust crates, C# assemblies, and the OCaml and Swift build manifests — verified by building the real dependency graph), fires on a genuine one where the language permits it (a 33-package cycle in a Java backend, a 31-component cycle across a PHP framework’s separately-published packages, a 14-package cycle in a Kotlin HTTP client, two namespace cycles inside a C# media server, and a real cycle in nginx’s foundational C core), and reads the subtler cases right: on Elixir it separates the compile-time graph the compiler keeps acyclic from the runtime cycle it allows, and on the C/C++ pair — the same #include model — it fires on nginx yet stays silent on a strictly-layered C++ library, because that regime permits cycles without requiring them. Sharper still: the same swallowed-error check is inapplicable on pure C (which has no exception construct to match) yet correctly re-activates on C++ — same check, different language. Each fire was independently re-computed a second way before it was trusted.

Cycle regime	What the language does	Field witnesses	The same IF-8 check…
Hard ban	A cross-component cycle is a compile/build error — structurally impossible.	Go, Rust, C#, OCaml, Swift	— stays silent (0 cycles)
Partial ban	Compile-time cycles banned; runtime call-cycles permitted.	Elixir (Phoenix)	— silent compile-time · ✓ fires at runtime
Permitted, explicit	No ban; dependencies are explicit `import` statements — fully groundable.	Java, PHP, Kotlin, C# namespaces, TypeScript	✓ fires — a real SCC
Permitted, textual	No module system; `#include` + guards let a cyclic include graph compile.	C (nginx) · C++ (Poco)	✓ nginx · — Poco
Permitted, hidden	Dependencies are implicit (autoloaded constant refs) — below an import-grep substrate.	Ruby (Rails / Zeitwerk)	declines to fabricate — honest gap, never a false alarm

Same check, five compiler regimes. It builds the real dependency graph for each language's actual unit of modularity, then runs a genuine cycle search — so it fires where cycles are real and permitted, and stays silent where the language forbids them. The C/C++ row is the proof: identical #include model, nginx firing a real triad while a strictly-layered C++ library stays clean.
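A "genuine cycle search" over a dependency graph is commonly done by finding strongly connected components. The sketch below uses Tarjan's algorithm as one standard choice; the source does not say which algorithm DeepInit uses, and the component names are hypothetical.

```python
from typing import Dict, List, Set


def find_cycles(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Return the strongly connected components that contain a cycle (Tarjan's algorithm)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    cycles: List[List[str]] = []

    def visit(node: str) -> None:
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in sorted(graph.get(node, ())):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component: List[str] = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            # A single node only counts as a cycle when it depends on itself.
            if len(component) > 1 or node in graph.get(node, ()):
                cycles.append(sorted(component))

    for node in sorted(graph):
        if node not in index:
            visit(node)
    return sorted(cycles)


# Hypothetical component graphs: edges point from a component to what it imports.
LAYERED = {"api": {"core"}, "core": {"util"}, "util": set()}
CYCLIC = {"orders": {"billing"}, "billing": {"auth"}, "auth": {"orders"}, "util": set()}

layered_cycles = find_cycles(LAYERED)
real_cycles = find_cycles(CYCLIC)
```

On the layered graph the search returns nothing, so the check stays silent; on the second graph it returns the three-component cycle. The recursion is fine for a sketch, but a graph with very long dependency chains would call for an iterative version.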

Every one of these is recorded as a structural observation, never filed as a bug or published — and these are direct detector sweeps, not the full graded pipeline. The point is comprehension across the ecosystem, not a bug count on famous repos.

Where Graphify ends and the understanding begins. Graphify is the AST extractor that sharpens the structural graph on 25 languages — it resolves an import to the file that defines the symbol, which a grep can’t. But it’s an accelerant, not the engine. A stack with no grammar (Crystal, OCaml) automatically falls through to a ctags/grep import graph that still captures roughly 80% of cross-component imports — the run just carries lower certainty, and the rule is always degrade, don’t false-flag. The proof it isn’t a crutch: most of these 15 field sweeps were run on the grep fallback, before Graphify was wired in as the default — so the comprehension above stands on its own. The language reasoning — knowing a cycle is even possible in this language before flagging one — is DeepInit’s, layered on top of whatever parser is available.

Tested across a language × size matrix

Beyond the sweeps, the full pipeline is run on a deliberate matrix of 16 leading repositories (13 languages × three size tiers), each pinned to a commit and measured, not cherry-picked. The project types cover the ways real codebases differ: web frameworks (gin/Go, express/JS, sinatra/Ruby, laravel/PHP at 330k lines, phoenix/Elixir), libraries & CLIs (click, gorilla/mux, itsdangerous, fmt/C++, a Kotlin schema lib, uniffi-rs/Rust), and larger apps, data stores & SDKs (redis/C at 346k lines, excalidraw/TypeScript at 157k, pyccel transpiler, a commercetools Java SDK). 15 of the 16 parse on the designed AST path; the one that doesn’t (Crystal, no grammar) proves the grep-fallback degradation path end-to-end.

Measured per size tier — and it tells you before it spends

DeepInit estimates the cost up front, before a token is spent. We measured the real output across the matrix to keep that estimate honest: a small library runs around 150–160k tokens, a medium framework 80–230k, and a large 100k-line transpiler about 200k — one thorough pass, then incremental updates that re-read only what changed. We’re holding the dollar figure blank on purpose: the token counts are measured, but a published price waits on one clean end-to-end accounting run rather than an estimate. (INDICATIVE; Claude Opus pricing as of June 2026; re-derivable from committed records.)

Why real understanding beats “just send it to an LLM”

We measured it. On three of those repos we ran the analysis three ways — the full designed path (AST + grounding + verification), the grep fallback, and a naive LLM-only baseline (the controlled stand-in for “dump the code into a model and ask for docs”, no structural parse / no grounding / no verification) — and scored each, blind, against the parsed graph and the real code. The full path grounded ~99% of its claims to a verified file:line; the naive baseline grounded ~44% (and 0% on one repo — it cites filenames, not lines), inflated the dependency graph with edges that aren’t imports, and surfaced none of the grounded security-relevant findings the verified paths did. This is a different measurement from the /init head-to-head above: that one pits DeepInit’s fast mode against /init across nine mixed repositories (the 0.6% → 77.6% grounding gap); this one pits DeepInit’s full mode against a raw-LLM baseline on three famous repos (~44% → ~99%). Different modes, different repos, different baselines — the ~99% here and the 77.6% above are not the same number, and shouldn’t be averaged. The honest part: on these famous repos every mode was ~99–100% faithful (the model knows them) — so the real difference isn’t hallucination, it’s whether you can trust which line a claim refers to, and whether the gaps get caught. That is the whole thesis: a prompt gives you a plausible description; the harness gives you a grounded, verified one.
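To make "grounded to a verified file:line" concrete, here is a minimal sketch of a citation resolver and a grounding rate. It assumes citations are written as `path:line` and that the repository is available as an in-memory mapping of path to text; both are simplifications for illustration, not DeepInit's scoring code.

```python
import re
from typing import Dict, Iterable

# A citation must name a file and a line number, e.g. "auth/session.py:2".
CITATION = re.compile(r"^(?P<path>[^:]+):(?P<line>\d+)$")


def resolves(citation: str, repo: Dict[str, str]) -> bool:
    """Return True when the citation points at a line that exists in the repo snapshot."""
    match = CITATION.match(citation)
    if match is None:
        return False  # e.g. a bare filename with no line number
    text = repo.get(match["path"])
    if text is None:
        return False  # the cited file is not in the snapshot
    line_no = int(match["line"])
    return 1 <= line_no <= len(text.splitlines())


def grounding_rate(citations: Iterable[str], repo: Dict[str, str]) -> float:
    """Share of citations that resolve to a real file and line (0.0 when there are none)."""
    cited = list(citations)
    if not cited:
        return 0.0
    return sum(resolves(c, repo) for c in cited) / len(cited)


# Hypothetical three-line file standing in for a repository snapshot.
REPO = {"auth/session.py": "import time\nTTL = 900\ndef expired(t): return time.time() - t > TTL\n"}

rate = grounding_rate(
    ["auth/session.py:2", "auth/session.py", "auth/session.py:9", "missing.py:1"],
    REPO,
)
```

Only the first citation resolves here: the second names a file but no line, the third points past the end of the file, and the fourth cites a file that does not exist. A stricter resolver could also require that the cited line contains the symbol the claim is about.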

All figures here are DeepInit’s own, INDICATIVE (small-n, repo@SHA-grounded, re-derivable from committed records); the precision result above stays the headline. We also run DeepInit on our own tools — an independent review of a dogfood run on our plugin returned “would use”, 10 of 11 spot-checked claims correct, every hard count exact.

Who & when

Run it the moment a codebase outgrows what an agent can infer on its own.

🏚️

You inherited a legacy / under-documented repo

Point DeepInit at it and get a grounded map in minutes: the architecture, the components, and the non-obvious rules a new engineer (or agent) would trip over — every claim cited to a file:line you can open.

🗺️

You’re in a large, fast-moving codebase

Even code your agent helped write drifts: rules accrete, modules start sharing state, the live schema moves, and a single hand-written CLAUDE.md goes stale. DeepInit keeps a grounded CLAUDE.md current: /deep-init:check audits it for staleness and /deep-init:refresh re-analyzes only what changed, so the agent works from what the code is now, not what it was when you last wrote it down.

🤖

You’re onboarding a coding agent

The lean CLAUDE.md (and the AGENTS.md / Copilot / Cursor / Windsurf projections) give the agent the load-bearing context up front — so it stops guessing the conventions and architecture on every task.

🔧

You're about to refactor

Before you move things, see the key invariants, the boundary rules, and the hidden couplings (the same drift / contradiction / circular-dependency families it reports) — so the refactor doesn't quietly break an unwritten rule.

Straight answers

The four questions everyone asks.

Isn't this just /init?

No. /init writes a quick starter file from a one-pass read. DeepInit parses your code first (real AST via Graphify, with a graceful fallback), grounds every claim to a verified file:line, separates a lean always-loaded tier from a deep on-demand one, and reports the problems it finds — and it's regression-tested so it doesn't quietly drift as the model changes. The comparison table above lays out the difference column by column.

Does it touch my source code?

No. It is report-only. It writes documentation (a CLAUDE.md owned region and an .ai/ folder) and never edits your source. It writes a .bak before touching any existing file and preserves human-authored content byte-for-byte.

Will it leak my code?

No — it's 100% local. The skill declares no network tool; there is no egress path (we gate that as a test). Parsing and analysis run on your machine in your existing agent session. Secrets and PII are redacted before anything is written, and the report it generates opens offline with zero network calls.

What does it cost to run?

It runs in your own agent session, so the cost is the model tokens of one analysis pass — no subscription, no API key for the parser. A small repo is an inexpensive single pass; a large one costs more. We're finishing a clean per-tier benchmark before publishing a dollar figure rather than inventing one.

Install & activate

Install the plugin once. Then run /deep-init.

DeepInit ships as a Claude Code plugin. To be clear, “plugin” and “skill” aren’t alternatives — the plugin is just the delivery package, and the /deep-init skill that does the work lives inside it. Installing the plugin is how you get the skill; after that, you run /deep-init in any project. No subscriptions, no API keys, no servers.

1 · Add the marketplace, install the plugin

/plugin marketplace add deepfusionlabs/deep-init
/plugin install deep-init@deepfusionlabs-deep-init

These are slash commands you type into the Claude Code chat (not your terminal). They install from DeepFusion Labs’ plugin marketplace on GitHub.

2 · Reload so the plugin loads

Claude Code reads plugin commands only at startup, so activation depends on where you run it. In a plain terminal, start a new session (or run /reload-plugins). Inside VS Code or JetBrains, restart the IDE itself — Developer: Reload Window does not reload the plugin host, so a window reload won’t pick it up.

3 · Run it, in any project

/deep-init      # zero config — the full, thorough analysis (2 review cycles). The whole getting-started.

Updating later

/deep-init:plugin-update   # pulls the newest version on one confirm, then guides the reload

Requirements: Claude Code — that’s the only prerequisite. The repo is public, so the two commands above work for anyone: add the marketplace, install the plugin, done.

Stays fresh on its own, honestly. The plugin ships a 0-token freshness check that fires on two events (when you open a session, and on the first prompt of a stale one) and offers a one-click refresh that shows exactly what drifted. It never spends a token to detect, never updates without your click, and the off-switch is one line.
Safe on existing files. If your repo already has a CLAUDE.md (or AGENTS.md), DeepInit backs it up first and preserves anything inside your keep-markers — your hand-written notes survive.

More control, when you want it

A bare /deep-init is the zero-friction default. Beyond it, a curated menu of slash commands — no flags to memorize, nothing to mistype:

You want to…	What it does	Command
The full run (default)	2 adversarial review cycles — plus a 3rd automatically when the analysis isn’t yet clean	`/deep-init`
A quick first pass	Skips the review cycles — faster and cheaper	`/deep-init:fast`
Refresh only what changed	Incremental re-analysis of the touched components + issue lifecycle	`/deep-init:refresh`
Check it’s still true	0-token staleness + broken-citation audit (no model call)	`/deep-init:check`
Tune the run with buttons	Depth · issues · outputs · scope · cost — a native picker, no typing	`/deep-init:customize`
Translate the report	Emit `report.<lang>.html` — Spanish & Hebrew built in, any other language on demand (English stays canonical)	`/deep-init:translate`
Preflight (0 tokens)	Tools, scope, resolved config, families, cost estimate	`/deep-init:doctor`
Which version is running	Loaded vs on-disk — tells you if you need to reload	`/deep-init:version`
Update the plugin	Pull the newest from the marketplace, on one confirm	`/deep-init:plugin-update`
Every command + option	Grouped and ordered by how often you’ll use it	`/deep-init:help`

Type-safe by design — no flags to memorize. Each option lives where it costs you least: a command for the common dials, a button picker (/deep-init:customize) for the rest, and a JSON-Schema-validated .ai/deepinit.config your editor autocompletes and checks before you run. The literal flags and natural language still work for power users and CI.

It even knows its own version. Claude Code loads a plugin’s commands once per session, so after an update you can’t normally tell what’s actually running — /deep-init:version compares the loaded version against what’s on disk and tells you if you need to reload, and /deep-init:plugin-update pulls the newest in one confirm. Most plugins can’t tell you what’s live.

Zero setup: DeepInit checks for its one dependency (scc) and installs it for you if it’s missing. Graphify gives richer parsing and installs the same way — optional; if you skip it during setup, DeepInit falls back to ctags/grep.
