A finding can't post unless it earns it.
The field's pipelines have two stages: a model speaks, and what it says gets posted. Ours has five, run by a pantheon of five specialists and disciplined at every seam. Believability isn't a feature bolted on; it is the pipeline. Each gate is built, hardened, and measured one at a time. A better model replaces none of it.
A diff goes in, a model speaks, what it says gets posted. The model is detector, judge, and witness at once.
The deterministic layer is the witness; the model's job narrows to interpretation — the job models are actually good at.
Not every event deserves the same pass.
A single GitHub or Linear event doesn't map to one action. A dispatcher routes each one to the pipeline it deserves — so a cheap PR overview lands in seconds instead of waiting behind the full five-specialist ensemble.
A structured summary of the change, in seconds.
The deep five-specialist pass with every gate.
Reads the job logs and the diff to reach a grounded root cause.
Turns a vague Linear or Jira ticket into a priority, an estimate, and the exact failure path.
On a slash-command — rewrite the PR body, or apply a fix.
Ask in a PR, issue, or Slack thread — answered, grounded in the repo.
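As a rough illustration of the routing described above, the sketch below maps one incoming event to the pipelines it should trigger. The event kinds, pipeline names, and the `Event` shape are all hypothetical, chosen for the example; they are not Sigilix's actual identifiers.

```python
from dataclasses import dataclass


# Hypothetical event shape; field names are assumptions for this sketch.
@dataclass(frozen=True)
class Event:
    source: str      # e.g. "github", "linear", "slack"
    kind: str        # e.g. "pull_request.opened"
    body: str = ""   # comment or message text, if any


def dispatch(event: Event) -> list[str]:
    """Return the pipelines an event should trigger, cheapest first."""
    if event.kind == "pull_request.opened":
        # One event, two actions: the fast overview does not wait
        # behind the deep five-specialist pass.
        return ["overview", "deep_review"]
    if event.kind == "ci.failed":
        return ["ci_diagnosis"]
    if event.kind == "ticket.created":
        return ["triage"]
    if event.kind == "comment.created":
        if event.body.strip().startswith("/"):
            return ["slash_command"]
        return ["chat"]
    return []  # unknown events trigger nothing
```

Returning a list rather than a single pipeline is what lets the overview be scheduled independently of the deep review.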
It reviews the diff in the context of the whole repo.
Diff-only review misses what the diff touches. Before any specialist runs, Sigilix expands context through the code graph — the callers, the dependencies, the symbols a change ripples into — alongside the index and everything prior reviews already learned. The specialists judge the change against the codebase, not a 200-line window.
Walks callers, imports, and dependents — so a change is judged by what it actually affects.
Pulls the exact definitions and usages a finding depends on, not a fuzzy keyword grab.
The conventions and verdicts this repo already taught it, carried into the new review.
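One way to picture the graph expansion is a bounded breadth-first walk outward from the changed symbols, following edges in both directions so that callers and callees are both reached. This is a minimal sketch under assumptions: the `edges` mapping, the symbol names, and the `max_hops` limit are illustrative and not taken from the product.

```python
from collections import deque


def expand_context(changed, edges, max_hops=2):
    """Return {symbol: hop distance} for symbols within max_hops of a change.

    `edges` maps a symbol to the symbols it calls or imports
    (a hypothetical shape for this sketch).
    """
    # Index edges in both directions so callers are reachable too.
    neighbors = {}
    for src, dsts in edges.items():
        for dst in dsts:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    seen = {sym: 0 for sym in changed}
    queue = deque(changed)
    while queue:
        sym = queue.popleft()
        if seen[sym] == max_hops:
            continue  # do not walk past the hop limit
        for nxt in sorted(neighbors.get(sym, ())):
            if nxt not in seen:
                seen[nxt] = seen[sym] + 1
                queue.append(nxt)
    return seen
```

The hop limit keeps the expanded context proportional to what the change can plausibly affect instead of pulling in the whole repository.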
Five specialists run the review.
A single generalist model hopes. The pantheon divides the surface area no one model covers alone — each tuned for one class of failure, each bound by a contract that turns plausible-sounding fabrication into a missing required field.
Metis
Dead code, naming violations, logic errors a type-checker can't see, unreachable branches, and the test-coverage gaps that hide them. Metis reads intent, not just syntax.
Argus
Secret leakage, unsanitized inputs, auth bypasses, SSRF, insecure regexes, and OWASP-relevant patterns. Nothing crosses Argus without a citation.
Iris
N+1 queries, unnecessary re-renders, memory leaks, unbounded recursion, and Big-O regressions. Iris finds the hot path before production does.
Eunomia
Missing edge-case coverage, flaky assertions, untested failure paths, and brittle fixtures. Eunomia holds the diff to the contract it claims to satisfy.
Harmonia
Harmonia doesn't find bugs; he resolves the noise of the other four. He deduplicates overlapping findings, suppresses false positives by cross-reference, ranks by merge-blocking impact, and writes the single comment you actually read. And he is bound in the opposite direction: he may drop a high-severity specialist finding only by quoting a proof-like contradiction from the evidence. A finding dropped on a generic “looks fine” is restored deterministically by the floor.
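The floor on dropped findings can be sketched as a deterministic post-check: a high-severity finding stays dropped only if the stated reason quotes text that actually appears in the evidence. The field names (`id`, `severity`, `quote`) and the verbatim substring test are assumptions made for this example.

```python
def apply_floor(specialist_findings, kept_ids, drop_reasons, evidence_text):
    """Restore high-severity findings dropped without a quoted contradiction.

    Returns the ids that survive, in the specialists' original order.
    """
    final = set(kept_ids)
    for finding in specialist_findings:
        if finding["id"] in final or finding["severity"] != "high":
            continue
        quote = drop_reasons.get(finding["id"], {}).get("quote", "")
        # A drop stands only if the quote is non-empty and appears
        # verbatim in the evidence; otherwise the finding comes back.
        if not quote or quote not in evidence_text:
            final.add(finding["id"])
    return [f["id"] for f in specialist_findings if f["id"] in final]
```

Because the check is plain string matching, no model output can talk its way past it.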
The architecture that disciplines them.
The specialists never get the last word. Believability is enforced at five seams — before the model, around it, after it, at the surface, and over time.
Deterministic evidence first.
No specialist is the first authority on what a diff contains. Before any model runs, a deterministic pass computes the evidence the models will be held to — secret scans, dependency-vulnerability lookups, AST rule hits, static-analysis findings — assembled into per-review manifests. A specialist can't assert a credential is exposed in the abstract; it must cite the manifest.
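A minimal sketch of the "must cite the manifest" rule, assuming a manifest keyed by evidence id where each entry records a kind and a file. The key names and the matching rule are hypothetical; the point is only that a claim without a resolvable citation is rejected before anyone reads it.

```python
def check_against_manifest(finding, manifest):
    """Accept a finding only if it cites a matching manifest entry."""
    entry = manifest.get(finding.get("evidence_id"))
    if entry is None:
        return False  # an abstract claim with nothing to point at
    # The cited evidence must be about the same kind of issue and file.
    return entry["kind"] == finding["kind"] and entry["file"] == finding["file"]
```

For example, a "credential exposed" finding passes only when a secret-scan entry for that same file exists in the manifest computed before the model ran.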
Provenance contracts.
Synthesis is where hallucination would enter, so it's bound by structural contracts, not instructions. Every synthesized finding must cite the identifiers of the real specialist findings it derives from, and a structural check drops any finding whose citations don't resolve. The synthesizer can't introduce a finding no specialist produced — a fabrication has nothing to cite. This is why the hallucinated-finding count is zero, not merely low.
tests → must name the exact test files inspected
performance → must cite a concrete hot path or a complexity estimate
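The structural check can be sketched as follows. It assumes each synthesized finding carries a `derived_from` list of specialist finding ids, and that each specialist category has one required field; both names, and the `REQUIRED_FIELDS` table, are illustrative rather than the product's schema.

```python
# Hypothetical per-category required fields, mirroring the two contracts above.
REQUIRED_FIELDS = {"tests": "test_files", "performance": "hot_path"}


def satisfies_contract(finding):
    """True if the finding fills the field its category requires."""
    field = REQUIRED_FIELDS.get(finding["category"])
    return field is None or bool(finding.get(field))


def enforce_contracts(synthesized, specialist_findings):
    """Keep only synthesized findings whose citations all resolve."""
    known = {f["id"]: f for f in specialist_findings}
    kept = []
    for item in synthesized:
        cited = item.get("derived_from", [])
        # No citations, or a citation that does not resolve: dropped.
        if not cited or any(c not in known for c in cited):
            continue
        if all(satisfies_contract(known[c]) for c in cited):
            kept.append(item)
    return kept
```

A fabricated finding fails here for a structural reason, a missing or unresolvable id, rather than because a prompt asked the model to behave.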
Refutation, then execution.
Each specialist must try to refute its own candidates against the code; a candidate cleared this way is recorded with the exact mitigating token quoted from source. Then findings that admit a runtime check are actually run — the model's opinion loses to the program's behavior. Execution receipts are cryptographically bound to (repo, commit SHA, run id, attempt) with single-use enforcement, so a receipt can't be replayed or forged.
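The text does not say which cryptographic scheme binds a receipt, so the sketch below swaps in one plausible choice as an assumption: an HMAC-SHA256 tag over (repo, commit SHA, run id, attempt) plus a digest of the result, with an in-memory set enforcing single use. Class and field names are hypothetical.

```python
import hashlib
import hmac
from typing import NamedTuple


class Binding(NamedTuple):
    repo: str
    commit_sha: str
    run_id: str
    attempt: int


class ReceiptVerifier:
    def __init__(self, key: bytes):
        self._key = key
        self._used = set()  # tags already redeemed

    def _tag(self, binding: Binding, result_digest: str) -> str:
        # Newline-joined fields; assumes no field itself contains a newline.
        msg = "\n".join([*map(str, binding), result_digest]).encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def issue(self, binding: Binding, result_digest: str) -> str:
        return self._tag(binding, result_digest)

    def redeem(self, binding: Binding, result_digest: str, tag: str) -> bool:
        expected = self._tag(binding, result_digest)
        if not hmac.compare_digest(expected, tag):
            return False  # forged, or bound to a different run
        if tag in self._used:
            return False  # replay of an already-redeemed receipt
        self._used.add(tag)
        return True
```

A receipt issued for attempt 2 fails verification against attempt 1, and a valid receipt verifies exactly once.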
Receipts the reader can see.
The evidentiary class of every claim is surfaced, not buried. Each finding renders a proof tier, so you know per finding what kind of thing you're being asked to believe. Receipts, not vibes — in the product surface itself.
Memory you train by talking to it.
Teach it in plain language — reply to a finding with “we use integer cents here” and Sigilix records the rule, applies it judiciously in later reviews, and attributes it inline (“applied because of a learned rule”) so the team sees why a call was made.
Every dismissal a team issues feeds a per-repository corpus, and recurring rejected finding classes are softened in later reviews. A false-positive endpoint tracks believability in production, making it a measured, regression-tested property — not an aspiration. It's the only part of the product a customer trains simply by disagreeing with it.
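How "softened" works is not specified here, so the following is only one possible reading: once a finding class has been dismissed a threshold number of times in a repository, later findings of that class are demoted by one severity level and marked as such. The threshold, the severity scale, and the field names are assumptions.

```python
from collections import Counter

LEVELS = ["low", "medium", "high"]  # hypothetical severity scale


def soften(findings, dismissed_classes, threshold=3):
    """Demote findings whose class was dismissed at least `threshold` times."""
    counts = Counter(dismissed_classes)
    result = []
    for finding in findings:
        if counts[finding["class"]] >= threshold:
            idx = max(LEVELS.index(finding["severity"]) - 1, 0)
            # Copy instead of mutating, and record why the level changed.
            finding = {**finding, "severity": LEVELS[idx], "softened": True}
        result.append(finding)
    return result
```

Marking the demotion on the finding keeps the adjustment visible, in the same spirit as attributing a learned rule inline.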
Every review deposits a layer the rest of the product spends.
The loop doesn't post a comment and forget. Each pass deposits a verified, machine-usable understanding of the repo — and the Sigilix CLI agent, Deep-Research Chat, and Triage all draw on it instead of re-crawling the codebase from cold. The believability you can't fake is the same substrate they run on.
Vector + lexical, kept current as reviews flow.
Which symbols call which, what depends on what.
What was verified real, what the team dismissed.
Conventions, past verdicts, learned per-repo rules.
Where security was probed, with the receipts.
A position, not a feature.
The 45–74% false-positive band the field records is the visible symptom of pipelines with none of these layers. A better model narrows nothing, because the layers aren't made of model. These gates can cost recall when mistuned — so every new gate ships shadow-first, and the harness measures exactly that trade at the seam where it occurs.
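Shadow-first can be read as: evaluate the new gate on real findings, record what it would have dropped, and change nothing that gets posted. A minimal sketch, assuming each finding has an `id` and that `labels` holds a ground-truth verdict per id (True for a real issue); the report keys are made up for the example.

```python
def shadow_report(findings, gate, labels):
    """Measure what a gate would drop without applying it."""
    would_drop = [f for f in findings if not gate(f)]
    return {
        "posted": len(findings),  # shadow mode posts everything
        "would_drop": len(would_drop),
        "false_positives_removed": sum(1 for f in would_drop if not labels[f["id"]]),
        "true_positives_lost": sum(1 for f in would_drop if labels[f["id"]]),
    }
```

The last two counts are the trade the paragraph describes: precision gained against recall lost, measured at the gate itself.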
See what it measures →