BELIEVABILITY · ARCHITECTURE, NOT PROMPT

A finding can't post unless it earns it.

The field's pipelines are two stages: a model speaks, and what it says gets posted. Ours is five — run by a pantheon of five specialists and disciplined at every seam. Believability isn't a feature bolted on; it is the pipeline. Each gate is built, hardened, and measured one at a time. A better model narrows none of it.

THE FIELD · 2 STAGES
model → comment

A diff goes in, a model speaks, what it says gets posted. The model is detector, judge, and witness at once.

SIGILIX · 5 STAGES
evidence → model → gates → receipts → memory

The deterministic layer is the witness; the model's job narrows to interpretation — the job models are actually good at.

ONE WEBHOOK · MANY PIPELINES

Not every event deserves the same pass.

A single GitHub or Linear event doesn't map to one action. A dispatcher routes each one to the pipeline it deserves — so a cheap PR overview lands in seconds instead of waiting behind the full five-specialist ensemble.

PR overview

A structured summary of the change, in seconds.

Ensemble review

The deep five-specialist pass with every gate.

CI-failure triage

Reads the job logs and the diff, and traces the failure to a grounded root cause.

Issue triage

Turns a vague Linear/Jira ticket into a priority, an estimate, and the exact failure path.

Describe · Improve

On a slash command: rewrite the PR body, or apply a fix.

@mention Q&A

Ask in a PR, issue, or Slack thread — answered, grounded in the repo.
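The routing idea above can be sketched as a lookup table. This is a minimal illustration, not Sigilix's dispatcher: the event names, the pipeline labels, and the `dispatch` function are all hypothetical.

```python
# Hypothetical event names and pipeline labels, for illustration only.
ROUTES: dict[tuple[str, str], list[str]] = {
    ("github", "pull_request.opened"): ["pr_overview", "ensemble_review"],
    ("github", "check_run.failed"): ["ci_failure_triage"],
    ("linear", "issue.created"): ["issue_triage"],
    ("github", "issue_comment.mention"): ["mention_qa"],
}


def dispatch(source: str, event: str) -> list[str]:
    """Return the pipelines an event fans out to; unknown events map to none."""
    return ROUTES.get((source, event), [])


# One event can fan out to several pipelines, cheapest listed first.
print(dispatch("github", "pull_request.opened"))
```

Listing the cheap overview ahead of the ensemble mirrors the point in the text: the fast pass does not have to wait behind the deep one.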

BEFORE JUDGMENT · REPO-AWARE

It reviews the diff in the context of the whole repo.

Diff-only review misses what the diff touches. Before any specialist runs, Sigilix expands context through the code graph — the callers, the dependencies, the symbols a change ripples into — alongside the index and everything prior reviews already learned. The specialists judge the change against the codebase, not a 200-line window.

Code graph

Walks callers, imports, and dependents — so a change is judged by what it actually affects.

Symbol-aware retrieval

Pulls the exact definitions and usages a finding depends on, not a fuzzy keyword grab.

Review memory

The conventions and verdicts this repo already taught it, carried into the new review.
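Context expansion through a code graph can be pictured as a bounded breadth-first walk from the changed symbols. The sketch below assumes a graph stored as a plain dictionary mapping each symbol to the symbols that depend on it; the function name, the hop limit, and the sample symbols are illustrative assumptions, not the product's data model.

```python
from collections import deque


def expand_context(
    graph: dict[str, set[str]], changed: set[str], max_hops: int = 2
) -> set[str]:
    """Breadth-first walk from changed symbols to everything within max_hops edges."""
    seen = set(changed)
    queue = deque((symbol, 0) for symbol in changed)
    while queue:
        symbol, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not walk past the hop limit
        for neighbor in graph.get(symbol, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Hypothetical edges: symbol -> symbols that call or import it.
graph = {
    "calc_tax": {"checkout"},
    "checkout": {"api_handler"},
    "api_handler": {"router"},
}
print(sorted(expand_context(graph, {"calc_tax"})))
```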

THE PANTHEON · FOUR + ONE

Five specialists run the review.

A single generalist model hopes. The pantheon divides the surface area no one model covers alone — each tuned for one class of failure, each bound by a contract that turns plausible-sounding fabrication into a missing required field.

LOGIC SPECIALIST

Metis

Dead code, naming violations, logic errors a type-checker can't see, unreachable branches, and the test-coverage gaps that hide them. Metis reads intent, not just syntax.

checkout.ts:45 · proof: grounded
[Metis] unreachable branch — early return bypasses tax calc when total < 0. Remove or handle as an error path.
contract · cites the exact line & branch it claims

SECURITY SPECIALIST

Argus

Secret leakage, unsanitized inputs, auth bypasses, SSRF, insecure regexes, and OWASP-relevant patterns. Nothing crosses Argus without a citation.

fetcher.ts:71 · CWE-918 · proof: verified
[Argus] potential SSRF — user-supplied URL passed to fetch() with no allowlist. Validate against approvedHosts[].
contract · high severity carries a CWE id + manifest evidence id

PERFORMANCE SPECIALIST

Iris

N+1 queries, unnecessary re-renders, memory leaks, unbounded recursion, and Big-O regressions. Iris finds the hot path before production does.

Table.tsx:112 · O(n²) · proof: grounded
[Iris] sorting inside render() scales with row count. Memoize with useMemo keyed by sortKey.
contract · names a concrete hot path or states a complexity estimate

TESTS SPECIALIST

Eunomia

Missing edge-case coverage, flaky assertions, untested failure paths, and brittle fixtures. Eunomia holds the diff to the contract it claims to satisfy.

run_knip.sh · tests: 2 inspected
[Eunomia] no test simulates a scan failure — partial findings could be dropped silently. Add a case to knip_workflow_test.py.
contract · names the exact test files inspected

THE SYNTHESIZER

Harmonia

Harmonia doesn't find bugs; he resolves the noise of the other four. He deduplicates overlapping findings, suppresses false positives by cross-reference, ranks by merge-blocking impact, and writes the single comment you actually read. He is also bound in the opposite direction: he may drop a high-severity specialist finding only by quoting a proof-like contradiction from the evidence. A finding dropped on nothing more than a generic “looks fine” is restored deterministically by the floor.

unified 6 → 1 · severity ↓ warning · 1 comment · grounded
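The floor on the synthesizer can be sketched as a deterministic pass that runs after synthesis. Everything here is an assumption made for illustration: the `Finding` shape, the severity labels, and the rule that a drop is justified only when the quoted text appears verbatim in the evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    id: str
    severity: str  # assumed labels: "high", "warning", "info"


def apply_floor(
    specialist: list[Finding],
    kept_ids: set[str],
    drop_quotes: dict[str, str],
    evidence_text: str,
) -> set[str]:
    """Restore high-severity findings dropped without a quote found in the evidence."""
    restored = set(kept_ids)
    for finding in specialist:
        if finding.id in kept_ids or finding.severity != "high":
            continue
        quote = drop_quotes.get(finding.id, "").strip()
        # A drop stands only if its quote is literally present in the evidence.
        if not quote or quote not in evidence_text:
            restored.add(finding.id)
    return restored


specialist = [Finding("a", "high"), Finding("b", "high"), Finding("c", "info")]
drop_quotes = {"a": "looks fine", "b": "host in approvedHosts"}
evidence = "if host in approvedHosts: fetch(url)"
print(apply_floor(specialist, set(), drop_quotes, evidence))
```

In this sample, finding `a` comes back because “looks fine” quotes nothing from the evidence, while `b` stays dropped because its quote resolves.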
THE FIVE-STAGE PIPELINE

The architecture that disciplines them.

The specialists never get the last word. Believability is enforced at five seams — before the model, around it, after it, at the surface, and over time.

01
BEFORE THE MODEL

Deterministic evidence first.

No specialist is the first authority on what a diff contains. Before any model runs, a deterministic pass computes the evidence the models will be held to — secret scans, dependency-vulnerability lookups, AST rule hits, static-analysis findings — assembled into per-review manifests. A specialist can't assert a credential is exposed in the abstract; it must cite the manifest.

secret scan · dep-vuln · AST rules · static analysis
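The "must cite the manifest" rule can be sketched as a lookup that either resolves or fails. The manifest layout, the field names, and the `cites_manifest` helper are hypothetical; the real manifest format is not described here.

```python
def cites_manifest(finding: dict, manifest: dict[str, dict]) -> bool:
    """A finding stands only if its evidence id resolves to an entry for the same file."""
    entry = manifest.get(finding.get("evidence_id", ""))
    return entry is not None and entry["file"] == finding["file"]


# Hypothetical per-review manifest produced by the deterministic pass.
manifest = {
    "ev-1": {"kind": "secret_scan", "file": "config.ts", "line": 12},
}

grounded = {"file": "config.ts", "evidence_id": "ev-1", "claim": "credential exposed"}
abstract = {"file": "config.ts", "claim": "credential exposed"}

print(cites_manifest(grounded, manifest), cites_manifest(abstract, manifest))
```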
02
AROUND THE MODEL

Provenance contracts.

Synthesis is where hallucination would enter, so it's bound by structural contracts, not instructions. Every synthesized finding must cite the identifiers of the real specialist findings it derives from, and a structural check drops any finding whose citations don't resolve. The synthesizer can't introduce a finding no specialist produced, because a fabrication has nothing to cite. This is why the count of findings hallucinated at synthesis is zero, not merely low.

security · high → requires CWE id + manifest evidence id
tests → must name the exact test files inspected
performance → concrete hot path or complexity estimate
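The contracts listed above amount to required-field checks plus a citation check on synthesis. The sketch below is one way to express them; the field names (`cwe_id`, `evidence_id`, `test_files`, `hot_path`, `complexity`, `derived_from`) are assumptions chosen to match the wording of the contracts, not a documented schema.

```python
def contract_violations(finding: dict) -> list[str]:
    """List the required fields a specialist finding is missing."""
    missing: list[str] = []
    kind = finding.get("specialist")
    if kind == "security" and finding.get("severity") == "high":
        missing += [f for f in ("cwe_id", "evidence_id") if not finding.get(f)]
    elif kind == "tests":
        if not finding.get("test_files"):
            missing.append("test_files")
    elif kind == "performance":
        if not (finding.get("hot_path") or finding.get("complexity")):
            missing.append("hot_path or complexity")
    return missing


def resolve_synthesis(synth: list[dict], specialist_ids: set[str]) -> list[dict]:
    """Keep synthesized findings whose citations are non-empty and all resolve."""
    return [
        s
        for s in synth
        if s.get("derived_from") and set(s["derived_from"]) <= specialist_ids
    ]


ssrf = {"specialist": "security", "severity": "high", "cwe_id": "CWE-918"}
print(contract_violations(ssrf))  # the manifest evidence id is still missing
```

Expressing the contract as a missing field, not as an instruction, means a fabricated finding fails a structural check instead of relying on the model to behave.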
03
AFTER THE MODEL, BEFORE POSTING

Refutation, then execution.

Each specialist must try to refute its own candidates against the code; a candidate refuted this way is recorded with the exact mitigating token quoted from source. Then findings that admit a runtime check are actually run, and the model's opinion loses to the program's behavior. Execution receipts are cryptographically bound to (repo, commit SHA, run id, attempt) with single-use enforcement, so a receipt can't be replayed or forged.

self-refute · execute & demote · single-use receipts
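One common way to bind a receipt to (repo, commit SHA, run id, attempt) and enforce single use is an HMAC over those fields plus a record of redeemed receipts. The sketch below uses that technique as an assumption; the actual binding scheme is not specified here, and `ReceiptLedger` is a hypothetical name.

```python
import hashlib
import hmac


class ReceiptLedger:
    """Issues HMAC receipts bound to a run and refuses to redeem one twice."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._redeemed: set[str] = set()  # in-memory only, for illustration

    def issue(self, repo: str, sha: str, run_id: str, attempt: int) -> str:
        # NUL separators keep ("ab", "c") distinct from ("a", "bc").
        message = "\x00".join([repo, sha, run_id, str(attempt)]).encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def redeem(self, receipt: str, repo: str, sha: str, run_id: str, attempt: int) -> bool:
        expected = self.issue(repo, sha, run_id, attempt)
        if not hmac.compare_digest(receipt.encode(), expected.encode()):
            return False  # forged, or bound to a different run
        if receipt in self._redeemed:
            return False  # replay of an already used receipt
        self._redeemed.add(receipt)
        return True


ledger = ReceiptLedger(b"demo-key")
receipt = ledger.issue("org/repo", "abc123", "run-7", 1)
print(ledger.redeem(receipt, "org/repo", "abc123", "run-7", 1))
print(ledger.redeem(receipt, "org/repo", "abc123", "run-7", 1))
```

The first redemption succeeds and the second fails, which is the replay protection. A receipt presented against a different commit fails the HMAC comparison.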
04
AFTER POSTING

Receipts the reader can see.

The evidentiary class of every claim is surfaced, not buried. Each finding renders a proof tier, so you know per finding what kind of thing you're being asked to believe. Receipts, not vibes — in the product surface itself.

VERIFIED
execution-backed
GROUNDED
carried by deterministic evidence
MODEL
model judgment alone
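The three tiers can be read as a precedence rule: execution beats deterministic evidence, which beats model judgment alone. The function below is a sketch of that reading; the precedence order and the parameter names are assumptions.

```python
def proof_tier(has_execution_receipt: bool, manifest_evidence_ids: list[str]) -> str:
    """Pick the strongest evidentiary class a finding can claim."""
    if has_execution_receipt:
        return "VERIFIED"  # execution-backed
    if manifest_evidence_ids:
        return "GROUNDED"  # carried by deterministic evidence
    return "MODEL"  # model judgment alone


print(proof_tier(False, ["ev-1"]))
```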
05
AFTER, OVER TIME

Memory you train by talking to it.

Teach it in plain language — reply to a finding with “we use integer cents here” and Sigilix records the rule, applies it judiciously in later reviews, and attributes it inline (“applied because of a learned rule”) so the team sees why a call was made.

Every dismissal a team issues feeds a per-repository corpus, and recurring rejected finding classes are softened in later reviews. A false-positive endpoint tracks believability in production, making it a measured, regression-tested property — not an aspiration. It's the only part of the product a customer trains simply by disagreeing with it.
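Softening a recurring rejected finding class can be sketched as a per-repository dismissal counter with a threshold. The threshold of three, the one-step demotion, and the severity labels are illustrative assumptions; the text does not state how softening is computed.

```python
from collections import Counter

SEVERITY_ORDER = ["info", "warning", "high"]  # assumed labels, lowest first


def soften(finding_class: str, severity: str, dismissals: Counter, threshold: int = 3) -> str:
    """Demote one severity level once a class has been dismissed `threshold` times."""
    if dismissals[finding_class] < threshold:
        return severity
    index = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[max(index - 1, 0)]


# Hypothetical per-repository corpus: the team dismissed "naming" findings three times.
dismissals = Counter({"naming": 3, "ssrf": 0})
print(soften("naming", "warning", dismissals), soften("ssrf", "high", dismissals))
```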

THE BYPRODUCT · EARNED CONTEXT

Every review deposits a layer the rest of the product spends.

The loop doesn't post a comment and forget. Each pass deposits a verified, machine-usable understanding of the repo — and the Sigilix CLI agent, Deep-Research Chat, and Triage all draw on it instead of re-crawling the codebase from cold. The believability you can't fake is the same substrate they run on.

Index

Vector + lexical, kept current as reviews flow.

Code graph

Which symbols call which, what depends on what.

Trust ledger

What was verified real, what the team dismissed.

Review memory

Conventions, past verdicts, learned per-repo rules.

Evidence manifests

Where security was probed, with the receipts.

A position, not a feature.

The 45–74% false-positive band the field records is the visible symptom of pipelines with none of these layers. A better model narrows nothing, because the layers aren't made of model. These gates can cost recall when mistuned — so every new gate ships shadow-first, and the harness measures exactly that trade at the seam where it occurs.
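Shipping a gate shadow-first can be sketched as running it without letting it drop anything, then measuring how many confirmed-real findings it would have removed. The function names and the finding shape below are hypothetical, and the recall-cost definition is one reasonable reading of "that trade".

```python
from typing import Callable


def run_gate(
    findings: list[dict], gate: Callable[[dict], bool], shadow: bool
) -> tuple[list[dict], list[dict]]:
    """Return (posted, would_drop); in shadow mode nothing is actually dropped."""
    would_drop = [f for f in findings if not gate(f)]
    posted = list(findings) if shadow else [f for f in findings if gate(f)]
    return posted, would_drop


def recall_cost(would_drop: list[dict], confirmed_real_ids: set[str]) -> float:
    """Fraction of confirmed-real findings the gate would have removed."""
    if not confirmed_real_ids:
        return 0.0
    lost = sum(1 for f in would_drop if f["id"] in confirmed_real_ids)
    return lost / len(confirmed_real_ids)


findings = [{"id": "a", "cited": True}, {"id": "b", "cited": False}]
posted, would_drop = run_gate(findings, lambda f: f["cited"], shadow=True)
print(len(posted), recall_cost(would_drop, {"a", "b"}))
```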
