Sigilix

Benchmarks

Benchmarks worth trusting.

We will not show you a 97% accuracy banner with an asterisk. Real benchmarks take time. While we run our evaluation pipeline against open-source repositories and private shadow-mode telemetry, here is exactly what we measure, how we measure it, and when we will publish.

01 Dataset

What we test against.

Our evaluation uses three data sources. First, a stratified sample of 50 high-velocity open-source repositories (multi-language, >1000 commits, active issue trackers). Second, synthetic vulnerability injection using OWASP-style test cases where ground truth is known. Third, historical bug-fix commits: we know the bug was real because it was later patched and linked to a CVE or issue. A sketch of how we record ground truth follows the cards below.

50

Repositories

Multi-language, >1000 commits, active maintainers.

OWASP

Synthetic injection

Known-truth vulnerabilities planted into test PRs.

Known-bug

Historical commits

Patched bugs from real CVEs and issue trackers.
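To make the labeling concrete, here is a minimal sketch of how a ground-truth record from these three sources could be represented. Everything in it is illustrative: the `Source` labels, the field names, and the example entry are our own stand-ins, not the published protocol.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical record format for the evaluation corpus. The class, field,
# and label names below are our illustration, not the published protocol.
class Source(Enum):
    OSS_REPO = "oss_repo"          # stratified open-source sample
    SYNTHETIC = "synthetic_owasp"  # injected OWASP-style vulnerability
    HISTORICAL = "historical_fix"  # bug-fix commit linked to a CVE/issue

@dataclass
class GroundTruthBug:
    source: Source
    repo: str        # e.g. "org/project"
    file: str
    line: int
    bug_class: str   # e.g. "sql-injection", "off-by-one"
    evidence: str    # CVE ID, issue URL, or injection manifest entry

# A historical entry counts as ground truth because a later commit patched
# it and that commit links back to a CVE or tracked issue.
example = GroundTruthBug(
    source=Source.HISTORICAL,
    repo="org/project",         # placeholder
    file="src/auth.py",         # placeholder
    line=142,
    bug_class="sql-injection",
    evidence="CVE-XXXX-XXXXX",  # placeholder ID, illustrative only
)
```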

02 Metrics

What "better" actually means.

We measure four dimensions. Precision: of the findings we flag, what percentage are correct? Recall: of the known bugs in the dataset, what percentage did we catch? Noise rate: false positives per 100 lines changed. Actionability: does the comment include a file, a line, and a suggested fix, or is it vague advice? The arithmetic behind each dimension is sketched after the cards below.

Precision

Of the findings Sigilix flags, what fraction is technically correct? Measured by independent annotation against known truth.

Recall

Of known bugs in the dataset, what fraction did Sigilix surface? Caught vs. missed, with explicit per-class breakdown.

Noise rate

False positives per 100 lines changed. The single most important metric for whether reviewers turn off the bot.

Actionability

Does each finding name a file, a line, and a concrete fix — or is it vague advice the author can't act on?
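For reference, each dimension reduces to simple arithmetic over matched findings. This is a minimal sketch: the function signatures, the confusion-count inputs, and the dict keys ("file", "line", "fix") are our assumptions, not the audited scoring code.

```python
# Illustrative arithmetic only: tp/fp/fn counts come from independent
# annotation against known truth; this is not the audited scoring code.

def precision(tp: int, fp: int) -> float:
    """Of all findings flagged, the fraction that are technically correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Of all known bugs in the dataset, the fraction surfaced."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def noise_rate(fp: int, lines_changed: int) -> float:
    """False positives per 100 lines changed."""
    return 100.0 * fp / lines_changed if lines_changed else 0.0

def actionability(findings: list[dict]) -> float:
    """Fraction of findings that name a file, a line, and a concrete fix.

    Assumes each finding is a dict with optional "file", "line", and "fix"
    keys; that shape is our assumption, not a documented format.
    """
    if not findings:
        return 0.0
    actionable = sum(
        1 for f in findings
        if f.get("file") and f.get("line") is not None and f.get("fix")
    )
    return actionable / len(findings)
```

Normalizing noise per 100 changed lines keeps small and large PRs comparable, which is why it is the number we watch for reviewer fatigue.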

03 Baseline

Controlled comparison.

We compare Sigilix against single-agent GPT-4o (same temperature, same context-window limit) and a human-review baseline, including review latency. We disclose our full testing conditions: prompt versions, model snapshots, retrieval depth, temperature settings. When we publish, you will be able to reproduce our results, or prove us wrong. A sketch of what such a disclosure looks like follows the table.
Dimension       Sigilix     Single-agent GPT-4o    Human baseline
Precision       Q3 audit    TBD                    TBD
Recall          Q3 audit    TBD                    TBD
Noise rate      Q3 audit    TBD                    TBD
Actionability   Q3 audit    TBD                    TBD
Latency         Q3 audit    TBD                    TBD

Every cell gets a real number when the Q3 2026 audit publishes. We will not invent placeholders.
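To show what "full testing conditions" means in practice, a disclosed run could pin every knob that can move the numbers in a manifest like the one below. All keys and values here are placeholders of our own; the published audit will carry the real pins.

```python
# Hypothetical disclosure manifest: every setting that can move the numbers
# is pinned so a third party can re-run the comparison under identical
# conditions. All values below are placeholders, not the real pins.
RUN_MANIFEST = {
    "prompt_version": "<pinned prompt hash>",
    "model_snapshot": "<pinned model snapshot>",
    "temperature": 0.0,                      # held equal across all systems
    "context_window_limit": "<shared cap>",  # same limit for every system
    "retrieval_depth": "<pinned depth>",
    "dataset_revision": "<pinned dataset hash>",
}
```

Holding temperature and the context-window cap equal across systems is what makes the comparison controlled rather than anecdotal.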

04 Live preview

Internal telemetry preview.

These are unverified internal metrics from our shadow-mode pipeline over the last 30 days. We are sharing them because hiding everything feels worse than sharing honestly. Treat them as directional, not scientific. A sketch of how the roll-up is computed follows the cards.
0

Reviews processed

0.0

Avg findings per PR

0%

Hallucination rate

Down from 34% in v0.9

0

Customer code retained

These numbers are not audited. They are generated from our own pipeline logs over the trailing 30-day window. The true benchmark is coming — see Section 05 for the timeline.
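For transparency about how the preview numbers above are produced: the roll-up is a plain trailing-window aggregation over pipeline logs. A minimal sketch, assuming one log record per processed review with the fields shown; the record schema is our illustration, not the production pipeline.

```python
from datetime import datetime, timedelta, timezone

# Assumed log shape: one dict per processed review with a timezone-aware
# "timestamp", a "num_findings" count, and a "num_hallucinated" count.
# That schema is our illustration, not the production pipeline.
def trailing_30_day_stats(log_records: list[dict]) -> dict:
    """Roll up shadow-mode logs over the trailing 30-day window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    window = [r for r in log_records if r["timestamp"] >= cutoff]
    reviews = len(window)
    findings = sum(r["num_findings"] for r in window)
    hallucinated = sum(r["num_hallucinated"] for r in window)
    return {
        "reviews_processed": reviews,
        "avg_findings_per_pr": findings / reviews if reviews else 0.0,
        "hallucination_rate": hallucinated / findings if findings else 0.0,
    }
```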

05 Commitment

When you can trust the scoreboard.

We will publish a third-party audit of the Sigilix Benchmark Protocol by Q3 2026. The audit will cover precision, recall, and noise on the datasets described above, evaluated by an independent ML engineering firm. If you want the report when it drops, join the list. If you think our methodology is flawed, email us — we will update this page.

Get the audit report

We'll email the published audit when it lands — and only that. No newsletter, no drip campaign.

Notify me

Honest measurement is part of the product.

If we promised numbers we couldn't defend, we'd burn the trust we're trying to earn.

Last updated 2026-05-04