Benchmarks
Benchmarks worth trusting.
We will not show you a 97% accuracy banner with an asterisk. Real benchmarks take time. While we run our evaluation pipeline against open-source repositories and private shadow-mode telemetry, here is exactly what we measure, how we measure it, and when we will publish.
What we test against.
50
Repositories
Multi-language, >1000 commits, active maintainers.
OWASP
Synthetic injection
Known-truth vulnerabilities planted into test PRs.
Known-bug
Historical commits
Patched bugs from real CVEs and issue trackers.
What "better" actually means.
Precision
Of findings flagged as valid, what fraction is technically correct? Measured by independent annotation against known truth.
Recall
Of known bugs in the dataset, what fraction did Sigilix surface? Caught vs. missed, with explicit per-class breakdown.
Noise rate
False positives per 100 lines changed. The single most important metric for whether reviewers turn off the bot.
Actionability
Does each finding name a file, a line, and a concrete fix — or is it vague advice the author can't act on?
Controlled comparison.
| Dimension | Sigilix | Single-agent GPT-4o | Human baseline |
|---|---|---|---|
| Precision | Q3 audit | TBD | TBD |
| Recall | Q3 audit | TBD | TBD |
| Noise rate | Q3 audit | TBD | TBD |
| Actionability | Q3 audit | TBD | TBD |
| Latency | Q3 audit | TBD | TBD |
All cells return numbers when the Q3 2026 audit publishes. We will not invent placeholders.
Internal telemetry preview.
Reviews processed
Avg findings per PR
Hallucination rate
Down from 34% in v0.9
Customer code retained
These numbers are not audited. They are generated from our own pipeline logs over the trailing 30-day window. The true benchmark is coming — see Section 05 for the timeline.
When you can trust the scoreboard.
Get the audit report
We'll email the published audit when it lands — and only that. No newsletter, no drip campaign.
Honest measurement is part of the product.
If we promised numbers we couldn't defend, we'd burn the trust we're trying to earn.