Public eval harness
A deterministic regression harness over the AiSOC substrate — the keyword extractors, the in-harness fusion grouping (a faithful re-implementation of the production Tier 1/2/3 logic in services/fusion, minus the DB-backed dedup and ML scoring), the report and response templates, and the offline judges that grade them. The dataset, the harness, and the CI gate are in the repo. The numbers on this page reproduce in roughly 25 ms on a laptop.
Latest results
Four metrics, four CI gates. A regression on any gate blocks the build. The numbers below come from the most recent successful run on main. Each card describes what the metric measures and what it does not.
Reduction ratio
Alert reduction
A 1,000-alert noisy stream (duplicates, near-duplicates, rule storms, low-score chatter) is fed into an in-harness re-implementation of the production Tier 1 / 2 / 3 grouping rules — same logic, no DB-backed dedup or ML scorer. The number is whatever the code produces; a regression in the grouping rules moves it.
Tactic accuracy
MITRE tactic accuracy
Each synthetic incident is generated with a labeled tactic and a description written to include keywords the hand-curated extractor recognizes. The 97% mostly checks that dataset and extractor agree — useful as a regression sentinel for the extractor, not a measure of LLM agent accuracy.
Mean keyword coverage
Investigation completeness
The simulator wraps each incident's description in a Markdown report; the judge then looks for evidence keywords drawn from that same description. Close to a string-copy tautology — it confirms the report template includes the description and the judge can find keywords inside it. Catches template breakage, not LLM quality.
Mean rubric score
Response-plan quality
The synthesizer embeds the expected MITRE techniques and first evidence keyword directly into the templated plan, then a 5-criterion rubric checks for them. By construction the score is ~1.000. Catches a broken templating pipeline; it is not a grade of LLM-written plans.
What each suite measures
Alert reduction (75.3%)
Real measurement. A 1,000-alert noisy stream with duplicates, near-duplicates, rule storms, and benign chatter is fabricated deterministically, then passed through fuse_alerts — an in-harness re-implementation of the same Tier 1 / 2 / 3 merge windows and score floor that the production fusion service runs. The grouping logic is identical; the harness skips the DB-backed deduplicator and ML scorer that ride on top in production. The reduction ratio is whatever the harness code emits. This is a legitimate measurement of the grouping logic, and a regression in those rules will move the number.
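The production fuse_alerts lives in services/fusion; the sketch below is only an illustration of the general shape of tier-window grouping with a score floor. The field names (`rule_id`, `ts`, `score`), the window size, and the floor value are assumptions, not the repo's actual parameters.

```python
from collections import defaultdict

def fuse_alerts(alerts, window_s=300, score_floor=0.2):
    """Hypothetical sketch: drop low-score alerts, then merge alerts
    that share a rule id and land in the same coarse time window."""
    kept = [a for a in alerts if a["score"] >= score_floor]
    groups = defaultdict(list)
    for a in kept:
        bucket = a["ts"] // window_s  # coarse merge window
        groups[(a["rule_id"], bucket)].append(a)
    # One representative incident per group: the highest-scoring alert.
    incidents = [max(g, key=lambda a: a["score"]) for g in groups.values()]
    reduction = 1 - len(incidents) / len(alerts)
    return incidents, reduction
```

A rule storm of ten identical alerts inside one window collapses to one incident, giving a 0.9 reduction ratio — the same quantity the harness gates on.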
MITRE tactic accuracy (97.0%)
Substrate self-consistency. Each synthetic incident is generated with a tactic label, and its description is written to include keywords that the hand-curated extractor recognizes. The 97% is therefore largely a check that the dataset and the extractor agree with each other, not a measure of LLM-agent accuracy. The gate still has value as a regression sentinel: a misnamed tactic, a typo in the keyword table, or a lost tactic will fail it.
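The actual keyword tables live in the repo; as a sketch of the mechanism, a keyword extractor of this kind reduces to a lookup over lowercased description text. The tactic names and keyword lists below are illustrative placeholders, not the harness's real tables.

```python
# Illustrative keyword table; the real one is hand-curated in the repo.
TACTIC_KEYWORDS = {
    "initial-access": ["phishing", "spearphish", "drive-by"],
    "lateral-movement": ["psexec", "pass-the-hash", "wmi exec"],
    "exfiltration": ["dns tunnel", "exfil", "large outbound transfer"],
}

def extract_tactic(description):
    """Return the first tactic whose keyword list matches the description."""
    text = description.lower()
    for tactic, words in TACTIC_KEYWORDS.items():
        if any(w in text for w in words):
            return tactic
    return None

def tactic_accuracy(incidents):
    """Fraction of incidents where the extractor reproduces the label."""
    hits = sum(extract_tactic(i["description"]) == i["tactic"] for i in incidents)
    return hits / len(incidents)
```

Because the dataset generator writes descriptions that contain these keywords, the metric is self-consistent by design; a typo in the table drops it, which is exactly the regression the gate catches.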
Investigation completeness (94.3%)
Substrate self-consistency. The simulator wraps the incident description in a Markdown report, and the judge looks for evidence keywords inside it. Those evidence keywords are drawn from the description, so the gate confirms that the report template includes the description and that the judge can find the keywords. It catches breakage in the report template (for example a missing Summary section) but does not grade an LLM-written investigation.
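The near-tautology is easiest to see in code. This is a minimal stand-in for the simulator's template and the coverage judge, with an invented report layout; the point is only that the description flows into the report verbatim, so keywords drawn from it are found by construction.

```python
def render_report(incident):
    # Hypothetical stand-in for the simulator's Markdown template: the
    # description is embedded verbatim, which is what makes the metric
    # near-tautological.
    return f"# Investigation report\n\n## Summary\n{incident['description']}\n"

def keyword_coverage(report, evidence_keywords):
    """Fraction of expected evidence keywords found anywhere in the report."""
    text = report.lower()
    found = sum(kw.lower() in text for kw in evidence_keywords)
    return found / len(evidence_keywords)
```

If the template silently drops the Summary section, coverage falls and the gate fires — which is the one failure mode this metric genuinely detects.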
Response-plan quality (1.000)
Substrate self-consistency. The synthesizer embeds the expected MITRE techniques and the first evidence keyword directly into the templated plan, and the rubric judge checks for them. The score is ~1.000 by construction. This catches a broken templating pipeline (for example, the synthesizer silently dropping an action class) but is not a grade of LLM output. The 1.000 is a green regression-gate signal, not a quality measurement.
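A rubric judge of this kind is a handful of substring checks averaged into a score. The five criteria below are illustrative placeholders, not the harness's real rubric; they show why a templated plan that embeds all the expected strings scores 1.0 by construction.

```python
def rubric_score(plan, expected_techniques, first_keyword):
    """Five illustrative criteria, each worth 0.2 of the score."""
    checks = [
        all(t in plan for t in expected_techniques),  # MITRE techniques cited
        first_keyword in plan,                        # evidence keyword cited
        "Contain" in plan,                            # containment action present
        "Eradicate" in plan,                          # eradication action present
        "Recover" in plan,                            # recovery action present
    ]
    return sum(checks) / len(checks)
```

The synthesizer writes all five items into the plan, so the only way to score below 1.0 is for the templating pipeline to drop one — the failure the gate exists to catch.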
The next milestone is an online eval: nightly runs that drive the real LangGraph agent against the same dataset, with an LLM-as-judge gated by OPENAI_API_KEY. That is the run where actual agent accuracy is measured. Tracking issue: github.com/beenuar/AiSOC/issues.
Reproduce these numbers
No Docker, no API key, no GPU, no LLM call. The harness is deterministic and runs in roughly 25 ms.
git clone https://github.com/beenuar/AiSOC && cd AiSOC
python3 scripts/run_evals.py
Expected output:
============================================================================
AiSOC Pillar-1 Eval - 200-incident synthetic benchmark
============================================================================
[PASS] mitre_accuracy accuracy 0.970 (target >= 0.80)
[PASS] alert_reduction reduction_ratio 0.753 (target >= 0.70)
[PASS] investigation_completeness mean_keyword_coverage 0.943 (target >= 0.85)
[PASS] response_quality mean_rubric_score 1.000 (target >= 0.80)
============================================================================
ALL GATES PASSED
For machine-readable output, pass --json or --ci --out report.json (the latter also exits non-zero on regression).
Comparison to other AI SOC offerings
Where a vendor publishes a number or a verifiable capability, it is cited. Where a vendor does not, the row is marked absent.
| Product | Alert reduction | MITRE accuracy gate | Decision audit | Self-host | Reproducible harness |
|---|---|---|---|---|---|
| AiSOC (open) | 75.3% (measured on fixed noisy stream) | 97% (substrate regression gate) | Per-step ledger | Yes (MIT) | Yes — every PR to main / develop |
| Closed-source AI SOC | Vendor claim, no harness | Not published | Vendor portal | No (cloud only) | No |
| Closed-source SOAR | N/A (SOAR) | Not applicable | Run history | On-prem option | No published harness |
A self-hostable, MIT-licensed agent with a published regression harness can be reviewed directly by an auditor. Vendor cloud agents typically cannot be reviewed at the same level.
What this is not
- No LLM agent runs here. The harness exercises deterministic substrate code: extractors, fusion, templates, and keyword judges. The live LangGraph orchestrator (services/agents/app/investigator/) is not invoked. An online eval that drives it nightly is on the roadmap.
- The dataset is synthetic. 200 incidents are enough to flag substrate regressions but not enough to claim production parity. Federated, opt-in real-customer evaluation is on the roadmap.
- The judges are keyword-based. They can be gamed by template-stuffing. In three of the four suites the templates already include the keywords the judge looks for, which is why those suites are labelled substrate self-consistency rather than agent quality. The LLM-as-judge variant is the follow-up.
- “Public eval harness” means this harness, not a third-party leaderboard. No outside body grades AiSOC. The dataset, the code, and the gates are open and CI-enforced, and anyone can run, audit, or extend the harness.
Contributing to the harness
New fixtures for missed tactics or fusion edge cases, replacements for tautological judges, and the online LLM-as-judge variant are all in scope for contributions.