Public eval harness
A deterministic regression harness over the AiSOC substrate — the keyword extractors, the in-harness fusion grouping (a faithful re-implementation of the production Tier 1/2/3 logic in services/fusion, minus the DB-backed dedup and ML scoring), the report and response templates, and the offline judges that grade them. The dataset, the harness, and the CI gate are in the repo. The numbers on this page reproduce in roughly 25 ms on a laptop.
Latest results
Four metrics, four CI gates. A regression on any gate blocks the build. The numbers below come from the most recent successful run on main. Each card describes what the metric measures and what it does not.
Reduction ratio
Alert reduction
A 1,000-alert noisy stream (duplicates, near-duplicates, rule storms, low-score chatter) is fed into an in-harness re-implementation of the production Tier 1 / 2 / 3 grouping rules — same logic, no DB-backed dedup or ML scorer. The number is whatever the code produces; a regression in the grouping rules moves it.
Tactic accuracy
MITRE tactic accuracy
Each synthetic incident is generated with a labeled tactic and a description written to include keywords the hand-curated extractor recognizes. The 97% mostly checks that dataset and extractor agree — useful as a regression sentinel for the extractor, not a measure of LLM agent accuracy.
Mean keyword coverage
Investigation completeness
The simulator wraps each incident's description in a Markdown report; the judge then looks for evidence keywords drawn from that same description. Close to a string-copy tautology — it confirms the report template includes the description and the judge can find keywords inside it. Catches template breakage, not LLM quality.
Mean rubric score
Response-plan quality
The synthesizer embeds the expected MITRE techniques and first evidence keyword directly into the templated plan, then a 5-criterion rubric checks for them. By construction the score is ~1.000. Catches a broken templating pipeline; it is not a grade of LLM-written plans.
What each suite measures
Alert reduction (75.3%)
Real measurement. A 1,000-alert noisy stream with duplicates, near-duplicates, rule storms, and benign chatter is fabricated deterministically, then passed through fuse_alerts — an in-harness re-implementation of the same Tier 1 / 2 / 3 merge windows and score floor that the production fusion service runs. The grouping logic is identical; the harness skips the DB-backed deduplicator and ML scorer that ride on top in production. The reduction ratio is whatever the harness code emits. This is a legitimate measurement of the grouping logic, and a regression in those rules will move the number.
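The production fuse_alerts lives in services/fusion; the sketch below is only an illustration of the general shape of tier-window grouping with a score floor. The field names (`rule_id`, `ts`, `score`), the window size, and the floor value are assumptions, not the repo's actual parameters.

```python
from collections import defaultdict

def fuse_alerts(alerts, window_s=300, score_floor=0.2):
    """Hypothetical sketch: drop low-score alerts, then merge alerts
    that share a rule id and land in the same coarse time window."""
    kept = [a for a in alerts if a["score"] >= score_floor]
    groups = defaultdict(list)
    for a in kept:
        bucket = a["ts"] // window_s  # coarse merge window
        groups[(a["rule_id"], bucket)].append(a)
    # One representative incident per group: the highest-scoring alert.
    incidents = [max(g, key=lambda a: a["score"]) for g in groups.values()]
    reduction = 1 - len(incidents) / len(alerts)
    return incidents, reduction
```

A rule storm of ten identical alerts inside one window collapses to one incident, giving a 0.9 reduction ratio — the same quantity the harness gates on.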
MITRE tactic accuracy (97.0%)
Substrate self-consistency. Each synthetic incident is generated with a tactic label, and its description is written to include keywords that the hand-curated extractor recognizes. The 97% is therefore largely a check that the dataset and the extractor agree with each other, not a measure of LLM-agent accuracy. The gate still has value as a regression sentinel: a misnamed tactic, a typo in the keyword table, or a lost tactic will fail it.
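The actual keyword tables live in the repo; as a sketch of the mechanism, a keyword extractor of this kind reduces to a lookup over lowercased description text. The tactic names and keyword lists below are illustrative placeholders, not the harness's real tables.

```python
# Illustrative keyword table; the real one is hand-curated in the repo.
TACTIC_KEYWORDS = {
    "initial-access": ["phishing", "spearphish", "drive-by"],
    "lateral-movement": ["psexec", "pass-the-hash", "wmi exec"],
    "exfiltration": ["dns tunnel", "exfil", "large outbound transfer"],
}

def extract_tactic(description):
    """Return the first tactic whose keyword list matches the description."""
    text = description.lower()
    for tactic, words in TACTIC_KEYWORDS.items():
        if any(w in text for w in words):
            return tactic
    return None

def tactic_accuracy(incidents):
    """Fraction of incidents where the extractor reproduces the label."""
    hits = sum(extract_tactic(i["description"]) == i["tactic"] for i in incidents)
    return hits / len(incidents)
```

Because the dataset generator writes descriptions that contain these keywords, the metric is self-consistent by design; a typo in the table drops it, which is exactly the regression the gate catches.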
Investigation completeness (94.3%)
Substrate self-consistency. The simulator wraps the incident description in a Markdown report, and the judge looks for evidence keywords inside it. Those evidence keywords are drawn from the description, so the gate confirms that the report template includes the description and that the judge can find the keywords. It catches breakage in the report template (for example a missing Summary section) but does not grade an LLM-written investigation.
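The near-tautology is easiest to see in code. This is a minimal stand-in for the simulator's template and the coverage judge, with an invented report layout; the point is only that the description flows into the report verbatim, so keywords drawn from it are found by construction.

```python
def render_report(incident):
    # Hypothetical stand-in for the simulator's Markdown template: the
    # description is embedded verbatim, which is what makes the metric
    # near-tautological.
    return f"# Investigation report\n\n## Summary\n{incident['description']}\n"

def keyword_coverage(report, evidence_keywords):
    """Fraction of expected evidence keywords found anywhere in the report."""
    text = report.lower()
    found = sum(kw.lower() in text for kw in evidence_keywords)
    return found / len(evidence_keywords)
```

If the template silently drops the Summary section, coverage falls and the gate fires — which is the one failure mode this metric genuinely detects.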
Response-plan quality (1.000)
Substrate self-consistency. The synthesizer embeds the expected MITRE techniques and the first evidence keyword directly into the templated plan, and the rubric judge checks for them. The score is ~1.000 by construction. This catches a broken templating pipeline (for example, the synthesizer silently dropping an action class) but is not a grade of LLM output. The 1.000 is a green regression-gate signal, not a quality measurement.
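A rubric judge of this kind is a handful of substring checks averaged into a score. The five criteria below are illustrative placeholders, not the harness's real rubric; they show why a templated plan that embeds all the expected strings scores 1.0 by construction.

```python
def rubric_score(plan, expected_techniques, first_keyword):
    """Five illustrative criteria, each worth 0.2 of the score."""
    checks = [
        all(t in plan for t in expected_techniques),  # MITRE techniques cited
        first_keyword in plan,                        # evidence keyword cited
        "Contain" in plan,                            # containment action present
        "Eradicate" in plan,                          # eradication action present
        "Recover" in plan,                            # recovery action present
    ]
    return sum(checks) / len(checks)
```

The synthesizer writes all five items into the plan, so the only way to score below 1.0 is for the templating pipeline to drop one — the failure the gate exists to catch.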
The next milestone is an online eval: nightly runs that drive the real LangGraph agent against the same dataset, with an LLM-as-judge gated by OPENAI_API_KEY. That is the run where actual agent accuracy is measured. Tracking issue: github.com/beenuar/AiSOC/issues.
Reproduce these numbers
No Docker, no API key, no GPU, no LLM call. The harness is deterministic and runs in roughly 25 ms.
git clone https://github.com/beenuar/AiSOC && cd AiSOC
python3 scripts/run_evals.py
Expected output:
============================================================================
AiSOC Pillar-1 Eval - 200-incident synthetic benchmark
============================================================================
[PASS] mitre_accuracy accuracy 0.970 (target >= 0.80)
[PASS] alert_reduction reduction_ratio 0.753 (target >= 0.70)
[PASS] investigation_completeness mean_keyword_coverage 0.943 (target >= 0.85)
[PASS] response_quality mean_rubric_score 1.000 (target >= 0.80)
============================================================================
ALL GATES PASSED
For machine-readable output, pass --json or --ci --out report.json (the latter also exits non-zero on regression).
Comparison to other AI SOC offerings
Where a vendor publishes a number or a verifiable capability, it is cited. Where a vendor does not, the row is marked absent.
| Product | Alert reduction | MITRE accuracy gate | Decision audit | Self-host | Reproducible harness |
|---|---|---|---|---|---|
| AiSOC (open) | 75.3% (measured on fixed noisy stream) | 97% (substrate regression gate) | Per-step ledger | Yes (MIT) | Yes — every PR to main / develop |
| Closed-source AI SOC | Vendor claim, no harness | Not published | Vendor portal | No (cloud only) | No |
| Closed-source SOAR | N/A (SOAR) | Not applicable | Run history | On-prem option | No published harness |
A self-hostable, MIT-licensed agent with a published regression harness can be reviewed directly by an auditor. Vendor cloud agents typically cannot be reviewed at the same level.
What this is not
- No LLM agent runs here. The harness exercises deterministic substrate code: extractors, fusion, templates, and keyword judges. The live LangGraph orchestrator (services/agents/app/investigator/) is not invoked. An online eval that drives it nightly is on the roadmap.
- The dataset is synthetic. 200 incidents are enough to flag substrate regressions but not enough to claim production parity. Federated, opt-in real-customer evaluation is on the roadmap.
- The judges are keyword-based. They can be gamed by template-stuffing. In three of the four suites the templates already include the keywords the judge looks for, which is why those suites are labelled substrate self-consistency rather than agent quality. The LLM-as-judge variant is the follow-up.
- “Public eval harness” means this harness, not a third-party leaderboard. No outside body grades AiSOC. The dataset, the code, and the gates are open and CI-enforced, and anyone can run, audit, or extend the harness.
Contributing to the harness
New fixtures for missed tactics or fusion edge cases, replacements for tautological judges, and the online LLM-as-judge variant are all in scope for contributions.