
Benchmark Accuracy Testing

Test SPOT's phishing detection accuracy against labeled datasets. The benchmark system normalizes various dataset formats, submits emails to SPOT via the standard API, and computes accuracy metrics.

Quick Start

# 1. Normalize a small dataset (2K emails, fast)
spot-cli benchmark normalize --dataset miltchev

# 2. Run the benchmark
spot-cli benchmark run --dataset miltchev --workflow default-workflow

# 3. View results
spot-cli benchmark report --latest

Commands

benchmark datasets

List all available datasets and their normalization status.

spot-cli benchmark datasets

benchmark normalize

Convert raw datasets into SPOT-compatible format (JSONL). This is a one-time step per dataset.

# Normalize one dataset
spot-cli benchmark normalize --dataset alhuzali_balanced

# Normalize all datasets
spot-cli benchmark normalize --all

# Normalize first 100 emails per dataset (for quick testing)
spot-cli benchmark normalize --all --sample 100

Normalized data is stored in core/benchmarks/normalized/<dataset>/emails.jsonl.
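Each line of emails.jsonl holds one normalized record. As a rough sketch, a record might look like the following; the exact NormalizedEmail field names are defined by SPOT's Email model, so the keys below are assumptions for illustration only:

```python
import json

# Hypothetical example of one normalized JSONL record. Only the
# ground-truth label and synthetic-domain convention come from the
# docs; the field names are illustrative.
line = json.dumps({
    "message_id": "<1@benchmark.spot.local>",   # synthetic header
    "subject": "Account verification required",
    "body": "Please confirm your credentials...",
    "label": "phishing",                        # ground truth from the dataset
    "dataset": "miltchev",
})

record = json.loads(line)
assert record["label"] in ("phishing", "legitimate")
```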

benchmark run

Submit normalized emails to SPOT and collect results. Emails are submitted via the same POST /api/v1/analyze endpoint used by spot-cli analyze submit.

# Benchmark a specific dataset with a specific workflow
spot-cli benchmark run --dataset miltchev --workflow nlp-only-workflow

# Benchmark all normalized datasets
spot-cli benchmark run --all --workflow default-workflow

# Limit to 500 emails with higher concurrency
spot-cli benchmark run --dataset alhuzali_balanced --max-emails 500 --concurrency 50

# Only test phishing detection (skip legitimate emails)
spot-cli benchmark run --dataset champa_enron --label phishing

# Preview without submitting
spot-cli benchmark run --dry-run --dataset miltchev

# Resume an interrupted run
spot-cli benchmark run --resume run_2026-04-08_14-30-00

Results are stored in core/benchmarks/runs/<run_id>/results.jsonl.
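The --concurrency flag bounds how many emails are in flight at once. A minimal sketch of that pattern, with a placeholder standing in for the POST /api/v1/analyze call (the real HTTP client, payload, and response shape are not shown in this doc):

```python
import asyncio

async def submit(email, sem, analyze=None):
    """Submit one email under a concurrency limit.

    `analyze` stands in for the POST /api/v1/analyze call; when no
    client is wired up, a stub result is returned so the bounding
    pattern itself can be demonstrated.
    """
    async with sem:
        if analyze is None:
            await asyncio.sleep(0)                    # placeholder for network I/O
            return {"id": email["id"], "verdict": "unknown"}
        return await analyze(email)

async def run_benchmark(emails, concurrency=50):
    # Semaphore caps in-flight submissions, mirroring --concurrency 50.
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(submit(e, sem) for e in emails))

results = asyncio.run(run_benchmark([{"id": i} for i in range(5)]))
```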

benchmark report

Compute and display accuracy metrics for a completed run.

# Report on a specific run
spot-cli benchmark report run_2026-04-08_14-30-00

# Report on the most recent run
spot-cli benchmark report --latest

# Show per-dataset breakdown
spot-cli benchmark report --latest --per-dataset

# Export as JSON
spot-cli benchmark report --latest --json

benchmark compare

Compare two runs side-by-side (e.g., different workflows, before/after a change).

spot-cli benchmark compare run_2026-04-08_14-30-00 run_2026-04-08_15-00-00
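Conceptually, a comparison reduces to a per-metric delta between the two runs' computed metrics. A small illustration of that arithmetic, using hypothetical metric values (diff_metrics is not part of spot-cli):

```python
def diff_metrics(a, b):
    """Per-metric delta between two runs (b minus a), for metrics present in both."""
    return {k: round(b[k] - a[k], 4) for k in a if k in b}

# Hypothetical values for two runs of the same dataset.
baseline = {"precision": 0.91, "recall": 0.84}
candidate = {"precision": 0.89, "recall": 0.93}

delta = diff_metrics(baseline, candidate)
# A recall gain of 0.09 at a 0.02 precision cost would usually be a
# win for phishing detection, where recall is weighted higher.
```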

benchmark list

List all benchmark runs.

spot-cli benchmark list

Datasets

Included Datasets

| Dataset | Emails | Labels | Source |
| --- | --- | --- | --- |
| alhuzali_balanced | ~198K | body + numeric label (0/1) | Zenodo (CC BY 4.0) |
| alhuzali_merged | ~213K | body + numeric label (0/1) | Zenodo (CC BY 4.0) |
| champa_ceas08 | ~39K | full headers + binary label | Zenodo (CC BY 4.0) |
| champa_enron | ~30K | subject + body + binary label | Zenodo (CC BY 4.0) |
| champa_ling | ~3K | subject + body + binary label | Zenodo (CC BY 4.0) |
| champa_nazario | ~1.5K | full headers, phishing only | Zenodo (CC BY 4.0) |
| champa_nazario5 | ~3K | nazario + legitimate baseline | Zenodo (CC BY 4.0) |
| champa_nigerian | ~3.3K | Nigerian fraud, phishing only | Zenodo (CC BY 4.0) |
| champa_nigerian5 | ~6.3K | Nigerian + legitimate baseline | Zenodo (CC BY 4.0) |
| champa_spamassassin | ~5.8K | SpamAssassin corpus | Zenodo (CC BY 4.0) |
| champa_trec05 | ~55K | TREC 2005 spam track | Zenodo (CC BY 4.0) |
| champa_trec06 | ~16K | TREC 2006 spam track | Zenodo (CC BY 4.0) |
| champa_trec07 | ~54K | TREC 2007 spam track | Zenodo (CC BY 4.0) |
| nahmiasd_enron_ham | ~3K | Enron legitimate emails | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_hard_ham | ~490 | hard-to-classify legitimate | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_traditional_phishing | ~3.3K | traditional phishing (419 scams) | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_spear_phishing | ~334 | targeted spear phishing | GitHub (CC BY-NC-SA 4.0) |
| subhajournal | ~175K | body + text label | Kaggle (LGPL-3.0) |
| miltchev | ~2K | body + text label (balanced) | Zenodo (CC BY 4.0) |
| pashakhin | ~27K | Maildir, all legitimate | Zenodo (CC BY 4.0) |

Excluded: nc3 (email bodies stored as SHA256 hashes, not usable for content analysis).

Normalization Details

Datasets come in different formats with different fields. The normalizer converts them all to SPOT's Email model format:

  • Body-only CSVs (alhuzali, subhajournal, miltchev): Synthetic email headers are generated (sender, recipients, message-id, date). Only the body text is real.
  • Rich CSVs (champa): Headers mapped from CSV columns where available. Missing fields use synthetic values.
  • JSON directories (nahmiasd): Subject and body from JSON keys. Labels from directory name (e.g., spear_phishing/ = phishing).
  • Maildir (pashakhin): Full SMTP headers and MIME body extraction. Labels from X-Spam-Flag header or fixed assignment.

Synthetic fields use the @benchmark.spot.local domain and are clearly identifiable.
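For body-only datasets, synthetic header generation could look something like this; the field names and the deterministic message-id scheme are assumptions, since only the @benchmark.spot.local domain is specified above:

```python
import hashlib
from datetime import datetime, timezone

def synthetic_headers(body: str, dataset: str) -> dict:
    """Generate placeholder headers for a body-only record.

    Illustrative sketch: hashing the body gives a stable, clearly
    synthetic message-id, and every address uses the
    @benchmark.spot.local domain so synthetic fields are identifiable.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]
    return {
        "from": f"{dataset}@benchmark.spot.local",
        "to": ["analyst@benchmark.spot.local"],
        "message_id": f"<{digest}@benchmark.spot.local>",
        "date": datetime.now(timezone.utc).isoformat(),
    }
```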

Metrics

The following metrics are computed from the confusion matrix:

| Metric | Formula | What it measures |
| --- | --- | --- |
| Precision | TP / (TP + FP) | Of flagged emails, how many are actually phishing? |
| Recall | TP / (TP + FN) | Of actual phishing, how many did we catch? |
| F1 | 2 * P * R / (P + R) | Harmonic mean of precision and recall |
| F2 | 5 * P * R / (4P + R) | F-score weighted toward recall (catching phishing matters more) |
| FPR | FP / (FP + TN) | False alarm rate (legitimate flagged as phishing) |
| FNR | FN / (TP + FN) | Miss rate (phishing classified as legitimate) |
| Accuracy | (TP + TN) / Total | Overall correctness |

Where:

  • TP (True Positive): Phishing email correctly detected
  • TN (True Negative): Legitimate email correctly classified
  • FP (False Positive): Legitimate email incorrectly flagged as phishing
  • FN (False Negative): Phishing email missed (classified as legitimate)
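The formulas above translate directly into code. A self-contained sketch, including the zero-division guards a real implementation needs for single-label datasets (e.g. champa_nazario has no legitimate emails, so TN + FP = 0):

```python
def metrics(tp, tn, fp, fn):
    """Compute benchmark metrics from confusion-matrix counts.

    Guards return 0.0 when a denominator is zero, which happens for
    phishing-only or legitimate-only datasets.
    """
    p = tp / (tp + fp) if tp + fp else 0.0            # precision
    r = tp / (tp + fn) if tp + fn else 0.0            # recall
    return {
        "precision": p,
        "recall": r,
        "f1": 2 * p * r / (p + r) if p + r else 0.0,
        "f2": 5 * p * r / (4 * p + r) if 4 * p + r else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "fnr": fn / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```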

Key metrics for phishing detection:

  • Recall is critical: missing a phishing email has higher cost than a false alarm.
  • F2 weights recall higher than precision, making it the best single metric for phishing detection.
  • FPR should be minimized to avoid alert fatigue.

Typical Workflow

Initial accuracy baseline

# Normalize the small validation set
spot-cli benchmark normalize --dataset miltchev

# Quick sanity check
spot-cli benchmark run --dataset miltchev --workflow default-workflow
spot-cli benchmark report --latest

Compare workflows

spot-cli benchmark run --dataset miltchev --workflow nlp-only-workflow
spot-cli benchmark run --dataset miltchev --workflow default-workflow
spot-cli benchmark compare run_<first> run_<second>

Large-scale test

# Normalize everything
spot-cli benchmark normalize --all

# Run with higher concurrency
spot-cli benchmark run --all --workflow default-workflow --concurrency 50

# Detailed report
spot-cli benchmark report --latest --per-dataset

Testing specific phishing types

# Test against spear phishing specifically
spot-cli benchmark run --dataset nahmiasd_spear_phishing --workflow default-workflow

# Test against traditional phishing
spot-cli benchmark run --dataset nahmiasd_traditional_phishing --workflow default-workflow

File Structure

core/benchmarks/                    # .gitignored, created at runtime
    normalized/                     # Pre-normalized datasets
        miltchev/
            manifest.json           # Count, label distribution
            emails.jsonl            # One NormalizedEmail per line
        alhuzali_balanced/
            ...
    runs/                           # Benchmark results
        run_2026-04-08_14-30-00/
            config.json             # Run parameters
            results.jsonl           # One BenchmarkResult per line
            metrics.json            # Computed metrics (after report)

Adding New Datasets

  1. Place the dataset in <spot-workspace>/datasets/
  2. Add an entry to core/cli/commands/benchmark/datasets.py with the appropriate DatasetInfo
  3. If the format is new, add a parser function in normalize.py
  4. Run spot-cli benchmark normalize --dataset <name> to verify
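For step 3, a parser for a simple CSV dataset might take the following shape; the signature and record format normalize.py actually expects are not documented here, so treat this as a hypothetical illustration:

```python
import csv

def parse_my_dataset(path):
    """Hypothetical parser for a CSV dataset with subject, body, numeric label.

    Yields one dict per email; mapping 0/1 to text labels mirrors how
    the body-only datasets above are described.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield {
                "subject": row.get("subject", ""),
                "body": row["body"],
                "label": "phishing" if row["label"] == "1" else "legitimate",
            }
```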