# Benchmark Accuracy Testing
Test SPOT's phishing detection accuracy against labeled datasets. The benchmark system normalizes various dataset formats, submits emails to SPOT via the standard API, and computes accuracy metrics.
## Quick Start

```bash
# 1. Normalize a small dataset (2K emails, fast)
spot-cli benchmark normalize --dataset miltchev

# 2. Run the benchmark
spot-cli benchmark run --dataset miltchev --workflow default-workflow

# 3. View results
spot-cli benchmark report --latest
```
## Commands

### benchmark datasets
List all available datasets and their normalization status.
### benchmark normalize
Convert raw datasets into SPOT-compatible format (JSONL). This is a one-time step per dataset.
```bash
# Normalize one dataset
spot-cli benchmark normalize --dataset alhuzali_balanced

# Normalize all datasets
spot-cli benchmark normalize --all

# Normalize first 100 emails per dataset (for quick testing)
spot-cli benchmark normalize --all --sample 100
```
Normalized data is stored in `core/benchmarks/normalized/<dataset>/emails.jsonl`.
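Because each line of `emails.jsonl` is one normalized record, ordinary JSON tooling works on it. A minimal sketch for checking a dataset's label balance; the per-record key name (`label`) is an assumption about the `NormalizedEmail` schema, only the path layout comes from this page:

```python
import json
from collections import Counter
from pathlib import Path

# Path layout is documented above; the "label" key is an assumed
# field of the NormalizedEmail schema.
path = Path("core/benchmarks/normalized/miltchev/emails.jsonl")

labels = Counter()
with path.open() as f:
    for line in f:
        labels[json.loads(line)["label"]] += 1

print(dict(labels))  # e.g. counts of phishing vs. legitimate
```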
### benchmark run

Submit normalized emails to SPOT and collect results. Emails are submitted via the same `POST /api/v1/analyze` endpoint used by `spot-cli analyze submit`.
```bash
# Benchmark a specific dataset with a specific workflow
spot-cli benchmark run --dataset miltchev --workflow nlp-only-workflow

# Benchmark all normalized datasets
spot-cli benchmark run --all --workflow default-workflow

# Limit to 500 emails with higher concurrency
spot-cli benchmark run --dataset alhuzali_balanced --max-emails 500 --concurrency 50

# Only test phishing detection (skip legitimate emails)
spot-cli benchmark run --dataset champa_enron --label phishing

# Preview without submitting
spot-cli benchmark run --dry-run --dataset miltchev

# Resume an interrupted run
spot-cli benchmark run --resume run_2026-04-08_14-30-00
```
Results are stored in `core/benchmarks/runs/<run_id>/results.jsonl`.
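Because the benchmark reuses the public API, a single submission can be reproduced by hand, which is handy when debugging a surprising verdict. A minimal sketch with `requests`; only the endpoint path comes from this page, while the base URL, authentication, and payload field names are assumptions:

```python
import requests

# Endpoint path is documented above; the base URL and the payload
# keys ("workflow", "raw_email") are assumptions; check the API docs.
resp = requests.post(
    "http://localhost:8000/api/v1/analyze",
    json={
        "workflow": "default-workflow",
        "raw_email": "Subject: Verify your account\n\nClick here to confirm...",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # verdict shape depends on the SPOT API version
```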
### benchmark report
Compute and display accuracy metrics for a completed run.
```bash
# Report on a specific run
spot-cli benchmark report run_2026-04-08_14-30-00

# Report on the most recent run
spot-cli benchmark report --latest

# Show per-dataset breakdown
spot-cli benchmark report --latest --per-dataset

# Export as JSON
spot-cli benchmark report --latest --json
```
### benchmark compare
Compare two runs side-by-side (e.g., different workflows, before/after a change).
### benchmark list
List all benchmark runs.
## Datasets

### Included Datasets
| Dataset | Emails | Labels | Source |
|---|---|---|---|
| alhuzali_balanced | ~198K | body + numeric label (0/1) | Zenodo (CC BY 4.0) |
| alhuzali_merged | ~213K | body + numeric label (0/1) | Zenodo (CC BY 4.0) |
| champa_ceas08 | ~39K | full headers + binary label | Zenodo (CC BY 4.0) |
| champa_enron | ~30K | subject + body + binary label | Zenodo (CC BY 4.0) |
| champa_ling | ~3K | subject + body + binary label | Zenodo (CC BY 4.0) |
| champa_nazario | ~1.5K | full headers, phishing only | Zenodo (CC BY 4.0) |
| champa_nazario5 | ~3K | nazario + legitimate baseline | Zenodo (CC BY 4.0) |
| champa_nigerian | ~3.3K | Nigerian fraud, phishing only | Zenodo (CC BY 4.0) |
| champa_nigerian5 | ~6.3K | Nigerian + legitimate baseline | Zenodo (CC BY 4.0) |
| champa_spamassassin | ~5.8K | SpamAssassin corpus | Zenodo (CC BY 4.0) |
| champa_trec05 | ~55K | TREC 2005 spam track | Zenodo (CC BY 4.0) |
| champa_trec06 | ~16K | TREC 2006 spam track | Zenodo (CC BY 4.0) |
| champa_trec07 | ~54K | TREC 2007 spam track | Zenodo (CC BY 4.0) |
| nahmiasd_enron_ham | ~3K | Enron legitimate emails | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_hard_ham | ~490 | Hard-to-classify legitimate | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_traditional_phishing | ~3.3K | Traditional phishing (419 scams) | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_spear_phishing | ~334 | Targeted spear phishing | GitHub (CC BY-NC-SA 4.0) |
| subhajournal | ~175K | body + text label | Kaggle (LGPL-3.0) |
| miltchev | ~2K | body + text label (balanced) | Zenodo (CC BY 4.0) |
| pashakhin | ~27K | Maildir, all legitimate | Zenodo (CC BY 4.0) |
**Excluded:** `nc3` (email bodies stored as SHA-256 hashes, not usable for content analysis).
### Normalization Details
Datasets come in different formats with different fields. The normalizer converts them all to SPOT's `Email` model format:

- **Body-only CSVs** (alhuzali, subhajournal, miltchev): synthetic email headers are generated (sender, recipients, message-id, date); only the body text is real.
- **Rich CSVs** (champa): headers are mapped from CSV columns where available; missing fields use synthetic values.
- **JSON directories** (nahmiasd): subject and body come from JSON keys; labels come from the directory name (e.g., `spear_phishing/` = phishing).
- **Maildir** (pashakhin): full SMTP headers and MIME body extraction; labels come from the `X-Spam-Flag` header or a fixed assignment.
Synthetic fields use the `@benchmark.spot.local` domain and are clearly identifiable.
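For orientation, generating those synthetic fields might look like the sketch below. The helper name and exact dict shape are hypothetical; only the domain and the list of generated fields (sender, recipients, message-id, date) come from this page:

```python
import uuid
from email.utils import formatdate

# Hypothetical helper: the real normalizer lives in normalize.py and may
# differ. Only the domain and the generated fields are from the docs.
SYNTHETIC_DOMAIN = "benchmark.spot.local"

def synthetic_headers(index: int) -> dict:
    return {
        "sender": f"sender-{index}@{SYNTHETIC_DOMAIN}",
        "recipients": [f"recipient-{index}@{SYNTHETIC_DOMAIN}"],
        "message_id": f"<{uuid.uuid4()}@{SYNTHETIC_DOMAIN}>",
        "date": formatdate(),  # RFC 2822 timestamp
    }
```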
## Metrics
The following metrics are computed from the confusion matrix:
| Metric | Formula | What it measures |
|---|---|---|
| Precision | TP / (TP + FP) | Of flagged emails, how many are actually phishing? |
| Recall | TP / (TP + FN) | Of actual phishing, how many did we catch? |
| F1 | 2 * P * R / (P + R) | Harmonic mean of precision and recall |
| F2 | 5 * P * R / (4P + R) | F-score weighted toward recall (catching phishing matters more) |
| FPR | FP / (FP + TN) | False alarm rate (legitimate flagged as phishing) |
| FNR | FN / (TP + FN) | Miss rate (phishing classified as legitimate) |
| Accuracy | (TP + TN) / Total | Overall correctness |
Where:
- TP (True Positive): Phishing email correctly detected
- TN (True Negative): Legitimate email correctly classified
- FP (False Positive): Legitimate email incorrectly flagged as phishing
- FN (False Negative): Phishing email missed (classified as legitimate)
Key metrics for phishing detection:

- **Recall** is critical: missing a phishing email has a higher cost than a false alarm.
- **F2** weights recall higher than precision, making it the best single metric for phishing detection.
- **FPR** should be minimized to avoid alert fatigue.
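As a sanity check, the formulas above reduce to a few lines of Python. This is a self-contained sketch of the arithmetic, not the `benchmark report` implementation:

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Confusion-matrix metrics using the formulas from the table above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr = precision + recall
    f1 = 2 * precision * recall / pr if pr else 0.0
    # F2 weights recall twice as heavily as precision (beta = 2).
    f2 = 5 * precision * recall / (4 * precision + recall) if pr else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "f2": f2,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "fnr": fn / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Illustrative numbers: 95 phishing caught, 5 missed, 10 false alarms.
print(metrics(tp=95, tn=90, fp=10, fn=5))
```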
## Typical Workflow

### Initial accuracy baseline
```bash
# Normalize the small validation set
spot-cli benchmark normalize --dataset miltchev

# Quick sanity check
spot-cli benchmark run --dataset miltchev --workflow default-workflow
spot-cli benchmark report --latest
```
### Compare workflows
```bash
spot-cli benchmark run --dataset miltchev --workflow nlp-only-workflow
spot-cli benchmark run --dataset miltchev --workflow default-workflow
spot-cli benchmark compare run_<first> run_<second>
```
### Large-scale test
```bash
# Normalize everything
spot-cli benchmark normalize --all

# Run with higher concurrency
spot-cli benchmark run --all --workflow default-workflow --concurrency 50

# Detailed report
spot-cli benchmark report --latest --per-dataset
```
### Testing specific phishing types
```bash
# Test against spear phishing specifically
spot-cli benchmark run --dataset nahmiasd_spear_phishing --workflow default-workflow

# Test against traditional phishing
spot-cli benchmark run --dataset nahmiasd_traditional_phishing --workflow default-workflow
```
## File Structure
```
core/benchmarks/                  # .gitignored, created at runtime
  normalized/                     # Pre-normalized datasets
    miltchev/
      manifest.json               # Count, label distribution
      emails.jsonl                # One NormalizedEmail per line
    alhuzali_balanced/
    ...
  runs/                           # Benchmark results
    run_2026-04-08_14-30-00/
      config.json                 # Run parameters
      results.jsonl               # One BenchmarkResult per line
      metrics.json                # Computed metrics (after report)
```
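Since every artifact is plain JSON or JSONL, runs can be post-processed outside the CLI. A minimal sketch assuming only the layout above; the keys inside each file are whatever the CLI wrote, so they are treated as opaque here:

```python
import json
from pathlib import Path

# Layout is from the tree above; file contents are not specified here,
# so this just loads them without assuming any particular keys.
run_dir = Path("core/benchmarks/runs/run_2026-04-08_14-30-00")

config = json.loads((run_dir / "config.json").read_text())
metrics = json.loads((run_dir / "metrics.json").read_text())  # exists after `report`
results = [
    json.loads(line)
    for line in (run_dir / "results.jsonl").read_text().splitlines()
    if line
]

print(f"{len(results)} results")
print(config)
print(metrics)
```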
## Adding New Datasets
1. Place the dataset in `<spot-workspace>/datasets/`.
2. Add an entry to `core/cli/commands/benchmark/datasets.py` with the appropriate `DatasetInfo`.
3. If the format is new, add a parser function in `normalize.py` (a hypothetical example follows this list).
4. Run `spot-cli benchmark normalize --dataset <name>` to verify.
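For orientation, a new entry and parser might look like the sketch below. The `DatasetInfo` fields and the parser signature are assumptions; mirror an existing entry in `datasets.py` and `normalize.py` for the real interface:

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Stand-in for the real class in datasets.py; its actual fields may differ.
@dataclass
class DatasetInfo:
    name: str
    path: str      # relative to <spot-workspace>/datasets/
    format: str    # selects the parser in normalize.py
    license: str

MY_DATASET = DatasetInfo(
    name="my_dataset",
    path="my_dataset/",
    format="csv",
    license="CC BY 4.0",
)

# Hypothetical parser for normalize.py: yields one record per email.
def parse_my_dataset(root: Path) -> Iterator[dict]:
    with (root / "emails.csv").open(newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "subject": row.get("subject", ""),
                "body": row["body"],
                "label": "phishing" if row["label"] == "1" else "legitimate",
            }
```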