# Benchmark Accuracy Testing
Test SPOT's phishing detection accuracy against labeled datasets. The benchmark system normalizes various dataset formats, submits emails to SPOT via the standard API, and computes accuracy metrics.
## Quick Start

```bash
# 1. Normalize a small dataset (2K emails, fast)
spot-cli benchmark normalize --dataset miltchev

# 2. Run the benchmark
spot-cli benchmark run --dataset miltchev --workflow default-workflow

# 3. View results
spot-cli benchmark report --latest
```
## Commands

### benchmark datasets
List all available datasets and their normalization status.
### benchmark normalize
Convert raw datasets into SPOT-compatible format (JSONL). This is a one-time step per dataset.
```bash
# Normalize one dataset
spot-cli benchmark normalize --dataset alhuzali_balanced

# Normalize all datasets
spot-cli benchmark normalize --all

# Normalize first 100 emails per dataset (for quick testing)
spot-cli benchmark normalize --all --sample 100
```
Normalized data is stored in `core/benchmarks/normalized/<dataset>/emails.jsonl`.
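Because each line of `emails.jsonl` is one normalized record, ordinary JSON tooling works on it. A minimal sketch for checking a dataset's label balance; the per-record key name (`label`) is an assumption about the `NormalizedEmail` schema, only the path layout comes from this page:

```python
import json
from collections import Counter
from pathlib import Path

# Path layout is documented above; the "label" key is an assumed
# field of the NormalizedEmail schema.
path = Path("core/benchmarks/normalized/miltchev/emails.jsonl")

labels = Counter()
with path.open() as f:
    for line in f:
        labels[json.loads(line)["label"]] += 1

print(dict(labels))  # e.g. counts of phishing vs. legitimate
```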
### benchmark run

Submit normalized emails to SPOT and collect results. Emails are submitted via the same `POST /api/v1/analyze` endpoint used by `spot-cli analyze submit`.
```bash
# Benchmark a specific dataset with a specific workflow
spot-cli benchmark run --dataset miltchev --workflow nlp-only-workflow

# Benchmark all normalized datasets
spot-cli benchmark run --all --workflow default-workflow

# Limit to 500 emails with higher concurrency
spot-cli benchmark run --dataset alhuzali_balanced --max-emails 500 --concurrency 50

# Only test phishing detection (skip legitimate emails)
spot-cli benchmark run --dataset champa_enron --label phishing

# Preview without submitting
spot-cli benchmark run --dry-run --dataset miltchev

# Resume an interrupted run
spot-cli benchmark run --resume run_2026-04-08_14-30-00
```
Results are stored in `core/benchmarks/runs/<run_id>/results.jsonl`.
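Because the benchmark reuses the public API, a single submission can be reproduced by hand, which is handy when debugging a surprising verdict. A minimal sketch with `requests`; only the endpoint path comes from this page, while the base URL, authentication, and payload field names are assumptions:

```python
import requests

# Endpoint path is documented above; the base URL and the payload
# keys ("workflow", "raw_email") are assumptions; check the API docs.
resp = requests.post(
    "http://localhost:8000/api/v1/analyze",
    json={
        "workflow": "default-workflow",
        "raw_email": "Subject: Verify your account\n\nClick here to confirm...",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # verdict shape depends on the SPOT API version
```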
### benchmark report
Compute and display accuracy metrics for a completed run.
```bash
# Report on a specific run
spot-cli benchmark report run_2026-04-08_14-30-00

# Report on the most recent run
spot-cli benchmark report --latest

# Show per-dataset breakdown
spot-cli benchmark report --latest --per-dataset

# Export as JSON
spot-cli benchmark report --latest --json
```
### benchmark compare
Compare two runs side-by-side (e.g., different workflows, before/after a change).
### benchmark list
List all benchmark runs.
## Datasets

### Included Datasets
| Dataset | Emails | Labels | Source |
|---|---|---|---|
| alhuzali_balanced | ~198K | body + numeric label (0/1) | Zenodo (CC BY 4.0) |
| alhuzali_merged | ~213K | body + numeric label (0/1) | Zenodo (CC BY 4.0) |
| champa_ceas08 | ~39K | full headers + binary label | Zenodo (CC BY 4.0) |
| champa_enron | ~30K | subject + body + binary label | Zenodo (CC BY 4.0) |
| champa_ling | ~3K | subject + body + binary label | Zenodo (CC BY 4.0) |
| champa_nazario | ~1.5K | full headers, phishing only | Zenodo (CC BY 4.0) |
| champa_nazario5 | ~3K | nazario + legitimate baseline | Zenodo (CC BY 4.0) |
| champa_nigerian | ~3.3K | Nigerian fraud, phishing only | Zenodo (CC BY 4.0) |
| champa_nigerian5 | ~6.3K | Nigerian + legitimate baseline | Zenodo (CC BY 4.0) |
| champa_spamassassin | ~5.8K | SpamAssassin corpus | Zenodo (CC BY 4.0) |
| champa_trec05 | ~55K | TREC 2005 spam track | Zenodo (CC BY 4.0) |
| champa_trec06 | ~16K | TREC 2006 spam track | Zenodo (CC BY 4.0) |
| champa_trec07 | ~54K | TREC 2007 spam track | Zenodo (CC BY 4.0) |
| nahmiasd_enron_ham | ~3K | Enron legitimate emails | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_hard_ham | ~490 | Hard-to-classify legitimate | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_traditional_phishing | ~3.3K | Traditional phishing (419 scams) | GitHub (CC BY-NC-SA 4.0) |
| nahmiasd_spear_phishing | ~334 | Targeted spear phishing | GitHub (CC BY-NC-SA 4.0) |
| subhajournal | ~175K | body + text label | Kaggle (LGPL-3.0) |
| miltchev | ~2K | body + text label (balanced) | Zenodo (CC BY 4.0) |
| pashakhin | ~27K | Maildir, all legitimate | Zenodo (CC BY 4.0) |
**Excluded:** `nc3` (email bodies stored as SHA-256 hashes, not usable for content analysis).
### Normalization Details
Datasets come in different formats with different fields. The normalizer converts them all to SPOT's `Email` model format:

- **Body-only CSVs** (alhuzali, subhajournal, miltchev): synthetic email headers are generated (sender, recipients, message-id, date); only the body text is real.
- **Rich CSVs** (champa): headers are mapped from CSV columns where available; missing fields use synthetic values.
- **JSON directories** (nahmiasd): subject and body come from JSON keys; labels come from the directory name (e.g., `spear_phishing/` = phishing).
- **Maildir** (pashakhin): full SMTP headers and MIME body extraction; labels come from the `X-Spam-Flag` header or a fixed assignment.
Synthetic fields use the `@benchmark.spot.local` domain and are clearly identifiable.
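For orientation, generating those synthetic fields might look like the sketch below. The helper name and exact dict shape are hypothetical; only the domain and the list of generated fields (sender, recipients, message-id, date) come from this page:

```python
import uuid
from email.utils import formatdate

# Hypothetical helper: the real normalizer lives in normalize.py and may
# differ. Only the domain and the generated fields are from the docs.
SYNTHETIC_DOMAIN = "benchmark.spot.local"

def synthetic_headers(index: int) -> dict:
    return {
        "sender": f"sender-{index}@{SYNTHETIC_DOMAIN}",
        "recipients": [f"recipient-{index}@{SYNTHETIC_DOMAIN}"],
        "message_id": f"<{uuid.uuid4()}@{SYNTHETIC_DOMAIN}>",
        "date": formatdate(),  # RFC 2822 timestamp
    }
```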
## Metrics
The following metrics are computed from the confusion matrix:
| Metric | Formula | What it measures |
|---|---|---|
| Precision | TP / (TP + FP) | Of flagged emails, how many are actually phishing? |
| Recall | TP / (TP + FN) | Of actual phishing, how many did we catch? |
| F1 | 2 * P * R / (P + R) | Harmonic mean of precision and recall |
| F2 | 5 * P * R / (4P + R) | F-score weighted toward recall (catching phishing matters more) |
| FPR | FP / (FP + TN) | False alarm rate (legitimate flagged as phishing) |
| FNR | FN / (TP + FN) | Miss rate (phishing classified as legitimate) |
| Accuracy | (TP + TN) / Total | Overall correctness |
Where:
- TP (True Positive): Phishing email correctly detected
- TN (True Negative): Legitimate email correctly classified
- FP (False Positive): Legitimate email incorrectly flagged as phishing
- FN (False Negative): Phishing email missed (classified as legitimate)
Key metrics for phishing detection:

- **Recall** is critical: missing a phishing email has a higher cost than a false alarm.
- **F2** weights recall higher than precision, making it the best single metric for phishing detection.
- **FPR** should be minimized to avoid alert fatigue.
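As a sanity check, the formulas above reduce to a few lines of Python. This is a self-contained sketch of the arithmetic, not the `benchmark report` implementation:

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Confusion-matrix metrics using the formulas from the table above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr = precision + recall
    f1 = 2 * precision * recall / pr if pr else 0.0
    # F2 weights recall twice as heavily as precision (beta = 2).
    f2 = 5 * precision * recall / (4 * precision + recall) if pr else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "f2": f2,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "fnr": fn / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Illustrative numbers: 95 phishing caught, 5 missed, 10 false alarms.
print(metrics(tp=95, tn=90, fp=10, fn=5))
```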
## Typical Workflow

### Initial accuracy baseline
```bash
# Normalize the small validation set
spot-cli benchmark normalize --dataset miltchev

# Quick sanity check
spot-cli benchmark run --dataset miltchev --workflow default-workflow
spot-cli benchmark report --latest
```
### Compare workflows
```bash
spot-cli benchmark run --dataset miltchev --workflow nlp-only-workflow
spot-cli benchmark run --dataset miltchev --workflow default-workflow
spot-cli benchmark compare run_<first> run_<second>
```
### Large-scale test
```bash
# Normalize everything
spot-cli benchmark normalize --all

# Run with higher concurrency
spot-cli benchmark run --all --workflow default-workflow --concurrency 50

# Detailed report
spot-cli benchmark report --latest --per-dataset
```
### Testing specific phishing types
```bash
# Test against spear phishing specifically
spot-cli benchmark run --dataset nahmiasd_spear_phishing --workflow default-workflow

# Test against traditional phishing
spot-cli benchmark run --dataset nahmiasd_traditional_phishing --workflow default-workflow
```
## File Structure
```
core/benchmarks/                  # .gitignored, created at runtime
  normalized/                     # Pre-normalized datasets
    miltchev/
      manifest.json               # Count, label distribution
      emails.jsonl                # One NormalizedEmail per line
    alhuzali_balanced/
    ...
  runs/                           # Benchmark results
    run_2026-04-08_14-30-00/
      config.json                 # Run parameters
      results.jsonl               # One BenchmarkResult per line
      metrics.json                # Computed metrics (after report)
```
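Since every artifact is plain JSON or JSONL, runs can be post-processed outside the CLI. A minimal sketch assuming only the layout above; the keys inside each file are whatever the CLI wrote, so they are treated as opaque here:

```python
import json
from pathlib import Path

# Layout is from the tree above; file contents are not specified here,
# so this just loads them without assuming any particular keys.
run_dir = Path("core/benchmarks/runs/run_2026-04-08_14-30-00")

config = json.loads((run_dir / "config.json").read_text())
metrics = json.loads((run_dir / "metrics.json").read_text())  # exists after `report`
results = [
    json.loads(line)
    for line in (run_dir / "results.jsonl").read_text().splitlines()
    if line
]

print(f"{len(results)} results")
print(config)
print(metrics)
```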
## Adding New Datasets
1. Place the dataset in `<spot-workspace>/datasets/`.
2. Add an entry to `core/cli/commands/benchmark/datasets.py` with the appropriate `DatasetInfo`.
3. If the format is new, add a parser function in `normalize.py` (a hypothetical example follows this list).
4. Run `spot-cli benchmark normalize --dataset <name>` to verify.
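For orientation, a new entry and parser might look like the sketch below. The `DatasetInfo` fields and the parser signature are assumptions; mirror an existing entry in `datasets.py` and `normalize.py` for the real interface:

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Stand-in for the real class in datasets.py; its actual fields may differ.
@dataclass
class DatasetInfo:
    name: str
    path: str      # relative to <spot-workspace>/datasets/
    format: str    # selects the parser in normalize.py
    license: str

MY_DATASET = DatasetInfo(
    name="my_dataset",
    path="my_dataset/",
    format="csv",
    license="CC BY 4.0",
)

# Hypothetical parser for normalize.py: yields one record per email.
def parse_my_dataset(root: Path) -> Iterator[dict]:
    with (root / "emails.csv").open(newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "subject": row.get("subject", ""),
                "body": row["body"],
                "label": "phishing" if row["label"] == "1" else "legitimate",
            }
```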