# LoadLens: Evaluation Protocol (Pre-Registered)

**Version:** 1.0  
**Pre-registered:** 2026-05-08  
**Status:** Active  
**Owner:** Champlin Enterprises LLC

This document describes the evaluation methodology used to measure LoadLens forecast accuracy. It is **pre-registered**: the test sets, baselines, metrics, and stratifications below were declared *before* model selection and tuning. Any change to this protocol must be a new version with a new timestamp; prior versions remain archived in git history.

The purpose of pre-registration is to remove the "we ran 50 evals and reported the best one" failure mode that is endemic to applied ML claims. A grant reviewer or pilot cooperative reading this document should be able to reproduce the numbers exactly from the data on disk and the code at the referenced commit.

## 1. Hypothesis under test

LoadLens's adaptive ensemble produces hourly load forecasts that are more accurate, better calibrated, and more robust under regime shift than (a) standard naive baselines and (b) static equal-weight ensembles, on held-out data the model has never seen.

The claim is operational, not theoretical. The unit of credit is **dollars saved per MWh** for a rural electric cooperative under realistic dispatch decisions, not raw MAPE.

## 2. Datasets

| Slice | Source | Coverage | Use |
|---|---|---|---|
| `Z2_HIST` | Zone 2 (PJM via EIA Open Data) | 2025-04-12 → 2025-12-31 | Training / weight-warmup |
| `Z2_HOLDOUT_2026Q1` | Zone 2 | 2026-01-01 → 2026-03-31 | Temporal hold-out (primary) |
| `Z2_HOLDOUT_2026Q2` | Zone 2 | 2026-04-01 → present | Continuing temporal hold-out |
| `Z3_HIST` | Zone 3 (PJM via EIA Open Data) | 2025-04-12 → 2026-04-12 | Geographic / second-zone test |
| `SYN_MIDWEST` | `loadlens:seed-synthetic` | 12 months synthetic | Cross-region generalization (proxy until pilot coop signed) |

`SYN_MIDWEST` is generated with a different load shape (rural-coop profile: higher seasonal swing, summer irrigation peak, winter electric-heat shoulder) than the PJM zones. It is a *proxy* for cross-region generalization and is clearly labeled as such in all reporting. It does **not** substitute for evaluation on real cooperative data; that will replace this slice as soon as a pilot agreement is signed.

## 3. Holdout strategy

Three independent stratifications. Each is held out *separately*: the model must do well on every stratum, not on a marginal average that hides regime-specific failure.

### 3.1 Temporal hold-out

Train on `Z2_HIST`. Evaluate on `Z2_HOLDOUT_2026Q1` and `Z2_HOLDOUT_2026Q2`. No data after 2025-12-31 is ever shown to the model during weight-warmup.

### 3.2 Geographic hold-out

Train on `Z2_HIST`. Evaluate on `Z3_HIST` *with no zone-3 data in training*. Zones share an ISO (PJM) but are physically and demographically distinct. Cross-region generalization to a non-PJM ISO is approximated by `SYN_MIDWEST`.
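
The Zone 2 temporal boundaries from Sections 2 and 3.1 can be sketched as a small leakage guard. This is an illustrative Python sketch, not the LoadLens implementation: the helper names `slice_for` and `assert_no_leakage` are hypothetical, and it assumes hourly timestamps are naive `datetime` values in one consistent time zone.

```python
from datetime import datetime

# Zone 2 boundaries copied from the dataset table in Section 2.
TRAIN_START = datetime(2025, 4, 12)    # first day of Z2_HIST
HOLDOUT_START = datetime(2026, 1, 1)   # first hour after Z2_HIST ends
Q2_START = datetime(2026, 4, 1)        # first hour of Z2_HOLDOUT_2026Q2


def slice_for(ts: datetime) -> str:
    """Map a Zone 2 hourly timestamp to its slice name (hypothetical helper)."""
    if ts < TRAIN_START:
        raise ValueError(f"{ts} precedes Z2_HIST coverage")
    if ts < HOLDOUT_START:
        return "Z2_HIST"
    if ts < Q2_START:
        return "Z2_HOLDOUT_2026Q1"
    return "Z2_HOLDOUT_2026Q2"  # open-ended: the table lists coverage as "present"


def assert_no_leakage(training_timestamps) -> None:
    """Raise if any training timestamp falls after 2025-12-31."""
    leaked = [ts for ts in training_timestamps if ts >= HOLDOUT_START]
    if leaked:
        raise ValueError(f"{len(leaked)} hold-out hours found in training data")
```

A guard of this kind would run before weight-warmup, so that a mis-specified date filter fails loudly instead of quietly inflating hold-out accuracy.
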
### 3.3 Regime-stratified hold-out

Within each temporal hold-out, partition the test hours into named regimes and report metrics *per regime*:

- `BASELINE`: temperate hours (45°F ≤ T ≤ 75°F), weekday, non-holiday
- `HEAT_DOME`: hours where temperature exceeds the local 95th percentile
- `COLD_SNAP`: hours below the local 5th percentile
- `WEEKEND`: Saturday / Sunday
- `HOLIDAY`: US federal holidays (precomputed list)
- `RAMP`: first three hours of morning (06:00–09:00) and evening (16:00–19:00) ramps

A model that achieves 12% MAPE on `BASELINE` and 35% on `HEAT_DOME` is **not** a 12% MAPE model. The grant claim must report all six.

## 4. Baselines (non-negotiable)

Every model is compared against all four baselines. A claim of improvement that does not beat these is rejected.

| Code | Description | Why it matters |
|---|---|---|
| `B_PERSIST_168` | Persistence at lag 168: load(t) = load(t − 168h) | Industry-standard naive baseline. Beats it = real signal exists. |
| `B_SEASONAL_NAIVE` | Mean of same-hour-of-week over previous 4 weeks | Catches weekly seasonality without learning. |
| `B_HOUR_DOW_MEAN` | Rolling 28-day mean stratified by (hour, day-of-week) | Standard hour-of-week climatology baseline used by ISOs. |
| `B_LINEAR_TEMP` | OLS on `[hour-of-day, day-of-week, temp_f, temp_f²]` | Smallest model that uses weather. Beats it = ensemble earned its complexity. |

## 5. Metrics

Reported for every (model × holdout × regime) cell.

### 5.1 Point-forecast metrics

- **MAPE**: mean absolute percentage error. Headline metric.
- **sMAPE**: symmetric MAPE. Robust to small denominators.
- **RMSE**: penalizes large misses (matters for reserve margin).
- **MAE**: absolute error in MW.
- **Peak-hour MAPE**: MAPE restricted to daily peak-load hours only. Operationally the most consequential.

### 5.2 Probabilistic metrics

- **Pinball loss** at quantiles {0.1, 0.5, 0.9}. Lower is better.
- **Coverage**: fraction of actual values inside the stated `[q10, q90]` interval. Target: 0.80 ± 0.02. A model with "great MAPE" but 0.55 coverage is operationally broken.
- **Sharpness**: mean width of `[q10, q90]` interval. Narrower is better, *only conditional on coverage hitting target*.

### 5.3 Operational metrics

- **Reserve-margin error at P95**: error of the 95th-percentile forecast vs realized peak. Drives capacity-procurement decisions.
- **Direction accuracy on regime change**: fraction of CUSUM-flagged regime changes where the forecast moved in the correct direction within 3 hours of the flag.

## 6. Reporting

The output of `php artisan loadlens:eval` is two artifacts:

1. `storage/app/eval/{ISO_TIMESTAMP}.json`: the full results matrix. Pinned by SHA in the git log.
2. `storage/app/eval/latest.json`: symlink/copy used by the public methodology page.

The marketing page never displays a metric that is not present in `latest.json`. There are no hand-edited numbers in the public chrome.

## 7. Falsification conditions

The hypothesis is **rejected** for any cell where:

- The adaptive ensemble's MAPE is worse than `B_HOUR_DOW_MEAN`.
- Coverage of `[q10, q90]` deviates from 0.80 by more than 0.05 absolute.
- Peak-hour MAPE on `HEAT_DOME` or `COLD_SNAP` is worse than the temporal `BASELINE` cell of `B_PERSIST_168`.

Rejected cells are reported alongside accepted cells. We do not silently drop them.

## 8. Reproducibility

```bash
ssh ce-prod "cd /var/www/vhosts/champlinenterprises.com/loadlens.champlinenterprises.com \
  && /opt/plesk/php/8.4/bin/php artisan loadlens:eval --slice=all --pre-registered=v1"
```

The command refuses to run if the protocol version on disk does not match `--pre-registered=`. This prevents accidental drift between protocol and code.

## 9. What this protocol does NOT cover

- Long-horizon (>24h) forecasts. Day-ahead is in scope; week-ahead is future work.
- Sub-hourly granularity. 5-minute and 15-minute forecasts require AMI ingestion that is not yet wired.
- Probabilistic load-flow on the distribution feeder. Out of scope for v1.

These are deliberately deferred to keep the v1 claim falsifiable.
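
As a worked illustration of three Section 5 metrics (MAPE, pinball loss, coverage) and the first two falsification conditions in Section 7, the following Python sketch shows one way the per-cell arithmetic could be written. It is not the code behind `loadlens:eval`. All function names are hypothetical, inputs are assumed to be equal-length lists of MW values with no zero actuals, and the interval is treated as closed at both ends, which the protocol does not specify.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no zero actuals."""
    errors = (abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * sum(errors) / len(actual)


def pinball_loss(actual, quantile_forecast, q):
    """Mean pinball (quantile) loss at quantile level q, e.g. 0.1, 0.5, 0.9."""
    total = 0.0
    for a, f in zip(actual, quantile_forecast):
        diff = a - f
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(actual)


def coverage(actual, q10, q90):
    """Fraction of actuals inside the [q10, q90] interval (closed: an assumption)."""
    inside = sum(1 for a, lo, hi in zip(actual, q10, q90) if lo <= a <= hi)
    return inside / len(actual)


def cell_rejected(ensemble_mape, baseline_mape, cov, target=0.80, tol=0.05):
    """Apply the MAPE and coverage conditions from Section 7 to a single cell.

    baseline_mape is the B_HOUR_DOW_MEAN MAPE for the same cell. The peak-hour
    condition on HEAT_DOME / COLD_SNAP is not covered by this sketch.
    """
    return ensemble_mape > baseline_mape or abs(cov - target) > tol
```

For example, a cell with ensemble MAPE 10.0, baseline MAPE 12.0, and coverage 0.55 would be rejected on the coverage condition alone, which matches the protocol's point that a good MAPE does not rescue a badly calibrated interval.
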