Methodology · Pre-Registered Evaluation
We measure ourselves the way a reviewer would.
LoadLens uses a pre-registered evaluation protocol — the test sets, baselines, metrics, and stratifications below were declared before model selection and tuning. The numbers on this page come directly from storage/app/eval/latest.json, written by php artisan loadlens:eval. There are no hand-edited numbers in the public chrome.
TEMPORAL PRIMARY
PJM Region
2,733 holdout hours · calibration window 2026-01-18 -> 2026-02-01 · weather coverage 0/82 rows
| Model | n | MAPE (%) | RMSE (MW) | MAE (MW) | Coverage [q10, q90] |
|---|---|---|---|---|---|
| B_PERSIST_168 (baseline) | 2,733 | 7.84 | 10,364 | 7,466 | — |
| B_SEASONAL_NAIVE (baseline) | 2,733 | 9.19 | 11,481 | 8,632 | — |
| B_HOUR_DOW_MEAN (baseline) | 2,733 | 9.17 | 11,477 | 8,615 | — |
| B_LINEAR_TEMP (baseline) | 0 | — | — | — | — |
| ENSEMBLE_LIVE | 2,733 | 18.55 | 22,347 | 17,651 | 0.094 |
| ENSEMBLE_ADVANCED | 2,733 | 14.63 | 18,197 | 13,918 | 0.101 |
Coverage target: 0.80 (the [q10, q90] interval should contain 80% of realized loads). Calibrated via split-conformal prediction on the warmup-tail residuals.
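The columns in the table above can be reproduced from paired actual and forecast series. A minimal sketch follows, assuming hourly loads in MW held in plain Python lists and prediction intervals given as separate lower and upper lists; the function names are hypothetical and this is not the LoadLens implementation.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumes no zero loads)."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root-mean-square error, in the units of the input (MW here)."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mae(actual, forecast):
    """Mean absolute error, in the units of the input (MW here)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def interval_coverage(actual, lower, upper):
    """Share of realized loads inside [lower, upper]; compare with the 0.80 target."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)
```

Under this definition, a coverage of 0.094 means the [q10, q90] interval contained the realized load in fewer than one hour in ten, against a target of eight in ten.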
MAPE by regime
| Regime | B_PERSIST_168 | B_HOUR_DOW_MEAN | ENSEMBLE_LIVE | ENSEMBLE_ADVANCED |
|---|---|---|---|---|
| BASELINE | 8.01% (n=1896) | 8.64% (n=1896) | 18.14% (n=1896) | 14.24% (n=1896) |
| HEAT_DOME | — | — | — | — |
| COLD_SNAP | — | — | — | — |
| WEEKEND | 6.91% (n=813) | 10.00% (n=813) | 19.58% (n=813) | 15.59% (n=813) |
| HOLIDAY | 25.28% (n=24) | 22.23% (n=24) | 16.13% (n=24) | 12.99% (n=24) |
| RAMP | 7.62% (n=684) | 9.03% (n=684) | 7.25% (n=684) | 6.60% (n=684) |
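The regime table is MAPE computed separately over each tagged subset of holdout hours. The BASELINE, WEEKEND, and HOLIDAY counts (1,896 + 813 + 24 = 2,733) sum to the full holdout, while RAMP's 684 hours appear to overlap them; the sketch below assumes exactly that, with each hour carrying a set of regime tags. The function name and tag layout are hypothetical.

```python
from collections import defaultdict


def mape_by_regime(actual, forecast, labels, regimes):
    """Per-regime MAPE (percent) and hour count.

    labels[i] is the set of regime tags for hour i. An hour may carry more than
    one tag (e.g. RAMP on top of WEEKEND), so it counts toward each of them.
    Regimes with no tagged hours return (None, 0), rendered as a dash.
    """
    errors = defaultdict(list)
    for a, f, tags in zip(actual, forecast, labels):
        for tag in tags:
            errors[tag].append(abs(a - f) / abs(a))
    result = {}
    for regime in regimes:
        errs = errors.get(regime, [])
        result[regime] = (100.0 * sum(errs) / len(errs), len(errs)) if errs else (None, 0)
    return result
```

With no temperature data, nothing can be tagged HEAT_DOME or COLD_SNAP, which is how those rows end up empty in the table.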
What these numbers mean
An honest read
On PJM regional load, simple baselines such as persistence-168 (the load at the same hour one week earlier) achieve 7.84% MAPE on this holdout. PJM aggregates millions of customers across thirteen states; at that scale, weekly periodicity is overwhelmingly stable, and any model has to clear a high bar to add value.
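Persistence-168 fits in a few lines, which is part of why it is such a demanding reference. The sketch below also includes an hour-of-week mean, on the assumption that B_HOUR_DOW_MEAN averages training loads by hour of day and day of week; that reading of the name, and the assumption that index 0 of the training series falls at the start of a week, are ours rather than taken from the LoadLens code.

```python
from collections import defaultdict

HOURS_PER_WEEK = 168


def persistence_168(loads, t):
    """Forecast hour t with the load at the same hour one week earlier."""
    if t < HOURS_PER_WEEK:
        raise ValueError("need at least one week of history before hour t")
    return loads[t - HOURS_PER_WEEK]


def hour_of_week_means(train_loads):
    """Mean training load per hour-of-week slot (0..167).

    Assumes train_loads[0] falls at the start of a week, so slot = index % 168.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, load in enumerate(train_loads):
        slot = i % HOURS_PER_WEEK
        sums[slot] += load
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sums}


def hour_of_week_forecast(means, t):
    """Forecast hour t (same indexing as the training series) with its slot mean."""
    return means[t % HOURS_PER_WEEK]
```

Neither baseline has any parameters to tune, so they cannot overfit the calibration window; on a signal as periodic as PJM's, that is an advantage.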
Our adaptive ensembles currently sit at 14.63% (ENSEMBLE_ADVANCED) and 18.55% (ENSEMBLE_LIVE) MAPE on the same data. That gap is real, and we're not papering over it. The ensembles were tuned on smaller, noisier load profiles: exactly the rural cooperative / distribution-level signals where weekly persistence breaks down. The pre-registered eval exposes that the demonstration data is easy for simple baselines and misaligned with the production target.
The probabilistic story is currently mixed. Earlier short-window evals showed the advanced engine's split-conformal intervals well calibrated near the 0.80 nominal target. The longer Q1–Q2 2026 holdout above shows severe under-coverage (0.094 and 0.101 against a 0.80 target) with a systematic asymmetry (pinball loss heavily skewed at q10): clear evidence of distribution shift between the 14-day calibration window and the 90+ day holdout. Static split-conformal can't absorb that, and we're not going to pretend it does. v2 will replace it with rolling / online conformal that re-fits as the operating regime drifts; that is the directly addressable next step.
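To see why a static calibration fails under drift, compare it with a rolling re-fit on synthetic data. A minimal sketch, assuming symmetric intervals of forecast ± a split-conformal half-width computed from absolute residuals; the rolling variant simply re-estimates that half-width from the most recent `window` residuals before scoring each new hour. This is one simple form of online recalibration, not necessarily what v2 will ship, and all names are hypothetical.

```python
import math


def split_conformal_halfwidth(residuals, alpha=0.2):
    """ceil((n + 1)(1 - alpha))-th smallest |residual|; alpha=0.2 targets 80% coverage."""
    scores = sorted(abs(r) for r in residuals)
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    return math.inf if k > len(scores) else scores[k - 1]


def static_conformal_coverage(actual, forecast, window, alpha=0.2):
    """Fit the half-width once on the first `window` hours, then hold it fixed."""
    cal = [actual[i] - forecast[i] for i in range(window)]
    hw = split_conformal_halfwidth(cal, alpha)
    held = range(window, len(actual))
    hits = sum(1 for t in held if forecast[t] - hw <= actual[t] <= forecast[t] + hw)
    return hits / len(held)


def rolling_conformal_coverage(actual, forecast, window, alpha=0.2):
    """Walk forward hour by hour, re-fitting on the trailing `window` residuals."""
    hits = 0
    scored = 0
    for t in range(window, len(actual)):
        recent = [actual[i] - forecast[i] for i in range(t - window, t)]
        hw = split_conformal_halfwidth(recent, alpha)
        if forecast[t] - hw <= actual[t] <= forecast[t] + hw:
            hits += 1
        scored += 1
    return hits / scored


# Synthetic drift: small residuals during calibration, then a level jump.
actual = [1.0, -1.0] * 5 + [10.0] * 30
forecast = [0.0] * 40
print(static_conformal_coverage(actual, forecast, window=10))
print(rolling_conformal_coverage(actual, forecast, window=10))
```

On this series the static half-width never widens, so coverage collapses after the jump, while the rolling version misses two hours and then recovers. That mirrors, in a very simplified form, the under-coverage described above.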
The grant claim is not "we beat industry baselines on transmission-level data." It is "we ship a falsifiable eval, surface our own failure modes openly (including this one), publish a reproducible pipeline, and will demonstrate the adaptive advantage on real cooperative AMI data once a pilot is signed." Every cell in the evaluation tables above comes from a single command, and no number on this page has been hand-edited.
Regime detection · receipts
CUSUM change-points in real history
Sweep of the trailing 209 days of PJM Region load history at threshold 4. Each row is a statistically significant change-point in the load signal. The "before / after" columns score 48-hour windows on either side of the detection so a reviewer can see what the regime change actually changed.
| Detection | CUSUM | Δ load | Kind | MAPE before | MAPE after |
|---|---|---|---|---|---|
| 2026-05-20 02:00 | 178.41 | +32.9% | peak_demand | 35.65% | 32.75% |
| 2026-02-12 07:00 | 95.73 | -13.6% | reduced | 35.98% | 28.48% |
| 2026-03-19 10:00 | 90.92 | +17.3% | elevated | 32.02% | 27.57% |
| 2026-02-16 07:00 | 87.30 | -16.8% | reduced | 28.48% | 29.18% |
| 2025-12-03 04:00 | 82.13 | +14.3% | elevated | 43.26% | 37.71% |
Generated 2026-05-30 04:15 GMT+0000 by php artisan loadlens:find-regime over 5,035 hours of history.
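The sweep above can be approximated with a textbook two-sided tabular CUSUM. A minimal sketch, assuming the statistic runs on loads standardized against a short reference window and that the detector re-baselines on the new level after each alarm; `threshold` plays the role of the page's threshold 4, but the drift allowance, reference length, and restart rule are our assumptions, and the sketch is not expected to reproduce the CUSUM column's values. The helper `score_changepoint` (hypothetical) mirrors the before / after columns by scoring 48-hour windows on either side of a detection, and reports Δ load as the percent change in mean load.

```python
import statistics


def cusum_changepoints(series, threshold=4.0, drift=0.5, ref_len=48):
    """Two-sided tabular CUSUM with re-baselining after each alarm.

    Returns (index, direction) pairs: +1 for an upward shift, -1 for downward.
    """
    alarms = []
    start = 0
    while start + ref_len < len(series):
        ref = series[start:start + ref_len]
        mu = statistics.fmean(ref)
        sigma = statistics.pstdev(ref) or 1.0  # guard against a flat reference window
        s_hi = s_lo = 0.0
        alarm = None
        for i in range(start + ref_len, len(series)):
            z = (series[i] - mu) / sigma
            s_hi = max(0.0, s_hi + z - drift)  # evidence of an upward shift
            s_lo = max(0.0, s_lo - z - drift)  # evidence of a downward shift
            if s_hi > threshold or s_lo > threshold:
                alarm = (i, 1 if s_hi > threshold else -1)
                break
        if alarm is None:
            break
        alarms.append(alarm)
        start = alarm[0]  # re-estimate the reference level from the new regime
    return alarms


def window_mape(actual, forecast, lo, hi):
    """MAPE (percent) over actual[lo:hi] versus forecast[lo:hi]."""
    pairs = list(zip(actual[lo:hi], forecast[lo:hi]))
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def score_changepoint(actual, forecast, idx, hours=48):
    """Delta load and MAPE in the `hours` before and after a detection at idx."""
    lo, hi = max(0, idx - hours), min(len(actual), idx + hours)
    before_mean = statistics.fmean(actual[lo:idx])
    after_mean = statistics.fmean(actual[idx:hi])
    return {
        "delta_load_pct": 100.0 * (after_mean - before_mean) / before_mean,
        "mape_before": window_mape(actual, forecast, lo, idx),
        "mape_after": window_mape(actual, forecast, idx, hi),
    }
```

Re-baselining after each alarm keeps a single persistent level shift from triggering a fresh detection every few hours, at the cost of ignoring the `ref_len` hours right after each change.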
Known issues · roadmap
What is currently broken or missing
Listed here so a reviewer can audit them before we publish a Phase I claim.
- [1] NOAA weather ingestion is not populating temperature_f: all rows are currently NULL, which collapses the weather-aware regression baseline (B_LINEAR_TEMP) to zero predictions and disables HEAT_DOME / COLD_SNAP regime stratification. A fix is in flight; the eval reports show weather coverage explicitly so this stays visible.
- [2] Demonstration data is transmission-scale (PJM Interconnection), not cooperative-scale. Ensembles are tuned for the noisier distribution-level signal where weekly persistence is weaker. Cooperative-scale evaluation requires a pilot AMI feed (in active outreach).
- [3] The cross-region (geographic) holdout uses the same ISO until a non-PJM dataset is wired in. The eval slot is in place; the data is not.
- [4] The operational dollar-impact metric is not yet computed. $/MWh saved under simulated dispatch is the headline value claim for the SBIR application; until it exists, the calibrated probabilistic-forecast story serves as the placeholder.
- [5] Static split-conformal intervals don't survive the multi-month distribution shift between the calibration window and the holdout window. This is visible as under-coverage in the table above and as asymmetric pinball loss skewed at q10. v2 will replace static calibration with a rolling / online conformal layer that re-fits as the operating regime drifts.
Reproducibility
Run it yourself
```shell
ssh ce-prod "cd /var/www/vhosts/champlinenterprises.com/loadlens.champlinenterprises.com \
  && /opt/plesk/php/8.4/bin/php artisan loadlens:eval --pre-registered=v1"
```
The command refuses to run if the on-disk protocol version does not match the value passed to --pre-registered. This page renders whatever storage/app/eval/latest.json contains.
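The refuse-to-run guard amounts to comparing two version strings before any scoring happens. A minimal Python sketch of that check, assuming (hypothetically) that the protocol is stored as JSON with a top-level "version" key; the real command is a Laravel artisan command, and its file layout, names, and error handling may differ.

```python
import json
from pathlib import Path


class ProtocolDriftError(RuntimeError):
    """Raised when the on-disk protocol does not match the requested version."""


def check_protocol(protocol_json: str, requested_version: str) -> str:
    """Return the version if it matches --pre-registered, otherwise refuse to run."""
    on_disk = json.loads(protocol_json)["version"]
    if on_disk != requested_version:
        raise ProtocolDriftError(
            f"on-disk protocol is {on_disk!r} but --pre-registered={requested_version!r}"
        )
    return on_disk


def check_protocol_file(path: str, requested_version: str) -> str:
    """Same check, reading the hypothetical protocol file from disk."""
    return check_protocol(Path(path).read_text(encoding="utf-8"), requested_version)
```

Pinning the version this way presumably means any change to test sets, metrics, or stratifications has to ship as a new protocol version rather than silently rewriting v1.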