Methodology · Pre-Registered Evaluation

We measure ourselves the way a reviewer would.

LoadLens uses a pre-registered evaluation protocol: the test sets, baselines, metrics, and stratifications below were declared before model selection and tuning. The numbers on this page come directly from storage/app/eval/latest.json, written by php artisan loadlens:eval. No number shown in the public UI is hand-edited.

EVAL_PROTOCOL.md → · Last run: 2026-07-12 03:30 GMT+0000 · Protocol: v1 · Commit: 8f12014

TEMPORAL PRIMARY

PJM Region

3,909 holdout hours · calibration window 2026-01-18 -> 2026-02-01 · weather coverage 0/82 rows

Model	n	MAPE %	RMSE MW	MAE MW	Cov [q10,q90]
`B_PERSIST_168` baseline	3,909	9.21	12,790	9,321	—
`B_SEASONAL_NAIVE` baseline	3,909	9.32	12,702	9,309	—
`B_HOUR_DOW_MEAN` baseline	3,909	9.30	12,699	9,297	—
`B_LINEAR_TEMP` baseline	0	—	—	—	—
`ENSEMBLE_LIVE`	3,909	20.23	27,140	20,591	0.163
`ENSEMBLE_ADVANCED`	3,909	16.27	21,922	16,497	0.148

Coverage target: 0.80 (the [q10, q90] interval should contain 80% of realized loads). Calibrated via split-conformal prediction on the warmup-tail residuals.
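
To make the coverage column concrete, here is a minimal Python sketch of one common split-conformal variant (a symmetric band built from absolute calibration residuals) together with the empirical coverage check. This page does not say which variant LoadLens uses, so the function names, the symmetric-band choice, and the use of Python are all assumptions for illustration; this is not the code behind php artisan loadlens:eval.

```python
import math


def conformal_half_width(calib_residuals, alpha=0.20):
    """Half-width of a (1 - alpha) band from held-out calibration residuals.

    Takes the ceil((n + 1) * (1 - alpha))-th smallest absolute residual,
    the usual finite-sample-corrected quantile for split conformal.
    """
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one calibration residual")
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the corrected quantile does not exist,
    # so the only valid band is unbounded.
    return scores[k - 1] if k <= n else math.inf


def empirical_coverage(actuals, lowers, uppers):
    """Fraction of realized values that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actuals, lowers, uppers))
    return hits / len(actuals)
```

In this sketch the interval for a point forecast f would be [f - w, f + w], where w is the half-width computed once on the calibration window. A coverage value far below 0.80 on the holdout means the calibration residuals were much smaller than the holdout residuals.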

MAPE by regime

Regime	`B_PERSIST_168`	`B_HOUR_DOW_MEAN`	`ENSEMBLE_LIVE`	`ENSEMBLE_ADVANCED`
BASELINE	9.30% (n=2688)	9.05% (n=2688)	19.88% (n=2688)	15.85% (n=2688)
HEAT_DOME	—	—	—	—
COLD_SNAP	—	—	—	—
WEEKEND	8.09% (n=1149)	9.81% (n=1149)	21.13% (n=1149)	17.22% (n=1149)
HOLIDAY	22.93% (n=96)	12.89% (n=96)	22.77% (n=96)	18.50% (n=96)
RAMP	9.10% (n=978)	9.20% (n=978)	8.94% (n=978)	8.07% (n=978)
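
The per-regime counts above sum to more than 3,909, which suggests a single hour can carry more than one regime tag. The sketch below shows how a stratified MAPE table of that shape could be computed under that assumption. The row layout (actual, forecast, set of tags) and the function names are hypothetical, not taken from the LoadLens codebase.

```python
from collections import defaultdict


def mape(actuals, forecasts):
    """Mean absolute percentage error; hours with zero actual load are skipped."""
    pct = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts) if a != 0]
    return 100.0 * sum(pct) / len(pct) if pct else None


def mape_by_regime(rows):
    """rows: iterable of (actual, forecast, regime_tags).

    An hour counts toward every regime it is tagged with, so the
    per-regime n values may sum to more than the number of hours.
    Returns {tag: (mape_percent, n)}.
    """
    buckets = defaultdict(lambda: ([], []))
    for actual, forecast, tags in rows:
        for tag in tags:
            buckets[tag][0].append(actual)
            buckets[tag][1].append(forecast)
    return {tag: (mape(a, f), len(a)) for tag, (a, f) in buckets.items()}
```

A regime with no tagged hours simply never appears in the result, which corresponds to the dashes in the HEAT_DOME and COLD_SNAP rows.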

What these numbers mean

An honest read

On PJM regional load, simple naive baselines like persistence-168 (load from the same hour one week ago) achieve roughly 9% MAPE (9.21% in the table above). PJM aggregates millions of customers across thirteen states; at that scale weekly periodicity is overwhelmingly stable, and any model has to clear a high bar to add value.
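
For readers who want to see how little machinery the persistence baseline needs, here is a minimal Python sketch that backtests "same hour one week ago" on an hourly series and reports the three point-forecast metrics used in the table. The function name and the returned dictionary keys are illustrative assumptions; the actual `B_PERSIST_168` implementation is not shown on this page.

```python
import math


def persistence_backtest(history, lag=168):
    """Score a lag-`lag` persistence forecast on an hourly load series.

    The forecast for hour t is simply the load observed at hour t - lag.
    """
    pairs = [(history[t], history[t - lag]) for t in range(lag, len(history))]
    if not pairs:
        raise ValueError("history must be longer than the lag")
    errors = [actual - forecast for actual, forecast in pairs]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE skips hours with zero actual load to avoid dividing by zero.
    pct = [abs(a - f) / abs(a) for a, f in pairs if a != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"n": n, "mape_pct": mape, "rmse_mw": rmse, "mae_mw": mae}
```

On a series whose weekly shape repeats exactly, this baseline scores zero error; every bit of its error comes from week-over-week change, which is why it is hard to beat on a large, stable aggregate.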

Our adaptive ensembles currently sit at 16–20% MAPE on the same data (16.27% for `ENSEMBLE_ADVANCED`, 20.23% for `ENSEMBLE_LIVE`). That gap is real and we're not papering over it. The ensembles were tuned on smaller, noisier load profiles: exactly the rural cooperative / distribution-level signals where weekly persistence breaks down. The pre-registered eval here exposes that the demonstration data is too easy for baselines and too misaligned with the production target.

The probabilistic story is currently mixed. Earlier short-window evals showed the advanced engine's split-conformal intervals well-calibrated near the 0.80 nominal target. The longer Q1–Q2 2026 holdout above shows severe under-coverage (0.148 to 0.163 against the 0.80 target) with a systematic asymmetry (pinball loss heavily skewed at q10). That is clear evidence of distribution shift between the 14-day calibration window and the 3,909-hour (roughly 160-day) holdout. Static split-conformal can't absorb that, and we're not going to pretend it does. v2 will replace it with rolling / online conformal that re-fits as the operating regime drifts; that's the directly-addressable next step.
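
The asymmetry mentioned above is measured with pinball (quantile) loss. As a hedged illustration, the standard definition can be sketched in a few lines of Python; the function names are ours, and the per-quantile losses themselves are not tabulated on this page.

```python
def pinball_loss(actual, predicted_quantile, q):
    """Pinball loss for one observation at quantile level q (0 < q < 1).

    Actuals above the predicted quantile are weighted by q,
    actuals below it by (1 - q).
    """
    diff = actual - predicted_quantile
    return q * diff if diff >= 0 else (q - 1) * diff


def mean_pinball(actuals, predicted_quantiles, q):
    """Average pinball loss over a series of hourly predictions."""
    losses = [pinball_loss(y, p, q) for y, p in zip(actuals, predicted_quantiles)]
    return sum(losses) / len(losses)
```

At q = 0.10, a realized load that falls below the predicted q10 costs nine times as much per MW as one that lands above it. Comparing the mean loss at q10 with the mean loss at q90 therefore shows which side of the interval is failing.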

The grant claim is not "we beat industry baselines on transmission-level data." It is "we ship a falsifiable eval, surface our own failure modes openly (including this one), publish a reproducible pipeline, and will demonstrate the adaptive advantage on real cooperative AMI data once a pilot is signed." Every cell on this page is from a single command; no number on it has been hand-edited.

Regime detection · receipts

CUSUM change-points in real history

Sweep of the trailing 209 days of PJM Region load history at threshold 4. Each row is a statistically significant change-point in the load signal. The "before / after" columns score 48-hour windows on either side of the detection so a reviewer can see what the regime change actually changed.

Detection	CUSUM	Δ load	Kind	MAPE before	MAPE after
2026-05-20 02:00	178.41	+32.9%	peak_demand	35.65%	32.75%
2026-07-02 23:00	95.85	+34.5%	peak_demand	37.39%	38.72%
2026-02-12 07:00	95.73	-13.6%	reduced	35.98%	28.48%
2026-03-19 10:00	90.92	+17.3%	elevated	32.02%	27.57%
2026-02-16 07:00	87.30	-16.8%	reduced	28.48%	29.18%

Generated 2026-07-17 04:15 GMT+0000 by php artisan loadlens:find-regime over 5,035 hours of history.
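
For reviewers unfamiliar with CUSUM, the following Python sketch shows a textbook two-sided, standardized CUSUM that raises an alarm when either cumulative sum exceeds a threshold. This page states only the threshold (4); the baseline window, the slack value of 0.5, and the standardization are assumptions, and the sketch is not the logic of php artisan loadlens:find-regime.

```python
import statistics


def first_cusum_alarm(series, baseline_len, threshold=4.0, slack=0.5):
    """Return (index, statistic, direction) of the first alarm, or None.

    The first `baseline_len` points define the reference mean and standard
    deviation; later points are standardized against them. `slack` is the
    allowance, in standard deviations, subtracted at every step so that
    ordinary noise does not accumulate.
    """
    baseline = series[:baseline_len]
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline window has zero variance")
    pos = neg = 0.0
    for t in range(baseline_len, len(series)):
        z = (series[t] - mu) / sigma
        pos = max(0.0, pos + z - slack)   # accumulates upward drift
        neg = max(0.0, neg - z - slack)   # accumulates downward drift
        if pos > threshold:
            return t, pos, "up"
        if neg > threshold:
            return t, neg, "down"
    return None
```

A production sweep would typically re-estimate the baseline after each alarm and keep scanning, which is how a single pass can yield several detections like the rows above.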

Known issues · roadmap

What is currently broken or missing

Listed here so a reviewer can audit them before we publish a Phase I claim.

[1] NOAA weather ingestion is not populating temperature_f — all rows currently NULL, which collapses the weather-aware regression baseline to zero predictions and disables HEAT_DOME / COLD_SNAP regime stratification. Fix is in flight; the eval reports show weather coverage explicitly so this stays visible.
[2] Demonstration data is transmission-scale (PJM Interconnection), not cooperative-scale. Ensembles are tuned for the noisier distribution-level signal where weekly persistence is weaker. Cooperative-scale evaluation requires a pilot AMI feed (in active outreach).
[3] Cross-region (geographic) holdout uses the same ISO until a non-PJM dataset is wired in. The eval slot is in place; the data is not.
[4] Operational dollar-impact metric not yet computed. $/MWh saved under simulated dispatch is the headline value claim for the SBIR application; until it is computed, the calibrated probabilistic-forecast story stands in as a placeholder.
[5] Static split-conformal intervals don't survive the multi-month distribution shift between calibration window and holdout window. This is visible as under-coverage in the Cov [q10,q90] column above and as asymmetric pinball loss, which is not tabulated on this page. v2 will replace static calibration with a rolling / online conformal layer that re-fits as the operating regime drifts.
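
To illustrate the difference item [5] describes, here is a minimal Python sketch that compares a static conformal half-width (fit once on the first window) with a rolling one (re-fit on the trailing window at every step). It is one simple way to realize "rolling" calibration, not the v2 design; the window handling and function names are assumptions.

```python
import math
from collections import deque


def conformal_half_width(scores, alpha):
    """ceil((n + 1) * (1 - alpha))-th smallest score, or inf if n is too small."""
    ordered = sorted(scores)
    k = math.ceil((len(ordered) + 1) * (1 - alpha))
    return ordered[k - 1] if k <= len(ordered) else math.inf


def coverage(abs_errors, window, alpha=0.20, rolling=True):
    """Empirical coverage on the points after the first `window` errors.

    rolling=False: the half-width is fixed from the first window.
    rolling=True:  the half-width is recomputed from the trailing window
                   before each new point is scored.
    """
    recent = deque(abs_errors[:window], maxlen=window)
    static_half = conformal_half_width(recent, alpha)
    scored = abs_errors[window:]
    hits = 0
    for err in scored:
        half = conformal_half_width(recent, alpha) if rolling else static_half
        hits += err <= half
        recent.append(err)  # the oldest error drops out automatically
    return hits / len(scored)
```

When error magnitudes jump partway through the series, the static band keeps missing while the rolling band widens after roughly one window of new errors. The trade-off is that a trailing window gives up the exchangeability assumption behind the split-conformal guarantee, so rolling coverage has to be verified empirically.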

Reproducibility

Run it yourself

ssh ce-prod "cd /var/www/vhosts/champlinenterprises.com/loadlens.champlinenterprises.com \
  && /opt/plesk/php/8.4/bin/php artisan loadlens:eval --pre-registered=v1"

The command refuses to run if the on-disk protocol version drifts from --pre-registered=. This page renders whatever storage/app/eval/latest.json contains.
