Biostatistics

E-values acquire the missing pieces — but not the validation

Four preprints close the main gaps in anytime-valid inference for sequential trials. None has touched a clinical dataset, and at least one mechanism is unlikely to survive a data monitoring committee (DMC) charter review in its current form.

For the last five years, the standard objection to e-values in regulated trials has been that they were a testing framework in search of an estimation framework, an alpha-wealth framework that bled power, and a martingale apparatus that quietly fell apart the moment outcomes arrived asynchronously. Four arXiv preprints landed in a single window addressing exactly those three gaps, plus a fourth on tail-model diagnostics. None of them have been peer-reviewed, none have been run against a real clinical trial, and at least one contains a mechanism that will not survive a DMC charter review in its current form. But as a methodology cluster, this is the week the toolkit stopped being notional.

The four pieces, and what each actually fixes

The estimation gap is addressed by the ME-estimator, defined as the parameter value that minimises an e-statistic — a direct generalisation of MLE, which minimises negative log-likelihood. The authors prove consistency, almost-sure convergence rates, and asymptotic normality in the bounded-mean setting. They also note, with some care, that standard betting strategies may not be fully efficient, which is the polite version of saying the efficiency story relative to MLE at realistic Phase 3 sample sizes is still open.
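To make "the parameter value that minimises an e-statistic" concrete, here is a minimal sketch for a bounded mean. It is not the paper's construction: the two-sided betting e-statistic, the fixed bet size `lam`, the grid search, and every function name are assumptions chosen for illustration.

```python
import math
import random


def log_hedged_e_statistic(xs, m, lam=0.5):
    """Log of a two-sided ("hedged") betting e-statistic for H0: mean == m.

    Assumes every x lies in [0, 1] and 0 < lam < 1, so each factor
    1 +/- lam * (x - m) stays strictly positive.
    """
    up = sum(math.log1p(lam * (x - m)) for x in xs)     # bets that the mean exceeds m
    down = sum(math.log1p(-lam * (x - m)) for x in xs)  # bets that the mean is below m
    top = max(up, down)
    # log(0.5 * exp(up) + 0.5 * exp(down)), computed without overflow
    return top + math.log(0.5 * math.exp(up - top) + 0.5 * math.exp(down - top))


def me_estimate(xs, grid_size=1001, lam=0.5):
    """Grid-search the candidate mean with the smallest e-statistic (illustrative only)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda m: log_hedged_e_statistic(xs, m, lam))


rng = random.Random(0)
xs = [rng.betavariate(2, 5) for _ in range(500)]  # simulated bounded outcomes, not trial data
estimate = me_estimate(xs)
sample_mean = sum(xs) / len(xs)
print(f"ME-style estimate: {estimate:.3f}, sample mean: {sample_mean:.3f}")
```

Candidate means far from the data let one of the two bets grow exponentially, so the e-statistic is smallest near the value the data cannot bet against. With a fixed bet size the minimiser sits close to the sample mean; the open efficiency question concerns how the choice of betting strategy changes that picture.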

The power-leakage gap in online FDR is addressed by SCORE, which exploits the inequality I(y ≥ 1) ≤ y − (y−1)₊ to refund e-value mass above the rejection threshold back into the alpha-wealth budget. SCORE-LOND, SCORE-LORD, and SCORE-SAFFRON are proven to strictly dominate their originals in power while preserving finite-sample FDR. The same paper also introduces a retroactive alpha-wealth update — the most recent decision is used twice, once for reward/loss and once to refresh past wealth. The authors themselves flag this as warranting scrutiny, and they are right to: a procedure that revises prior wealth using later decisions is not obviously a pre-specifiable test in the sense a regulator means when they say “pre-specifiable”.
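The inequality is easy to check numerically. Here (y−1)₊ means max(y−1, 0), so the right-hand side equals min(y, 1). The sketch below compares it with the looser Markov-style bound I(y ≥ 1) ≤ y and prints the slack between the two. Reading y as an e-value rescaled by its rejection threshold is an assumption, the function names are hypothetical, and how SCORE routes the slack back into alpha-wealth is not reproduced.

```python
def positive_part(v):
    """(v)_+ = max(v, 0)."""
    return max(v, 0.0)


def rejection_indicator(y):
    """I(y >= 1): 1 if the rescaled e-value crosses the threshold, else 0."""
    return 1.0 if y >= 1.0 else 0.0


def markov_bound(y):
    """Looser bound I(y >= 1) <= y, valid for y >= 0."""
    return y


def tightened_bound(y):
    """Bound from the text: y - (y - 1)_+, which equals min(y, 1) for y >= 0."""
    return y - positive_part(y - 1.0)


def surplus(y):
    """Mass the looser bound charges beyond the tightened one: (y - 1)_+."""
    return markov_bound(y) - tightened_bound(y)


for y in [0.0, 0.4, 1.0, 2.5, 8.0]:
    # The chain indicator <= tightened <= Markov holds for every y >= 0.
    assert rejection_indicator(y) <= tightened_bound(y) <= markov_bound(y)
    print(f"y={y:4.1f}  indicator={rejection_indicator(y):.0f}  "
          f"tightened={tightened_bound(y):.1f}  surplus={surplus(y):.1f}")
```

The surplus is zero below the threshold and grows linearly above it, which is why the refund matters most for strongly significant hypotheses.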

The asynchronous-outcomes gap is addressed by a design-based confidence sequence paper that handles staggered entry and delayed outcomes simultaneously. The technical move is genuinely clever: inverse-probability-weighted (IPW) treatment-effect errors do not form a martingale under any single filtration, but arm-specific IPW errors are martingales under arm-specific event-time filtrations, and a union bound across arms gives a valid joint confidence sequence, one that can beat the classical design-based variance upper bound. The paper also characterises the AIPW augmentations that preserve the per-arm martingale property. The motivating examples, however, are A/B tests on tech platforms, not adaptive oncology trials, and the union bound is potentially conservative at Phase 3 interim timings; nobody has yet simulated it against an O’Brien-Fleming boundary.
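The union-bound step can be sketched on its own, with no IPW machinery. Assume each arm already has an interval valid at level 1 − α/2; the per-arm construction is the paper's contribution and is not reproduced here. Differencing the intervals then covers the treatment effect with probability at least 1 − α. The `Interval` type, the function names, and the interval values are all hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    lower: float
    upper: float


def split_alpha(alpha, n_arms=2):
    """Union bound: give each arm an equal share of the total error budget."""
    return alpha / n_arms


def effect_interval(treated, control):
    """Interval for (treated mean - control mean) from two per-arm intervals.

    If each arm's interval fails with probability at most alpha / 2, both hold
    together with probability at least 1 - alpha, and so does this interval.
    """
    return Interval(treated.lower - control.upper, treated.upper - control.lower)


per_arm_alpha = split_alpha(0.05)   # each arm runs at level 1 - 0.025
treated = Interval(0.42, 0.58)      # hypothetical per-arm interval at one look
control = Interval(0.30, 0.44)      # hypothetical per-arm interval at one look
effect = effect_interval(treated, control)
effect_width = effect.upper - effect.lower
print(f"per-arm alpha: {per_arm_alpha}, effect interval: "
      f"({effect.lower:.2f}, {effect.upper:.2f}), width: {effect_width:.2f}")
```

The effect interval's width is the sum of the two per-arm widths. That is the structural source of the conservatism flagged above, and it is the quantity a head-to-head simulation would need to track.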

The fourth piece, the EVA2025 winning entry on martingale testing for Peaks-Over-Thresholds extrapolation, is environmental statistics, but the framing — extrapolation validity as a sequential testing game — is directly portable to rare adverse-event tail modelling.

What biometrics teams should actually do with this

Treat this as a methodology-frontier cluster, not a submission-ready one. The combination of ME-estimation, SCORE-style online FDR, and design-based confidence sequences is, in principle, what you would need to run an anytime-valid adaptive platform trial with continuous safety monitoring and rolling biomarker hypotheses. In practice, no regulator has commented publicly on e-value stopping rules, the ICH E9(R1) estimand vocabulary has not been mapped onto the e-process world, and a fair head-to-head against Lan-DeMets at typical interim timings has not been run by anyone.

One terminology hazard worth pre-empting in internal discussions: “e-value” in this Vovk–Shafer anytime-valid sense is a different object from the VanderWeele–Ding sensitivity-analysis e-value, and the field is doing nothing in particular to disambiguate. Expect at least one cross-talk incident per methodology review committee.

Protocol read: The toolkit is now roughly complete on paper, which is the right moment to start reading, not the right moment to start specifying. The retroactive wealth update and the union-bound conservatism are the two seams most likely to tear under regulatory load.

What to do now:

Pre-read the SCORE paper specifically for the retroactive alpha-wealth mechanism; flag it before any internal proposal cites SCORE as “pre-specifiable”.
Commission an internal simulation comparing the design-based confidence sequence width against O’Brien-Fleming and Lan-DeMets at your typical interim timings before claiming efficiency.
In any methodology forum, disambiguate Vovk–Shafer e-values from VanderWeele–Ding e-values on first mention; the collision is now operational, not academic.
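For the simulation item, the classical comparator is straightforward to set up. The sketch below computes the Lan-DeMets O'Brien-Fleming-type spending function, α*(t) = 2 − 2Φ(z_{α/2}/√t), at a hypothetical set of information fractions. It covers only the alpha spent at each look; turning that into stopping boundaries requires numerical integration or a group-sequential package, and the confidence-sequence side of the comparison is not shown.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def obf_type_spending(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent by information fraction t in (0, 1]."""
    if not 0.0 < info_fraction <= 1.0:
        raise ValueError("info_fraction must lie in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / info_fraction ** 0.5))


interim_fractions = [0.25, 0.5, 0.75, 1.0]  # hypothetical interim timings
cumulative = [obf_type_spending(t) for t in interim_fractions]
# Alpha newly spent at each look is the increment in cumulative spending.
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for t, cum, inc in zip(interim_fractions, cumulative, incremental):
    print(f"t={t:.2f}  cumulative alpha={cum:.5f}  spent at this look={inc:.5f}")
```

Almost all of the error budget is held back for the final look, which is the behaviour any always-valid confidence sequence has to be compared against at the same interim timings.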
