The Biometrics Weekly

CRT inference: three papers, one estimand problem

The cluster-randomized headline number is doing less work than it looks. May–July 2026 brought a closed-form bias proof, a reporting indictment, and a workaround that sidesteps the ICC entirely.

  • Cluster-randomized & hierarchical trial methods
  • Regulatory
  • Methodology Frontier

A closed-form derivation in Statistics in Medicine this May established that the standard “constant treatment effect” estimator in stepped-wedge cluster-randomized trials (SW-CRTs) with multiple interventions converges, under exchangeable within-cluster correlation, to a design-dependent weighted average of exposure-time-specific effects, with weights the authors note “may not correspond to a natural estimand of interest.” That is not a heuristic warning about model misspecification. It is a probability limit. Paired with a new JCE methods paper on effective sample size (ESS) and a May 2026 Editors’ Choice commentary by Granholm on under-reporting of intracluster correlation coefficients (ICCs), the CRT literature has spent a quarter quietly reframing itself: clustering is no longer just a variance problem, it is an estimand-and-reporting problem.

The estimand the SAP didn’t name

Take the SW-CRT result first, because it is the sharpest. When effects vary across exposure time (a near-universal feature of pragmatic and implementation trials, where clusters learn the intervention), the constant-effect estimator targets a weighted average whose weights are a function of the design matrix, not of clinical interest. Simulations across increasing, decreasing, and non-monotone effect curves show biased estimation and poor coverage of the exposure-time-averaged effect. The authors evaluate two remedies: a time-varying fixed-effect model (a separate effect at each exposure time) and a random treatment effect model (deviations around an overall mean). Applied to the PONDER trial of EHR-based interventions, all three models (the constant-effect default and the two remedies) concur on a null but differ materially in precision. Model choice is an estimand choice even when the qualitative read is invariant, which is precisely the situation in which steering committees stop paying attention.
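To see how design-dependent weights can arise, the sketch below works through a deliberately simplified case. It assumes one cluster per sequence, a complete balanced layout, and fixed cluster and period effects fitted by least squares. That is a stand-in for the mixed model with exchangeable correlation that the paper analyses, not a reproduction of it. The function names and the learning-curve values are hypothetical.

```python
from fractions import Fraction


def exposure_time(cluster, period):
    """Periods spent on treatment; cluster i crosses over after period i (0-indexed)."""
    return max(period - cluster, 0)


def constant_effect_weights(n_seq):
    """Weight the constant-effect estimate places on each exposure time 1..n_seq.

    Assumes one cluster per sequence, n_seq + 1 periods, and two-way fixed effects.
    """
    n_per = n_seq + 1
    x = [[Fraction(int(exposure_time(i, j) > 0)) for j in range(n_per)]
         for i in range(n_seq)]
    row_mean = [sum(row) / n_per for row in x]
    col_mean = [sum(x[i][j] for i in range(n_seq)) / n_seq for j in range(n_per)]
    grand_mean = sum(row_mean) / n_seq

    # Frisch-Waugh step: residualise the treatment indicator on cluster and
    # period effects (two-way demeaning is exact for a balanced layout).
    resid = [[x[i][j] - row_mean[i] - col_mean[j] + grand_mean
              for j in range(n_per)] for i in range(n_seq)]
    denom = sum(v * v for row in resid for v in row)

    # The estimate is sum(resid * y) / denom, so each treated cell contributes
    # resid / denom to the weight on its own exposure time.
    weights = [Fraction(0)] * n_seq
    for i in range(n_seq):
        for j in range(n_per):
            e = exposure_time(i, j)
            if e > 0:
                weights[e - 1] += resid[i][j] / denom
    return weights


def constant_effect_estimate(effects):
    """Noise-free value of the constant-effect estimate for a given effect curve."""
    weights = constant_effect_weights(len(effects))
    return sum(w * d for w, d in zip(weights, effects))


if __name__ == "__main__":
    curve = [0.0, 0.5, 1.0, 1.0]  # hypothetical learning curve over 4 exposure periods
    print("weights:", [str(w) for w in constant_effect_weights(len(curve))])
    print("exposure-time average:", sum(curve) / len(curve))
    print("constant-effect target:", float(constant_effect_estimate(curve)))
```

In the two-sequence version of this toy design the weights work out to 3/2 on the first exposure period and -1/2 on the second, so an effect that appears only in the second period (true exposure-time average 0.5) is estimated as -0.5. The weights always sum to one, which is why the problem stays invisible when the effect really is constant.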

The ICH E9(R1) framing is uncomfortable. If the SAP’s primary estimator targets a design-dependent weighted average, the estimand attribute “population-level summary” is, strictly, an average taken over whatever weights the design happened to produce. That is rarely what protocols claim. For pragmatic SW-CRTs heading into HTA or label-supporting territory, pre-specifying the exposure-time-averaged effect, along with the model that targets it, is the defensible move.

ESS without the ICC you never had

The JCE methods paper attacks an adjacent problem. Conventional ESS calculations require an ICC, and ICCs are, in the field’s polite phrasing, frequently unavailable. The authors propose back-calculating ESS directly from the standard error produced by a clustering-adjusted analysis, anchoring information yield in the analysis output rather than a prior nobody trusts. For meta-analysts performing inverse-variance weighting across heterogeneous CRTs, this is the more honest currency. Two cautions the paper does not fully resolve: behaviour with small cluster counts (where adjusted SEs are themselves unstable and Kenward–Roger versus Satterthwaite corrections move inference materially), and behaviour under correlation misspecification (exchangeable assumed, nested-exchangeable true) — the SE-derived ESS inherits whatever bias lives in the SE it was computed from.
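The contrast between the two routes can be sketched in a few lines. The exact formula used in the paper is not reproduced in this piece, so the SE-based function below is an assumption: it takes the squared ratio of the clustering-adjusted SE to a naive (independence) SE as the implied design effect. The function names and the numbers in the demo are hypothetical.

```python
import math


def ess_from_icc(n_total, cluster_size, icc):
    """Conventional route: divide N by the design effect 1 + (m - 1) * ICC.

    Assumes a common cluster size m.
    """
    design_effect = 1.0 + (cluster_size - 1) * icc
    return n_total / design_effect


def ess_from_se(n_total, se_adjusted, se_naive):
    """SE-based route (assumed form): implied design effect = (SE_adj / SE_naive)^2."""
    if se_adjusted <= 0 or se_naive <= 0:
        raise ValueError("standard errors must be positive")
    implied_design_effect = (se_adjusted / se_naive) ** 2
    return n_total / implied_design_effect


if __name__ == "__main__":
    n_total, cluster_size, icc = 1000, 50, 0.05  # hypothetical trial
    se_naive = 0.10                              # hypothetical unadjusted SE
    # Build an adjusted SE that is consistent with the ICC above, so the two
    # routes can be compared on the same footing.
    se_adjusted = se_naive * math.sqrt(1 + (cluster_size - 1) * icc)
    print("ESS via ICC:", ess_from_icc(n_total, cluster_size, icc))
    print("ESS via SE :", ess_from_se(n_total, se_adjusted, se_naive))
```

When the adjusted SE is itself well estimated the two routes agree. The point of the second cautions paragraph above is that any instability or misspecification in `se_adjusted` flows straight through the squared ratio.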

Granholm’s commentary is the connective tissue. ICCs are routinely not reported per outcome in CRT publications, which makes downstream ESS adjustment impossible and biases pooled estimates toward over-precision. The SUN program protocol published in Contemporary Clinical Trials in July, a school-level CRT, is a contemporaneous case in point — competently designed, no ICC priors in the summary. The fix sits downstream of the 2025 SPIRIT/CONSORT updates and is unglamorous: per-outcome ICCs in the CSR, not just for the primary.
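As an illustration of what per-outcome ICC reporting could look like in practice, the sketch below computes a one-way ANOVA ICC for each outcome. The commentary does not prescribe an estimator, so the ANOVA form is an assumption here: it is one common choice for continuous outcomes, and a mixed-model estimate may be preferable in a real analysis. The function names and data layout are hypothetical.

```python
def anova_icc(clusters):
    """One-way ANOVA ICC; `clusters` maps cluster id -> list of outcome values.

    Returns the untruncated estimate, which can be negative.
    """
    sizes = [len(values) for values in clusters.values()]
    k, n = len(sizes), sum(sizes)
    if k < 2 or n <= k or min(sizes) < 1:
        raise ValueError("need >= 2 non-empty clusters and within-cluster replication")

    grand_mean = sum(sum(values) for values in clusters.values()) / n
    means = {c: sum(values) / len(values) for c, values in clusters.items()}

    # Between- and within-cluster mean squares.
    msb = sum(len(v) * (means[c] - grand_mean) ** 2 for c, v in clusters.items()) / (k - 1)
    msw = sum((y - means[c]) ** 2 for c, v in clusters.items() for y in v) / (n - k)

    # Adjusted average cluster size for unequal clusters.
    m0 = (n - sum(m * m for m in sizes) / n) / (k - 1)

    denom = msb + (m0 - 1) * msw
    if denom == 0:
        raise ValueError("no variation in the outcome; ICC is undefined")
    return (msb - msw) / denom


def icc_by_outcome(data):
    """`data` maps outcome name -> {cluster id -> values}; returns one ICC per outcome."""
    return {outcome: anova_icc(clusters) for outcome, clusters in data.items()}


if __name__ == "__main__":
    toy = {  # hypothetical two-outcome, three-school dataset
        "primary": {"s1": [4.0, 5.0, 6.0], "s2": [7.0, 8.0, 9.0], "s3": [5.0, 6.0, 7.0]},
        "secondary": {"s1": [1.0, 3.0, 2.0], "s2": [2.0, 1.0, 3.0], "s3": [3.0, 2.0, 1.0]},
    }
    for outcome, icc in icc_by_outcome(toy).items():
        print(outcome, round(icc, 3))
```

Reporting the untruncated value alongside the number of clusters and the average cluster size gives a meta-analyst enough to rebuild a design effect for each outcome rather than borrowing the primary's.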

What this actually changes

None of this is a regulatory event. It is, collectively, a methodological consensus forming in real time that the defaults (constant-effect SW estimators, trial reports without per-outcome ICCs, ESS and meta-analytic weighting that lean on an assumed ICC) are softer than they look. For biometrics teams running or pooling CRTs, the immediate work is in the SAP: name the estimand the chosen model actually targets, pre-specify the time-varying or random treatment effect model when a learning curve is plausible, and budget for per-outcome ICC reporting in TLF shells. The papers do not yet supply tidy regulatory cover. They do supply citable reasons to stop defaulting.

Protocol read: Under time-varying effects, the constant-effect SW estimator has now been formally shown to converge to a design-dependent weighted average that need not match any natural estimand. Treat that as the new methodological baseline, not as a frontier concern.

What to do now:

  • Audit active SW-CRT SAPs for whether the primary model assumes a constant effect, and pre-specify the exposure-time-averaged estimand and matching estimator where a learning curve is plausible.
  • Add per-outcome ICC reporting (primary and key secondaries) to CRT CSR shells and TLF specifications.
  • Defer adopting SE-derived ESS as a primary reporting metric until small-cluster and correlation-misspecification behaviour is characterised — but use it as a sensitivity check now.