The Biometrics Weekly

The Measurement Layer Gets Audited

Five PRO/COA methodology papers and an EMA workshop on Geographic Atrophy endpoints land in the same week — all pointing at thresholds, crosswalks, and surrogates that biometrics teams have been treating as settled.

  • Patient-reported & clinical outcome assessments
  • Regulatory
  • Methodology Frontier

A simulation study in the Journal of Biopharmaceutical Statistics takes a hard look at the anchor-based minimal clinically important difference (MCID), the threshold that quietly underpins most responder analyses, PRO-based sample-size calculations, and labeling claims, and finds it more bias-prone than its near-universal use suggests. The paper stress-tests the method across anchor-target correlation, measurement error in the anchor, and distributional assumptions about responders, and characterises how the estimated MCID drifts from the true threshold. It does not propose a corrective estimator. It does make clear that the number sponsors routinely paste into Section 9 of the SAP is a point estimate with non-trivial bias and a standard error that nobody is reporting.
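
To make the drift concrete, here is a minimal toy simulation. It is not the paper's design: the data-generating model, the noise levels, the 5-to-10 "minimally improved" anchor band, and every function name are assumptions chosen for illustration. It uses the simple mean-change estimator (mean PRO change among patients whose anchor falls in the band) and shows how the estimate moves as anchor measurement error grows, then bootstraps a standard error at a trial-sized n.

```python
import numpy as np

# Hypothetical anchor band labelled "minimally improved" (assumption).
LOW, HIGH = 5.0, 10.0

def simulate(n, anchor_noise_sd, rng):
    """Toy model: PRO change and anchor are both noisy views of one latent change."""
    true_change = rng.normal(0.0, 10.0, n)                  # latent change
    pro_change = true_change + rng.normal(0.0, 3.0, n)      # observed PRO change
    anchor = true_change + rng.normal(0.0, anchor_noise_sd, n)  # noisy anchor
    return pro_change, anchor

def mean_change_mcid(pro_change, anchor):
    """Mean observed PRO change among patients inside the anchor band."""
    in_band = (anchor >= LOW) & (anchor < HIGH)
    return float(pro_change[in_band].mean())

def bootstrap_se(pro_change, anchor, rng, n_boot=500):
    """Nonparametric bootstrap standard error of the MCID estimate."""
    n = len(pro_change)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        estimates.append(mean_change_mcid(pro_change[idx], anchor[idx]))
    return float(np.std(estimates, ddof=1))

rng = np.random.default_rng(2024)

# Large-sample estimate at three levels of anchor measurement error.
drift = {}
for noise_sd in (0.0, 5.0, 10.0):
    pro, anchor = simulate(200_000, noise_sd, rng)
    drift[noise_sd] = mean_change_mcid(pro, anchor)

# Sampling uncertainty at a more realistic sample size.
pro_small, anchor_small = simulate(300, 5.0, rng)
se_small = bootstrap_se(pro_small, anchor_small, rng)

for noise_sd, estimate in drift.items():
    print(f"anchor noise sd={noise_sd:>4}: MCID estimate={estimate:.2f}")
print(f"bootstrap SE at n=300: {se_small:.2f}")
```

Under this toy model, the estimate shrinks toward the population mean change as the anchor gets noisier, even though the band definition never changes. The bootstrap line is the kind of uncertainty statement that could sit next to an MCID in an SAP; a real analysis would need to match the estimator and anchor actually used.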

That paper does not arrive alone. In the same window, four other peer-reviewed studies push on adjacent parts of the same scaffolding, and EMA convenes a regulatory workshop on the endpoint question in an indication where two products are already approved on the contested measure. Read together, the week looks less like a coincidence and more like a beat — the measurement layer is being audited.

Crosswalks that are useful, and the ways they fail

A multicenter prospective cohort study in the Journal of Clinical Epidemiology applies equipercentile linking across HAMD-17, HAMD-6, QIDS-SR16, and PHQ-9, and confirms the operationally awkward result: correlations are strong, score-level differences are systematic, and raw values are not interchangeable without transformation. For anyone building an external control arm in MDD, running a network meta-analysis, or pooling historical data across protocols that picked different primary instruments, this is a usable tool, and one whose limits the authors are honest about. Equipercentile linking is sample-dependent; transportability to treatment-resistant or pediatric populations is not established. There is also a quieter point worth surfacing in your SAP discussions: the clinician-rated vs self-report split (HAMD vs QIDS-SR16/PHQ-9) is not just measurement noise. It has estimand consequences, because the two rater types target subtly different constructs and handle informative missingness differently.
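
For readers who have not implemented it, the sketch below shows the bare mechanics of equipercentile linking and why the result depends on the derivation sample. It is an illustration under stated assumptions, not the study's procedure: the two "instruments" are synthetic scales generated from a made-up severity variable, the function names are hypothetical, and the presmoothing step that published crosswalks typically apply is omitted.

```python
import numpy as np

def equipercentile_link(x_scores, y_scores):
    """Return a function mapping X scores to the Y score at the same percentile."""
    x_sorted = np.sort(np.asarray(x_scores, dtype=float))
    y_sorted = np.sort(np.asarray(y_scores, dtype=float))
    n = len(x_sorted)

    def convert(x):
        # Mid-percentile rank: half of the ties at x count as "below",
        # which behaves sensibly for integer-valued scales.
        below = np.searchsorted(x_sorted, x, side="left")
        at_or_below = np.searchsorted(x_sorted, x, side="right")
        rank = (below + at_or_below) / (2.0 * n)
        return np.quantile(y_sorted, rank)

    return convert

def synthetic_sample(n, severity_mean, severity_sd, rng):
    """Two made-up integer scales driven by one latent severity (assumption)."""
    severity = rng.normal(severity_mean, severity_sd, n)
    x = np.clip(np.round(10 + 5 * severity + rng.normal(0, 2, n)), 0, None)
    y = np.clip(np.round(8 + 4 * severity + rng.normal(0, 4, n)), 0, None)
    return x, y

rng = np.random.default_rng(7)

# Derive one crosswalk in a general sample and one in a more severe sample.
link_general = equipercentile_link(*synthetic_sample(5000, 0.0, 1.0, rng))
link_severe = equipercentile_link(*synthetic_sample(5000, 1.5, 0.7, rng))

grid = np.arange(0, 31)
table_general = link_general(grid)
table_severe = link_severe(grid)

for x in (5, 10, 15, 20):
    print(f"X={x:>2}: general sample -> {table_general[x]:.1f}, "
          f"severe sample -> {table_severe[x]:.1f}")
```

The same X score converts to different Y values depending on which sample the crosswalk was derived in, even though the item-level relationship between the two synthetic scales is identical in both. That is the sample-dependence caveat in miniature, and it is one reason to carry a linked score into an analysis as a sensitivity input rather than a deterministic conversion.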

A companion paper in the same journal links three versions of the Physical Performance Test onto the PROMIS Physical Function T-score metric, extending IRT-based PRO calibration into the PerfO domain. For geriatric, musculoskeletal, and rehabilitation programmes mixing self-report and performance-based assessments, this finally puts the two on a common scale. The same caveat applies — derivation-sample dependency — and analysis specifications will need to distinguish raw instrument scores from linked PROMIS-equivalent scores rather than treating them as one variable.

On the tolerability side, a study in Clinical Trials validates a single-item PRO for overall side-effect impact against clinician-reported AEs and patient-reported global health — a direct response to regulatory language calling out overall side-effect burden as a tolerability domain, and a recognisable Project Optimus-era artifact. Convergent validity against CTCAE is reassuring; whether one item carries enough content validity and sensitivity to support a labeling claim is a separate question, and not one the abstract resolves.

The pediatric piece is the upstream one. A systematic review catalogues how — or whether — children and young people themselves were included in developing pediatric core outcome sets, as opposed to adult proxies. COS decisions sit upstream of estimands, CRFs, and ADaM specs; when the construct is wrong, downstream rigor cannot recover it. Relevant to anyone working under ICH E11(R1) or contributing to PFDD-style endpoint development.

The regulatory hook

None of this would be more than methodological housekeeping if regulators were not actively re-opening the same questions. EMA’s European Medicines Regulatory Network has published the agenda for a workshop on Geographic Atrophy endpoints, explicitly aimed at functional GA endpoints, retinal imaging, and the translatability of endpoints into patient benefit. The subtext is not subtle: FDA approved pegcetacoplan and avacincaptad pegol on lesion-growth rate, the functional-benefit question was not resolved at approval, and EMA is convening national competent authorities to develop a coordinated position. Sponsors with GA programmes should plan for the possibility that EU acceptability diverges from the US precedent, with sensitivity analyses or functional co-primaries appearing in scientific advice.

The methodological papers and the GA workshop are not the same story, but they rhyme. Thresholds, crosswalks, single-item instruments, surrogate endpoints — each is a place where applied practice has been running ahead of formal validation, and each is now getting examined.

Protocol read: The PRO/COA layer is being stress-tested in public, and “we used the standard instrument with the published MCID” is no longer a defensible stopping point in an SAP or briefing document. Expect reviewers to ask which threshold, derived in which sample, with what uncertainty.

What to do now:

  • Audit anchor-based MCIDs in active SAPs against the bias drivers in the Journal of Biopharmaceutical Statistics simulation; flag any where the anchor-target correlation is weak or anchor measurement error is high.
  • For MDD pooled analyses or external controls, treat the new equipercentile crosswalks as inputs to sensitivity analyses, not deterministic conversions.
  • If you have a GA programme, pressure-test the SAP for a scenario in which EU scientific advice asks for a functional co-primary or sensitivity endpoint alongside lesion growth.