The Biometrics Weekly

Regulators Name the SaMD Statistical Gaps; The Journals Start Filling Them

MHRA's AI Airlock and FDA CDRH have now publicly enumerated the same regulatory-science gaps for AIaMD — and a wave of 2025–26 diagnostic-accuracy papers is racing to make those gaps citeable.

  • Diagnostics & device biostatistics (SaMD)
  • AI/ML
  • Regulatory
  • Biostatistics

On 16 October 2025 the MHRA published its AI Airlock pilot programme report and three simulation workshop reports — on explainability, hallucination evaluation, and post-market surveillance for AIaMD. On 8 April 2026 the agency confirmed multi-year funding and a Phase 2 cohort of seven devices spanning clinical note-taking, cancer diagnostics, eye disease detection, and obesity support. The sandbox is no longer a one-off experiment; it is becoming durable UK regulatory infrastructure.

Read it next to FDA CDRH’s AI Program page and the convergence is hard to miss. CDRH names six regulatory-science gaps: limited labelled data, bias detection and mitigation, performance metrics and uncertainty quantification, evaluation of continuously learning algorithms, emerging clinical AI use cases, and post-market monitoring. The Airlock workshops target three of those six directly. Two regulators, the same list. That is the story.

Where the literature is catching up

The biometrics literature has, on cue, started producing the tools. The most operationally useful is psF1, published in Statistics in Medicine in April 2026: a unified framework for confidence intervals, hypothesis tests, and prospective sample-size calculation around F1 and Fβ scores, with exact small-sample distributions, asymptotic approximations, and an open R implementation. For anyone who has had to defend an AI classifier’s pre-specified performance threshold in a submission without citeable inference machinery for the metric, this is the gap closing in real time. The Fβ generalisation also lets you up-weight sensitivity in screening contexts (β > 1 weights recall above precision) without inventing your own justification.
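
To make the idea of interval inference on Fβ concrete, here is a minimal Python sketch. It is not the psF1 method and does not reproduce its exact or asymptotic results; it swaps in a generic percentile bootstrap, and the function names (f_beta, confusion_counts, bootstrap_ci) are hypothetical, chosen only for illustration.

```python
import random


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from confusion counts; beta > 1 weights recall (sensitivity) more."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: return 0.0 when the score is undefined.
    return (1 + b2) * tp / denom if denom else 0.0


def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def bootstrap_ci(y_true, y_pred, beta=1.0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F-beta (illustrative, not psF1)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample cases with replacement, keeping truth and prediction paired.
        idx = [rng.randrange(n) for _ in range(n)]
        tp, fp, fn = confusion_counts([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])
        stats.append(f_beta(tp, fp, fn, beta))
    stats.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]
```

Calling bootstrap_ci(y_true, y_pred, beta=2.0) on paired 0/1 labels returns a lower and upper bound for F2. In a submission the pre-specified threshold would be compared against the lower bound rather than the point estimate, and small samples are precisely where a bootstrap like this is weakest and exact methods matter.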

The same gap is being attacked from other angles. A Bayesian framework in JBS compares F1 scores between classifiers when no gold standard exists, which is exactly the latent-class problem CDRH flags under “absence of reference standards.” A Statistics in Medicine paper extends the concordance correlation coefficient to a three-level generalised linear mixed-effects model (GLMEM) with fiducial CIs, framed explicitly for algorithm-vs-reader agreement. Three-class Youden index methods with verification-bias correction and covariate adjustment extend ROC machinery into the multi-class settings that clinical taxonomies actually present. A J Clin Epi commentary calls outright for methodological guidance where no reference standard exists. Two years ago a sponsor citing F1 in an SaMD submission would have leaned on a textbook definition; now there is a peer-reviewed, software-backed inference framework to point at.
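
For readers less familiar with concordance correlation, the sketch below computes the basic sample coefficient (Lin's form) for paired algorithm and reader measurements. It is only the starting quantity: it ignores clustering entirely and does not implement the three-level mixed model or the fiducial intervals of the paper mentioned above. The function name is hypothetical.

```python
def concordance_correlation(x, y):
    """Sample concordance correlation coefficient for paired measurements.

    Uses 1/n variances and covariance. Unlike Pearson correlation, it
    penalises systematic shift between the two measurement series.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("coefficient undefined: both series are the same constant")
    return 2 * cov_xy / denom
```

An algorithm that tracks a reader perfectly but reads one unit high on every case has Pearson correlation 1 yet a concordance coefficient below 1, which is why agreement studies prefer it to plain correlation.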

What the FDA actually did this quarter

Methods papers are not the only signal. On 1 June 2026 FDA finalised a Class II De Novo classification for a real-time ultrasound anatomy visualisation and labeling device — an AI imaging tool — under 21 CFR Part 868 special controls, retroactive to October 2022. That establishes a 510(k) predicate and locks in performance, software-validation, and labelling requirements for every follow-on anatomy-recognition device. The Medasense PMD-200 pain-measurement classification did the same for algorithmic physiological signal processing. The pattern: De Novo to calibrate risk class, special controls to fix the evidentiary bar, then scale via 510(k). Sponsors of similar tools should be reading the special controls as the operative standard, not the device-specific decision.

The MHRA reports are explicitly non-binding and the methodology papers are mostly new, so uptake will depend on real-data validation and reviewer familiarity. The cross-regulator alignment, however, makes it likelier than not that uncertainty quantification, drift monitoring, and bias evaluation will be expected in AIaMD submissions within the next guidance cycle. That expectation is no longer speculative.

Protocol read: The regulators have publicly itemised the AIaMD statistical gaps and the literature is producing citeable, software-backed methods against the same list — teams that wait for formal guidance will be writing SAPs against tools their reviewers already know.

What to do now:

  • Pre-specify AI classifier performance thresholds with a citeable inference framework (psF1 for F1/Fβ; latent-class Bayesian methods where no gold standard exists) rather than relying on point estimates alone.
  • Map current SaMD validation packages against CDRH’s six named gaps; flag drift monitoring and bias evaluation as the two least defensible today.
  • Treat the Class II De Novo special controls for ultrasound anatomy labeling and PMD-200 as the operative validation benchmark for any anatomy-recognition or physiological-signal AI tool in your pipeline.
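
The mapping exercise in the second bullet can be sketched as a simple coverage check. The six gap names are taken from the list above; the evidence structure, the function name, and the document references in the usage example are hypothetical, and a real validation package would need a far richer record than a list of file names.

```python
# The six gaps as named on CDRH's AI Program page (wording from this article).
CDRH_GAPS = (
    "limited labelled data",
    "bias detection and mitigation",
    "performance metrics and uncertainty quantification",
    "evaluation of continuously learning algorithms",
    "emerging clinical AI use cases",
    "post-market monitoring",
)


def uncovered_gaps(evidence):
    """Return the gaps with no supporting evidence, in CDRH list order.

    `evidence` maps a gap name to a list of document references
    (hypothetical structure, for illustration only).
    """
    unknown = set(evidence) - set(CDRH_GAPS)
    if unknown:
        raise ValueError(f"unrecognised gap names: {sorted(unknown)}")
    return [gap for gap in CDRH_GAPS if not evidence.get(gap)]


if __name__ == "__main__":
    # Hypothetical package: only two gaps have any supporting documents.
    package = {
        "performance metrics and uncertainty quantification": ["SAP section 9.2"],
        "limited labelled data": ["Data management plan"],
        "post-market monitoring": [],
    }
    for gap in uncovered_gaps(package):
        print("No evidence on file:", gap)
```

An empty list counts as uncovered on purpose, so a placeholder entry cannot pass for evidence.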