The Biometrics Weekly

AI Diagnostic Submissions Finally Have a Principled F1 Toolkit

Two peer-reviewed methods papers give AI/ML SaMD analysts citable F1 inference tools, arriving just as FDA lists standardised performance metrics as an unresolved regulatory science gap and MHRA’s AI Airlock moves into Phase 2.

  • Diagnostics & device biostatistics (SaMD)
  • AI/ML
  • Biostatistics
  • Regulatory

If you have written a statistical analysis plan for an AI-based diagnostic device in the last several years, you have almost certainly pre-specified an F1 score threshold with no principled method for justifying the sample size behind it. That is not incompetence — it is a description of the field. Until now, the frequentist machinery for confidence intervals, hypothesis tests, and prospective power calculations around F1 and Fβ did not exist in citable, usable form.

Two papers, published a little over a year apart, fix that in complementary directions.

A Statistics in Medicine paper (April 2026) introduces psF1, which covers single-classifier inference and head-to-head Fβ comparison, using exact distributions for small samples and asymptotic approximations at scale. The Fβ generalisation matters clinically: setting β > 1 up-weights recall over precision, which is the appropriate choice for screening applications where missing a case costs more than a false alarm. The package installs from GitHub rather than CRAN, and its behaviour at the extreme prevalences typical of rare-disease screening remains an open question, so check it before psF1 anchors a submission SAP.
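To make the β weighting concrete, here is a minimal Python sketch of the standard Fβ definition computed from confusion-matrix counts. It is not the psF1 interface; the function name `f_beta` and the counts are illustrative assumptions.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta from counts; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    if denom == 0:
        raise ValueError("F-beta is undefined for these counts")
    return (1 + b2) * tp / denom

# Hypothetical screening result: recall 0.80 (80/100), precision about 0.67 (80/120)
tp, fp, fn = 80, 40, 20
f_half = f_beta(tp, fp, fn, beta=0.5)
f_one = f_beta(tp, fp, fn, beta=1.0)
f_two = f_beta(tp, fp, fn, beta=2.0)
print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Because recall exceeds precision in this example, the score rises as β grows. A device with the opposite profile would be penalised by the same choice of β, which is why the value belongs in the SAP before unblinding.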

When adjudicated labels are unavailable, a common SaMD reality, a Journal of Biopharmaceutical Statistics paper (January 2025) fills the gap with a Bayesian latent class model. It estimates true disease status from two imperfect classifiers and produces full posterior distributions over F1 differences without requiring a definitive reference standard. The method’s load-bearing assumption is that the two tests’ errors are conditionally independent given true disease status, and that assumption is routinely violated when both tests share a common failure mode. Published validation on labelled real-world datasets is not yet available, a gap worth tracking before the method is leaned on at submission.
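The conditional-independence assumption is easier to reason about with the cell probabilities written out. The sketch below uses the generic two-test latent class structure with optional covariance terms for dependent errors; it is not the paper's model or code, and the function name, parameter names, and numeric values are all assumptions chosen for illustration.

```python
def cell_probs(prev, se1, sp1, se2, sp2, cov_dis=0.0, cov_non=0.0):
    """Joint probabilities of (test1, test2) results; 1 = positive, 0 = negative.

    cov_dis / cov_non are the error covariances among diseased / non-diseased
    subjects. Both equal to zero is conditional independence.
    """
    probs = {}
    for t1 in (0, 1):
        for t2 in (0, 1):
            # Probability of each test's result within the diseased class
            p1_d = se1 if t1 else 1 - se1
            p2_d = se2 if t2 else 1 - se2
            # Probability of each test's result within the non-diseased class
            p1_n = (1 - sp1) if t1 else sp1
            p2_n = (1 - sp2) if t2 else sp2
            # Covariance adds to concordant cells and subtracts from discordant ones
            sign = 1 if t1 == t2 else -1
            probs[(t1, t2)] = (prev * (p1_d * p2_d + sign * cov_dis)
                               + (1 - prev) * (p1_n * p2_n + sign * cov_non))
    return probs

# Hypothetical values: 10% prevalence, two reasonably accurate classifiers
independent = cell_probs(0.10, 0.90, 0.95, 0.85, 0.90)
# Same classifiers, but they tend to miss the same diseased cases
dependent = cell_probs(0.10, 0.90, 0.95, 0.85, 0.90, cov_dis=0.05)

for cell in sorted(independent):
    print(cell, round(independent[cell], 4), round(dependent[cell], 4))
```

With a shared failure mode, the both-negative and both-positive cells grow at the expense of the discordant cells. A model that assumes independence has to explain that extra agreement some other way, typically by shifting its accuracy or prevalence estimates, which is what a pre-specified sensitivity analysis is meant to expose.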

The regulatory framing sharpens the timing. FDA’s CDRH OSEL AI Program page explicitly lists “lack of standardised metrics for performance estimation, reference standards, and uncertainty quantification” as one of six unresolved regulatory science gaps. MHRA’s AI Airlock pilot (closed April 2025) has produced non-binding simulation reports on hallucination evaluation and post-market surveillance, with Phase 2 — covering advanced cancer diagnostics and eye disease detection — funded through Summer 2026. Neither constitutes a compliance deadline, but the directional signal is consistent. The window between “methods exist” and “reviewers expect them” is usually shorter than sponsors anticipate.

Protocol read: AI-diagnostic SaMD submissions can now justify F1-based sample sizes and head-to-head comparisons with peer-reviewed methodology rather than vibes — but the two new tools carry assumptions that will not survive a careful reviewer without pre-specified diagnostics.

What to do now:

  • Replace ad-hoc F1 threshold choices in SaMD SAPs with psF1-based confidence intervals and prospective power calculations; flag the extreme-prevalence caveat where applicable.
  • Where adjudicated labels are unavailable, use the Journal of Biopharmaceutical Statistics Bayesian latent-class framework, but pre-specify a sensitivity analysis that relaxes the conditional-independence assumption.
  • Track MHRA AI Airlock Phase 2 outputs through Summer 2026; neither regulator has set a compliance deadline, but reviewer expectations tend to firm up before one arrives.
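For the first action item, a quick way to see how wide an F1 interval is at a planned sample size is a nonparametric percentile bootstrap. To be clear, this is a swapped-in generic technique, not the exact or asymptotic method from the psF1 paper, and the function names and synthetic counts below are assumptions. It may be useful as a sanity check alongside the published method rather than as a replacement for it.

```python
import random

def f1_from_pairs(pairs):
    """F1 from (truth, prediction) pairs coded 1 = positive, 0 = negative."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling subjects with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    stats = []
    for _ in range(n_boot):
        value = f1_from_pairs(rng.choices(pairs, k=len(pairs)))
        if value == value:  # drop resamples where F1 is undefined (NaN)
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return f1_from_pairs(pairs), lower, upper

# Synthetic study of 200 subjects: TP=32, FN=8, FP=12, TN=148 (20% prevalence)
y_true = [1] * 40 + [0] * 160
y_pred = [1] * 32 + [0] * 8 + [1] * 12 + [0] * 148
point, lower, upper = bootstrap_f1_ci(y_true, y_pred)
print(f"F1={point:.3f}, 95% bootstrap CI=({lower:.3f}, {upper:.3f})")
```

Rerunning the sketch with fewer diseased subjects shows how quickly the interval widens at low prevalence, which is the regime the extreme-prevalence caveat is about.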