F-Score Metrics Are Finally Getting the Statistical Plumbing They Deserve

F1 and Fβ scores are everywhere in medical AI benchmarking. What they have not had, until recently, is the statistical scaffolding that serious regulatory submissions require: pre-specified thresholds, confidence intervals, power calculations, and principled methods for comparison. A cluster of new papers suggests the field is beginning to close that gap — though not all at the same pace, and not without some internal disagreement about whether the metric is worth saving at all.

The most immediately practical contribution is psF1, published in Statistics in Medicine (April 2026). It provides a unified framework for confidence interval estimation, hypothesis testing, and, critically, prospective sample size planning for single-sample and comparative F1/Fβ inference. It covers both exact small-sample distributions and large-sample asymptotic approximations, so it is usable across the study sizes teams actually encounter in diagnostic AI validation. Code is available as an R package on GitHub. For biostatisticians writing statistical analysis plans (SAPs) or power justifications for software as a medical device (SaMD) validation studies, this is a citable, peer-reviewed tool where there was effectively nothing before. That is not a small thing.
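To make "a confidence interval for F1" concrete, here is a minimal Python sketch. It is not the psF1 method and does not mirror the R package's interface; it swaps in a generic percentile bootstrap over the confusion matrix, and the function names and the counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from confusion-matrix counts. True negatives are not used."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0

def bootstrap_f_beta_ci(tp, fp, fn, tn, beta=1.0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an illustrative stand-in, not psF1).

    Resamples the four cells (TP, FP, FN, TN) as one multinomial draw
    of the original sample size, then recomputes F-beta each time.
    """
    rng = np.random.default_rng(seed)
    counts = np.array([tp, fp, fn, tn], dtype=float)
    n = int(counts.sum())
    draws = rng.multinomial(n, counts / n, size=n_boot)
    stats = np.array([f_beta(d[0], d[1], d[2], beta) for d in draws])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return f_beta(tp, fp, fn, beta), float(lo), float(hi)

# Hypothetical validation study: 1,000 cases, 90 of them truly positive.
point, lo, hi = bootstrap_f_beta_ci(tp=80, fp=20, fn=10, tn=890)
print(f"F1 = {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

A bootstrap interval like this can only be computed after the data are in. It says nothing about how many cases to enroll, which is the prospective planning gap the psF1 paper is described as filling.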

The harder problem: when ground truth is missing

The Bayesian comparator in Journal of Biopharmaceutical Statistics (January 2025) tackles a thornier issue: how do you compare F1 scores between two classifiers when you have no adjudicated reference standard? This is not a theoretical edge case. In real-world clinical data, ground truth is routinely imperfect, expensive, or structurally unavailable. The latent class approach borrows strength across imperfect tests and returns a full posterior distribution over the F1 difference — an advantage over point-estimate frequentist comparisons when sample sizes are modest, which in clinical AI studies they often are. Regulatory applicability will depend on sponsor and reviewer comfort with Bayesian framing, but the problem it solves is real.
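For readers unfamiliar with what "a full posterior distribution over the F1 difference" looks like, the sketch below shows the idea in a deliberately simplified setting. The big assumption, stated loudly: it treats the reference standard as known, which is exactly what the paper's latent class model does not assume. It uses a plain Dirichlet-multinomial model over paired outcomes, and every name and count in it is hypothetical.

```python
import numpy as np

def posterior_f1_difference(counts, n_draws=4000, prior=1.0, seed=0):
    """Posterior draws of F1(A) - F1(B) for two classifiers on the same cases.

    ASSUMPTION: ground truth is known (NOT the latent class setting).
    counts[y, a, b] = number of cases with truth y, A's call a, B's call b.
    A symmetric Dirichlet(prior) is placed on the eight cell probabilities.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)  # shape (2, 2, 2)
    theta = rng.dirichlet(counts.ravel() + prior, size=n_draws)
    theta = theta.reshape(n_draws, 2, 2, 2)

    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn)

    # Classifier A: sum over B's prediction (last axis).
    tp_a = theta[:, 1, 1, :].sum(axis=1)
    fp_a = theta[:, 0, 1, :].sum(axis=1)
    fn_a = theta[:, 1, 0, :].sum(axis=1)
    # Classifier B: sum over A's prediction (middle axis).
    tp_b = theta[:, 1, :, 1].sum(axis=1)
    fp_b = theta[:, 0, :, 1].sum(axis=1)
    fn_b = theta[:, 1, :, 0].sum(axis=1)
    return f1(tp_a, fp_a, fn_a) - f1(tp_b, fp_b, fn_b)

# Hypothetical paired results on 1,000 cases (100 truly positive).
counts = [
    [[850, 30], [15, 5]],   # truth negative: rows = A's call, cols = B's call
    [[5, 5], [15, 75]],     # truth positive
]
diff = posterior_f1_difference(counts)
lo, hi = np.quantile(diff, [0.025, 0.975])
print(f"P(F1_A > F1_B) = {np.mean(diff > 0):.3f}")
print(f"95% credible interval for the difference: ({lo:.3f}, {hi:.3f})")
```

The output is a distribution, so a team can report the probability that one classifier beats the other instead of a single point difference. The latent class approach described above would additionally treat the truth labels as unobserved, which this sketch does not attempt.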

The dissent: does Fβ even belong in the clinic?

Not everyone is building on the Fβ foundation. A separate Journal of Biopharmaceutical Statistics paper argues that Fβ scores, borrowed wholesale from the ML classification literature, are poorly suited to medical diagnostics, primarily because they omit true negatives entirely. Specificity, which matters enormously in low-prevalence screening contexts, simply does not appear in the Fβ formula. The authors propose alternative accuracy measures. The practical question is whether these will gain any regulatory traction, given that novel summary statistics in diagnostics tend to need explicit FDA or EMA endorsement before sponsors will risk using them in submissions.
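The true-negative point is easy to verify with arithmetic. The short Python sketch below uses two hypothetical confusion matrices that differ only in their true-negative count; the numbers are invented for illustration and do not come from the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from counts. Note that tn is not an argument at all."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)

# Same TP, FP, FN; only the number of true negatives changes.
small = dict(tp=80, fp=20, fn=10, tn=90)
large = dict(tp=80, fp=20, fn=10, tn=9890)

f1_small = f_beta(small["tp"], small["fp"], small["fn"])
f1_large = f_beta(large["tp"], large["fp"], large["fn"])
spec_small = specificity(small["tn"], small["fp"])
spec_large = specificity(large["tn"], large["fp"])

print(f"F1:          {f1_small:.3f} vs {f1_large:.3f}")
print(f"Specificity: {spec_small:.3f} vs {spec_large:.3f}")
```

F1 is identical in both cases (160/190, about 0.842), while specificity moves from 90/110 (about 0.818) to 9890/9910 (about 0.998). A reviewer looking only at F1 could not tell these two diagnostic profiles apart.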

A fourth paper extends the conversation further still, proposing an area under F-score curves (AUF) — a threshold-integrated summary measure analogous to AUROC, motivated by the known weakness of AUROC in class-imbalanced settings. The idea is sound in principle; whether AUF offers enough over precision-recall AUC to justify adoption in a regulatory context is an open question the paper does not appear to fully resolve.
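As a rough illustration of what a threshold-integrated F-score summary could look like, the Python sketch below computes F1 on a grid of decision thresholds and integrates with the trapezoid rule. The paper's exact AUF definition (integration variable, normalization, choice of β) is not given here, so this is one plausible reading under stated assumptions, not the authors' formula. Function names and data are hypothetical.

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    """F1 when cases with score >= threshold are called positive."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def area_under_f1_curve(y_true, scores, n_grid=101):
    """ASSUMED definition: integral of F1 over thresholds in [0, 1]."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_grid)
    f1 = np.array([f1_at_threshold(y_true, scores, t) for t in thresholds])
    # Trapezoid rule written out explicitly.
    return float(np.sum((f1[1:] + f1[:-1]) / 2 * np.diff(thresholds)))

# Hypothetical imbalanced sample: 10% prevalence, noisy scores in [0, 1].
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.10).astype(int)
s = np.clip(0.3 + 0.4 * y + rng.normal(0.0, 0.15, size=2000), 0.0, 1.0)
auf = area_under_f1_curve(y, s)
print(f"Area under the F1-vs-threshold curve: {auf:.3f}")
```

Unlike AUROC, a summary built this way depends on prevalence through the precision term, which is the motivation in imbalanced settings. It also depends on how the scores are scaled, one reason the comparison with precision-recall AUC is not obvious.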

Taken together, these papers reflect a field in active negotiation over whether to fix F-scores, extend them, or partially replace them. For biometrics teams, the immediate action is narrow but concrete: psF1 is worth evaluating now for any AI/ML diagnostic validation study requiring pre-specified performance thresholds and sample size justification. The broader metric debate is worth tracking, but is not yet submission-ready.

The Protocol
