The Biometrics Weekly

FDA's AI Pilot RFI and a Flurry of Methods Papers Force a Useful Question

A live FDA comment deadline, a serious AI sample-size preprint, and evidence that medical AI models leaning on hidden data-acquisition biases overstate their external performance by ~20% on average: three signals that do not all point in the same direction.

  • AI/ML in clinical development
  • Biostatistics
  • Regulatory
  • AI/ML

The comment window for FDA’s AI-Enabled Optimization of Early-Phase Clinical Trials Pilot Program RFI closes June 29 (Docket No. FDA-2026-N-4390). The deadline was extended at stakeholder request, which makes this a rare moment where a well-crafted comment on adaptive dose-finding, trial simulation, or go/no-go frameworks could actually shape the evaluation criteria. If your organization has an early-phase AI capability and has not yet looked at the docket, now is the time.

The more substantive methodological question is what a defensible AI contribution to registrational trial design actually requires. An arXiv preprint (2605.23246) takes that question seriously, proposing that prospectively validated, locked AI-derived prognostic covariates could feed covariate adjustment to reduce sample sizes in registrational RCTs. The authors walk through all seven steps of FDA’s draft AI credibility framework and demonstrate the concept with an Alzheimer’s disease example, a setting where endpoint variance is a perennial obstacle. They do not commit to a headline reduction figure. The mechanism is conceptually familiar (variance reduction via ANCOVA logic: adjusting for a covariate that predicts the outcome shrinks residual variance, and the required sample size shrinks with it), but the AI-model qualification layer is not. Treat it as a serious methodological proposal, not a deployable protocol.
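To make the ANCOVA logic concrete, here is a minimal sketch of the textbook normal-approximation sample-size formula for a two-arm comparison of means, with the residual variance scaled by (1 - rho^2) when a baseline covariate correlated rho with the outcome is adjusted for. It is not taken from the preprint: the function name `n_per_arm`, its parameters, and the example values (standardized effect 0.5, rho 0.5) are illustrative assumptions only.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sigma, alpha=0.05, power=0.80, rho=0.0):
    """Approximate per-arm sample size for a two-arm comparison of means.

    delta : treatment difference to detect (hypothetical value)
    sigma : outcome standard deviation without covariate adjustment
    rho   : assumed correlation between the prognostic covariate and
            the outcome; rho = 0 means no adjustment
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # two-sided type I error
    z_beta = z(power)
    # ANCOVA logic: adjustment leaves residual variance sigma^2 * (1 - rho^2)
    residual_var = sigma ** 2 * (1 - rho ** 2)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * residual_var / delta ** 2)


unadjusted = n_per_arm(delta=0.5, sigma=1.0)
adjusted = n_per_arm(delta=0.5, sigma=1.0, rho=0.5)
print(f"per arm, unadjusted: {unadjusted}; adjusted (rho=0.5): {adjusted}")
```

The saving depends entirely on rho holding up in the trial population, which is why the qualification layer matters more than the arithmetic. The approximation also ignores the small degrees-of-freedom cost of estimating the covariate effect.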

The necessary counterweight is an ASA BIOP commentary citing empirical findings across 13 passively collected clinical datasets: medical AI models relying on hidden data-acquisition biases overestimated external performance by ~20% on average. Strong internal predictive performance, in other words, does not guarantee external performance, and it is still less evidence of causal validity. The appropriate role for AI is as richer input to formal causal frameworks, not as a standalone evidence generator.
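A toy simulation can illustrate how an acquisition shortcut inflates internal performance. The sketch below uses purely synthetic data and does not attempt to reproduce the ~20% figure or the commentary's method. All names and agreement rates are hypothetical. A one-feature classifier chooses between a genuine signal and an acquisition marker that tracks the label only at the internal site.

```python
import random


def simulate(n, shortcut_agreement, rng, signal_agreement=0.70):
    """Return synthetic rows of (label, signal, shortcut), all binary.

    signal   : genuine predictor, agrees with the label 70% of the time
    shortcut : acquisition marker whose agreement with the label depends
               on where the data were collected
    """
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        signal = y if rng.random() < signal_agreement else 1 - y
        shortcut = y if rng.random() < shortcut_agreement else 1 - y
        rows.append((y, signal, shortcut))
    return rows


def accuracy(rows, feature_index):
    """Accuracy of predicting the label directly from one binary feature."""
    return sum(r[feature_index] == r[0] for r in rows) / len(rows)


def fit_stump(train_rows):
    """Pick the feature (1 = signal, 2 = shortcut) with the best training accuracy."""
    return max((1, 2), key=lambda j: accuracy(train_rows, j))


rng = random.Random(0)
train = simulate(5000, shortcut_agreement=0.90, rng=rng)
internal_test = simulate(5000, shortcut_agreement=0.90, rng=rng)
# At the external site the marker carries no information about the label.
external_test = simulate(5000, shortcut_agreement=0.50, rng=rng)

chosen = fit_stump(train)
internal_acc = accuracy(internal_test, chosen)
external_acc = accuracy(external_test, chosen)
print(f"feature chosen: {'shortcut' if chosen == 2 else 'signal'}")
print(f"internal accuracy: {internal_acc:.3f}, external accuracy: {external_acc:.3f}")
```

The internal hold-out set looks excellent because it shares the acquisition artifact with the training data, so only external data exposes the gap.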

That distinction is the sorting question worth applying to every AI contribution under evaluation: does this touch inference, or only speed? Veristat’s InStat™ — AI agents translating biostatistician specs into inputs for pre-validated statistical engines, with LLM-generated code explicitly excluded from regulatory-facing outputs — is the clearest current example of the speed side of that ledger. The regulatory and methodological bar for the inference side is categorically higher, and the preprint above is the most serious current attempt to map what crossing it actually requires.

Protocol read: The methodology and the regulatory pathway are both starting to take shape, but most AI offers currently being pitched touch inference — where the evidentiary bar is high — without the validation infrastructure that bar requires. Use “inference vs. speed” as the disciplined sorting question for every internal proposal.

What to do now:

  • File a comment on Docket FDA-2026-N-4390 before June 29 if your organization has any early-phase AI capability; the docket is unusually shapeable right now.
  • Apply the inference-vs-speed sorting question to every AI tool in evaluation; inference-side claims must come with prospective model qualification and external validation, not retrospective benchmarks.
  • Read the arXiv preprint and the BIOP “AI patterns” commentary together before any AI-prognostic-covariate proposal lands in a registrational SAP.