Data Management

Three percent of Dryad datasets fail a duplication scan — and that's just the pilot

Three converging items this week reframe source-data trust as the binding constraint on the preclinical, biomarker, and repository data that ultimately feeds regulatory submissions.

Data quality, integrity & cleaning
Biostatistics
Quality / GxP

Eighteen of 600 biomedical datasets on the Dryad repository — about 3% — showed “serious-looking” copy-paste duplications when a volunteer group ran automated detection software against them, with one paper already retracted before the scan even began. The next phase targets roughly 24,000 Excel-based Dryad datasets, which will tell us whether the 3% rate holds, climbs, or collapses at scale. None of this is GCP data. All of it is upstream of decisions your sponsors make.
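To see how much room the pilot leaves for the rate to move, here is a minimal sketch that puts a 95% Wilson score interval around 18 of 600. It assumes, purely for illustration, that the 600 scanned datasets behave like a random sample of the roughly 24,000 still to come; the source does not say that, and the next batch is defined differently (Excel-based rather than biomedical). The function name is ours.

```python
from math import sqrt

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Pilot figures from the article: 18 flagged out of 600 scanned.
low, high = wilson_interval(18, 600)
print(f"pilot rate {18 / 600:.1%}, 95% interval {low:.1%} to {high:.1%}")

# Illustrative projection onto the ~24,000-dataset phase (assumption: same underlying rate).
print(f"implied flagged datasets: {low * 24000:.0f} to {high * 24000:.0f}")
```

On these numbers the interval runs from roughly 1.9% to 4.7%, so a second-phase rate anywhere in that band would be consistent with the pilot rather than evidence of a shift.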

The three cases Derek Lowe walks through are instructive precisely because they aren’t uniform. A 2016 Cell paper on gut microbiota and Parkinson’s contains duplications that materially move the reported results. A PLOS Genetics paper on evolved toxin resistance has anomalies the authors cannot readily explain. A 2017 Nature Communications clonal-fish study turned out to contain a file-merging error that produced four-fold repeating values; the authors acknowledged it and corrected the dataset. Possible fraud, ambiguity, and honest merge error, all flagged by the same tool. The signal worth tracking isn’t the 3% headline; it’s the distribution of author responses when the email arrives.

The same week, the labels themselves are wrong

While the Dryad scan was running, a Science paper genotyped 611 tissue samples across 341 MMRRC mouse strains and found nearly half did not match their labeled strain identity. Failure modes include incomplete congenicity, residual third-strain contributions, insufficient inbreeding, and naming-convention limits baked into ICSGNM itself: the nomenclature cannot encode what’s actually in the animal. The proposed fix is a standardized MMRRC Strain GQC Report capturing strain type, allele of interest, primary and secondary backgrounds, replicability, and flags for Cre or fluorescent-protein transgenes; more than 100 strains already have one. Methods papers on whole-cage randomization and Bayesian sample-size reduction assume the animals are what the cage card says. Nearly half the time, they aren’t.

A third item rounds out the week: formal Matters Arising rebuttals to the Nature Medicine brain and NEJM cardiovascular microplastics-in-human-tissue papers argue that pyrolysis GC/MS can generate polyethylene-like fragments from endogenous long-chain fatty acids in lipid-rich tissue, and that the original work lacked procedural blanks, background correction, and contamination safeguards during surgical collection. The dispute is about method validity, not the biology. It’s the bioanalytical-method-validation conversation playing out in public, badly.
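For readers who have not worked with procedural blanks, a minimal sketch of what blank-based background correction looks like in practice. It is not the method used in either disputed paper. The thresholds (3 and 10 standard deviations of the blank signal for detection and quantitation) are a common rule of thumb assumed here for illustration, and the function name and sample values are hypothetical.

```python
from statistics import mean, stdev

def blank_correct(samples, blanks, k_lod=3.0, k_loq=10.0):
    """Subtract the mean procedural-blank signal and classify each sample.

    k_lod and k_loq multiply the blank standard deviation; both are
    assumed conventions, not values taken from the papers discussed.
    """
    blank_mean = mean(blanks)
    blank_sd = stdev(blanks)          # sample standard deviation of the blanks
    lod = k_lod * blank_sd            # detection threshold on corrected signal
    loq = k_loq * blank_sd            # quantitation threshold on corrected signal
    results = []
    for name, signal in samples.items():
        corrected = signal - blank_mean
        if corrected < lod:
            status = "below LOD"
        elif corrected < loq:
            status = "detected, not quantifiable"
        else:
            status = "quantifiable"
        results.append((name, corrected, status))
    return results

# Hypothetical instrument readings, arbitrary units.
blanks = [10.0, 12.0, 11.0, 9.0, 13.0]
samples = {"A": 14.0, "B": 20.0, "C": 40.0}
for row in blank_correct(samples, blanks):
    print(row)
```

Without the blanks there is no way to tell sample A, whose signal is indistinguishable from background, from sample C. That is the gap the rebuttals point to.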

Why a data-management desk cares about preclinical and academic data

Because it reaches you. Preclinical readouts from mislabeled animals seed dose-finding and translational models. Biomarker assays without procedural blanks become endpoint assays. Public repository data ends up in external control arms and RWE submissions. And the regulatory tail of this is not theoretical — FDA’s proposed market withdrawal of Tavneos over pivotal-trial data-integrity failures and the T3D Therapeutics Alzheimer’s matter are the GCP-side version of exactly the same root causes: audit-trail gaps, source-data trust, and no one running the duplication check until it was far too late.

The defensive posture is straightforward in principle and unevenly practiced. Automated duplication detection is cheap; most sponsors do not run it routinely against incoming preclinical or biomarker datasets, and almost no one runs it against the public datasets they cite as comparators. Procedural blanks and background correction are bioanalytical hygiene; reviewers should be asking for them by default on any near-LOQ assay supporting an endpoint. Strain-identity verification for animal models is a one-time genotyping cost set against the price of a translational program built on the wrong genetic background. None of this is novel methodology. It is ALCOA+ applied where ALCOA+ was never enforced.
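To show how low the barrier is, here is a minimal sketch of one kind of duplication check: finding runs of consecutive values that recur elsewhere in the same column, the signature a copy-pasted block leaves behind. This is not the volunteers' software, whose internals the source does not describe. The function name, window length, and sample data are all hypothetical.

```python
from collections import defaultdict

def find_repeated_runs(values, window=4):
    """Return {run: [start indices]} for each run of `window` consecutive
    values that occurs at two or more non-overlapping positions."""
    positions = defaultdict(list)
    for i in range(len(values) - window + 1):
        run = tuple(values[i:i + window])
        if len(set(run)) < 2:
            continue  # ignore constant runs such as blocks of zeros
        # Record only occurrences that do not overlap the previous one.
        if not positions[run] or i - positions[run][-1] >= window:
            positions[run].append(i)
    return {run: starts for run, starts in positions.items() if len(starts) > 1}

# Hypothetical measurement column with one pasted block of four values.
column = [1.2, 3.4, 5.6, 7.8, 9.1, 2.2, 1.2, 3.4, 5.6, 7.8, 4.4]
for run, starts in find_repeated_runs(column).items():
    print(f"run {run} repeats at rows {starts}")
```

A hit is a prompt for a question, not a finding. Low-precision or categorical columns will repeat by chance, so the window length and the follow-up review need to be set per dataset.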

The Dryad volunteers will scan 24,000 datasets next. Someone in your organisation should be deciding now whether the preclinical and biomarker datasets supporting your active programs would survive the same scan — before a reviewer, a journalist, or a competitor decides for you.

Protocol read: The Dryad 3% is a floor, not a ceiling, and the same class of source-data failures is already showing up in proposed approval withdrawals. Treat preclinical and biomarker datasets with the integrity controls you’d apply to a CRF.

What to do now:

Run an automated duplication scan against the preclinical and biomarker datasets supporting your active programs; document the result either way.
Require procedural blanks, background correction, and reference standards on any near-LOQ bioanalytical assay feeding a regulatory endpoint, and have biometrics review the controls before lock.
For animal-model programs, request strain-identity genotyping or the equivalent of an MMRRC GQC Report from your CRO before relying on translational readouts.
