bioRxiv normally links a preprint to its published paper automatically, usually within two weeks of journal publication, once its matching system recognises the journal article as the same work, chiefly from a closely matching title and author list. A newly published dataset, PreprintToPaper, has now mapped this process across 145,517 bioRxiv preprints, recording which preprints reach a journal, how long that journey takes, and how titles, abstracts, and author lists change along the way.
The PreprintToPaper dataset is an openly available metadata collection — created by researchers Fidan Badalova, Julian Sienkiewicz, and Philipp Mayr and published in Scientific Data in 2026 — that connects bioRxiv preprints to their eventual journal publications using automated title-similarity, author-similarity, and DOI matching.
- What is the PreprintToPaper dataset?
- How does bioRxiv link a preprint to its published paper?
- What publication delays does the dataset reveal?
- How much do titles and abstracts change before publication?
- Answer-first Q&A: common preprint-linkage questions
- What are the implications for institutions and publishers?
What is the PreprintToPaper dataset?
PreprintToPaper is a metadata dataset covering 145,517 bioRxiv preprints across two periods: 34,246 preprints from 2016–2018 (pre-pandemic) and 111,271 from 2020–2022 (pandemic era). Records were built by querying the bioRxiv API for preprint metadata and the Crossref API for journal-publication metadata, then linking the two sets algorithmically.
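As a rough illustration of the first step, the sketch below reduces a bioRxiv API response to the fields that matter for linking. The endpoint pattern, the field names (`collection`, `doi`, `version`, `title`, `published`), and the use of "NA" for unlinked preprints are assumptions based on the public bioRxiv API, not details confirmed by the dataset's authors. The sample payload contains placeholder values only.

```python
import json
from urllib.request import urlopen

# Assumed endpoint pattern; check the bioRxiv API documentation before relying on it.
BIORXIV_DETAILS = "https://api.biorxiv.org/details/biorxiv/{doi}"


def parse_details(payload: dict) -> list:
    """Keep only the fields needed to decide whether a preprint is linked."""
    records = []
    for item in payload.get("collection", []):
        published = item.get("published")
        records.append({
            "preprint_doi": item.get("doi"),
            "version": item.get("version"),
            "title": item.get("title"),
            # Assumption: unlinked preprints carry "NA" instead of a journal DOI.
            "published_doi": None if published in (None, "", "NA") else published,
        })
    return records


def fetch_details(doi: str) -> list:
    """Network call; not exercised in the offline example below."""
    with urlopen(BIORXIV_DETAILS.format(doi=doi), timeout=30) as resp:
        return parse_details(json.load(resp))


# Offline example with a hand-made payload (placeholder DOIs, not real records).
sample = {
    "collection": [
        {"doi": "10.1101/placeholder-a", "version": "1",
         "title": "Example preprint A", "published": "NA"},
        {"doi": "10.1101/placeholder-b", "version": "2",
         "title": "Example preprint B", "published": "10.0000/journal.placeholder"},
    ]
}
linked = parse_details(sample)
```

Keeping the parsing separate from the network call makes the linking logic testable without touching the live API.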
The dataset sorts every preprint into one of three categories:
| Category | Definition | Count | Share |
|---|---|---|---|
| Published | Formally linked to a journal article on bioRxiv, with a DOI to the version of record | 90,614 | 62.3% |
| Preprint Only | No matching journal publication identified | 35,813 | 24.6% |
| Gray Zone | Highly likely published, based on title and author matching, but with no DOI link recorded on bioRxiv | 19,090 | 13.1% |
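The shares in the table follow directly from the counts. A quick check, using only the figures quoted above:

```python
# Counts taken from the table above.
counts = {"Published": 90_614, "Preprint Only": 35_813, "Gray Zone": 19_090}
# Period sizes quoted in the text: 2016-2018 and 2020-2022.
periods = {"2016-2018": 34_246, "2020-2022": 111_271}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
print("Total preprints:", total)
```

The three categories and the two periods both sum to the same 145,517 preprints.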
The Gray Zone category is the dataset’s key methodological contribution. Earlier work relied only on bioRxiv’s own DOI links, including Abdill and Blekhman’s 2019 analysis in eLife (available via PubMed Central), which found that 15,797 of 37,648 bioRxiv preprints (42.0%) had been formally linked to a published version. PreprintToPaper shows that, in its sample, a further 13.1% of preprints were very likely published but never picked up by that automatic link.
How does bioRxiv link a preprint to its published paper?
bioRxiv’s own linking mechanism is largely automatic. According to bioRxiv’s official FAQ, the platform “will usually automatically add a link to the published version within approximately two (2) weeks of journal publication,” after which the corresponding author receives a confirmation email.
Matching fails occasionally — usually when the title, author list, or venue changes substantially between versions. bioRxiv advises authors to wait two to three weeks after publication before contacting staff directly if no link appears. PreprintToPaper formalises this same matching logic for research purposes, using:
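- A title-similarity score (via Python’s difflib SequenceMatcher, whose ratio is built from the longest contiguous matching blocks of the two strings rather than a true longest common subsequence) with a 0.75 threshold for a probable match;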
- An author-similarity score and an author-count difference to validate borderline cases;
- Human annotation of 299 borderline records by two independent reviewers, reaching a Cohen’s kappa of 0.86 — a strong agreement level for a manual validation exercise.
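For readers unfamiliar with Cohen’s kappa, the statistic compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is a minimal implementation run on invented toy labels. It does not use the 299 annotated records, which are not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / n ** 2
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): ten borderline records, one disagreement.
reviewer_1 = ["match"] * 5 + ["no match"] * 5
reviewer_2 = ["match"] * 4 + ["no match"] * 6
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this toy case the reviewers agree on 9 of 10 items, chance agreement is 0.5, and kappa comes out at 0.8.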
Apparent non-publications with an author-match score above 0.47 were reclassified into the Gray Zone, which is what allows the dataset to correct for bioRxiv’s own linking gaps rather than simply repeating them.
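The matching logic described above can be sketched in a few lines of Python. Only the two thresholds (0.75 for titles, 0.47 for authors) and the use of SequenceMatcher come from the text. The way the two scores are combined, the surname-overlap measure used for author similarity, and the record field names are assumptions made for illustration, not the authors’ published pipeline.

```python
from difflib import SequenceMatcher
from typing import Optional

TITLE_THRESHOLD = 0.75   # from the text: probable title match
AUTHOR_THRESHOLD = 0.47  # from the text: author-match cut-off for the Gray Zone


def title_similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio on lower-cased, whitespace-normalised titles."""
    def normalise(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def author_similarity(preprint_authors, paper_authors) -> float:
    """Hypothetical stand-in: Jaccard overlap of lower-cased surnames."""
    a = {s.lower() for s in preprint_authors}
    b = {s.lower() for s in paper_authors}
    return len(a & b) / len(a | b) if (a | b) else 0.0


def classify(preprint: dict, candidate: Optional[dict]) -> str:
    """Assign one of the three categories (assumed combination rule)."""
    if preprint.get("published_doi"):
        return "Published"  # bioRxiv already records a journal DOI
    if candidate is None:
        return "Preprint Only"
    title_ok = title_similarity(preprint["title"], candidate["title"]) >= TITLE_THRESHOLD
    authors_ok = author_similarity(preprint["authors"], candidate["authors"]) > AUTHOR_THRESHOLD
    return "Gray Zone" if title_ok and authors_ok else "Preprint Only"


# Invented example records, not taken from the dataset.
preprint = {"title": "Single-cell atlas of the zebrafish heart",
            "authors": ["Lee", "Novak", "Ortiz"], "published_doi": None}
close = {"title": "A single-cell atlas of the zebrafish heart",
         "authors": ["Lee", "Novak", "Ortiz", "Patel"]}
unrelated = {"title": "Soil nitrogen cycling in alpine meadows", "authors": ["Kim"]}

print(classify(preprint, close))      # title and authors both clear their thresholds
print(classify(preprint, unrelated))  # no author overlap
```

Requiring both scores to clear their thresholds is a conservative choice: a retitled paper with the same authors would be missed, which is the kind of borderline case the human annotation step is meant to catch.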
What publication delays does the dataset reveal?
Publication rates were not stable across the study window. PreprintToPaper’s authors report that the confirmed publication rate ranged from 71% for preprints posted in 2016 down to 49% for those posted in 2022 — an apparent decline that is substantially narrowed once Gray Zone cases with an author-match score above 0.47 are counted as published rather than unlinked.
This pattern is consistent with independent findings on preprint-to-publication timing. Earlier tracking studies of bioRxiv preprints reported a pre-pandemic median delay of around 166 days between posting and journal publication, while pandemic-era analyses of COVID-19 preprints found a much shorter median lag, reflecting accelerated peer review for urgent public-health findings. The apparent fall in 2022 publication rates most likely reflects a right-censoring effect — recent preprints simply have not yet had time to complete peer review and appear as “published” in the dataset’s snapshot — rather than a genuine drop in eventual publication.
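The right-censoring effect is easy to reproduce on toy data. The sketch below compares a naive publication rate per posting year with a rate restricted to preprints that have been observed for a full follow-up window. The records, the snapshot date, and the 365-day window are all invented for illustration and are not drawn from PreprintToPaper.

```python
from datetime import date

SNAPSHOT = date(2023, 3, 1)  # hypothetical date on which the data were collected

# Toy records: posting date and journal publication date (None if not yet published).
RECORDS = [
    {"posted": date(2016, 1, 10), "published": date(2016, 7, 1)},
    {"posted": date(2016, 3, 1), "published": date(2017, 9, 1)},
    {"posted": date(2022, 1, 15), "published": date(2022, 6, 15)},
    {"posted": date(2022, 11, 1), "published": None},
    {"posted": date(2022, 12, 1), "published": None},
]


def naive_rate(records, year):
    """Share of a posting-year cohort marked as published at snapshot time."""
    cohort = [r for r in records if r["posted"].year == year]
    return sum(r["published"] is not None for r in cohort) / len(cohort)


def windowed_rate(records, year, window_days=365, snapshot=SNAPSHOT):
    """Share published within window_days, among preprints observed at least that long."""
    cohort = [r for r in records
              if r["posted"].year == year
              and (snapshot - r["posted"]).days >= window_days]
    if not cohort:
        return None  # nobody in this cohort has had a full window yet
    hits = sum(r["published"] is not None
               and (r["published"] - r["posted"]).days <= window_days
               for r in cohort)
    return hits / len(cohort)


for year in (2016, 2022):
    print(year, naive_rate(RECORDS, year), windowed_rate(RECORDS, year))
```

In this toy sample the naive rate makes 2022 look far worse than 2016, while the windowed comparison, which gives every preprint the same time to be published, does not.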
How much do titles and abstracts change before publication?
PreprintToPaper stores both the initial submitted metadata and the final published metadata for each linked record — title, abstract, author list, journal name, and publication date — explicitly to support research on linguistic and structural change between preprint and published versions, including title reformulations and author-order shifts.
This matters because bioRxiv’s own FAQ already flags a related, more mundane source of variation: the manuscript title, author list, and abstract are initially supplied by the author at submission, then replaced with metadata extracted from the PDF once full-text HTML is generated. Small differences can therefore appear before any journal sees the paper. Distinguishing that housekeeping-level drift from substantive, peer-review-driven revision is the analytical opportunity the dataset’s version-history subset opens up, and it is why the dataset’s authors built author-count difference and title similarity as first-class, machine-readable fields rather than leaving them buried in free text.
Answer-first Q&A: common preprint-linkage questions
How do I link a preprint to a published paper?
For bioRxiv preprints, no manual action is normally required: bioRxiv’s system detects the journal publication and adds the link automatically, typically within two weeks of publication. If no link appears after two to three weeks, authors should contact bioRxiv staff directly so the match can be verified and added manually.
Does bioRxiv count as published?
No. A bioRxiv preprint is not peer-reviewed, edited, or certified by a journal, so it does not count as a formal publication. It is, however, a citable, DOI-bearing scholarly record that is indexed by Crossref, Google Scholar, Semantic Scholar, and Europe PMC, and NIH explicitly encourages citing preprints as interim research products.
Can I cite a preprint in my paper?
Yes. bioRxiv preprints should be cited by their DOI, in the format “Author AN, Author BT. Year. Title. bioRxiv doi: 10.1101/…”. If citing a specific revision, the version-specific URL should be added, since each preprint version remains permanently accessible under the same DOI.
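A small helper can assemble a citation in the format quoted above. The function name, its parameters, and the way an optional version-specific URL is appended are illustrative assumptions. The DOI in the example is a placeholder, not a real record.

```python
def format_biorxiv_citation(authors, year, title, doi, version_url=None):
    """Build 'Author AN, Author BT. Year. Title. bioRxiv doi: 10.1101/...'."""
    citation = f"{', '.join(authors)}. {year}. {title.rstrip('.')}. bioRxiv doi: {doi}"
    if version_url:
        # Assumption: the version-specific URL is simply appended after the DOI.
        citation += f" {version_url}"
    return citation


example = format_biorxiv_citation(
    ["Author AN", "Author BT"], 2021, "Example title", "10.1101/XXXXXX"
)
print(example)
```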
How do I update bioRxiv with a published paper if the automatic link fails?
Authors should first wait two to three weeks past journal publication, since matching runs on a delay. If the link still has not appeared, the corresponding author should email bioRxiv staff or leave a comment on the preprint page; bioRxiv states it will verify all such requests before manually linking the record.
What are the implications for institutions and publishers?
For research administrators tracking outputs, PreprintToPaper’s Gray Zone category is a practical warning: relying solely on bioRxiv’s own “published” flag will undercount real publication rates by roughly 13 percentage points in this sample. Institutional repositories and research-information systems that harvest bioRxiv metadata directly should therefore treat unlinked-but-matched preprints as a distinct, reviewable category rather than as simply unpublished.
For publishers and editors, the dataset’s version-history subset offers a reusable framework for auditing how much a manuscript’s core claims shift between preprint and version of record — separating genuine post-review revision from routine metadata clean-up. That distinction is directly relevant to authorship practice, where author-order and contributor-list changes between preprint and publication are common but rarely tracked systematically, and to broader definitional work maintained in the CASRAI Dictionary of scholarly-communication terms.
The dataset itself, along with its code, is openly deposited on Zenodo, giving any institution the means to replicate or extend the analysis against its own output list rather than treating bioRxiv’s publication status as a black box.