scgpt biorxiv – CASRAI Dictionary

scGPT bioRxiv preprints, alongside ESM3, AlphaFold-Multimer, Geneformer, EvolvePro and Chai-2, illustrate a 2026 pattern: AI foundation models for biology now reach bioRxiv months or years before — and sometimes instead of — formal peer review, shifting scrutiny onto the research community itself.

A foundation model in biology is a large neural network pretrained on a broad corpus of sequence, structure or single-cell data, then fine-tuned for specific downstream tasks such as cell-type annotation, protein design or complex-structure prediction. bioRxiv is the open-access preprint server, now operated by the nonprofit openRxiv, where most of these models first appear.

What is the bioRxiv wave of AI biology preprints?
Which models are driving this trend?
Why publish before peer review?
What are the research-integrity and attribution risks?
Answer-first Q&A
Implications for institutions and publishers

What is the bioRxiv wave of AI biology preprints?

Since 2021, a cluster of high-profile AI foundation models for biology has appeared first as bioRxiv preprints rather than journal articles. scGPT, ESM3, AlphaFold-Multimer, EvolvePro and Chai-2 each described their methods, training corpora and benchmark results on bioRxiv before, or without, completing formal peer review. Geneformer, whose core paper debuted in Nature, is the exception among the six models covered here.

This is not unique to biology, but the scale is notable. bioRxiv’s bioinformatics collection alone now holds over 42,000 preprints, and many of the field’s most-cited foundation-model papers spent a year or more circulating in preprint form before any journal version existed.

Which models are driving this trend?

Each model targets a different layer of biology, from single cells to protein complexes. Five of the six appeared on bioRxiv before any journal version existed, with varying paths to formal review; Geneformer went straight to a journal.

Model	Domain	bioRxiv preprint date	Peer-review status	Headline result
scGPT	Single-cell multi-omics	1 May 2023	Nature Methods, 2024	Pretrained on over 10 million cells; preprint drew 1,490+ citations before formal publication
ESM3	Protein sequence/structure/function	2 July 2024	Science, January 2025	Generated esmGFP, a novel fluorescent protein only 58% identical to its nearest known relative
AlphaFold-Multimer	Protein complex structure	4 October 2021	Still bioRxiv-only	67% success rate on heteromeric interfaces; used routinely across structural biology
Geneformer	Single-cell network biology	No precursor preprint; v2 update posted August 2024	Nature, 31 May 2023	Pretrained on Genecorpus-30M, 29.9 million single-cell transcriptomes
EvolvePro	Protein engineering	17 July 2024	Still bioRxiv-only	2- to 515-fold activity gains across five therapeutic proteins
Chai-2	Antibody and miniprotein design	6 July 2025	Still bioRxiv-only	16% hit rate in de novo antibody design, over 100x prior computational methods

Two points stand out. First, Geneformer’s core 2023 paper went directly to Nature without a bioRxiv precursor, showing that the preprint-first route is not universal. Second, AlphaFold-Multimer, EvolvePro and Chai-2 remain, as of mid-2026, without any confirmed journal record despite being cited and deployed across thousands of downstream studies.

Why publish before peer review?

Competitive priority and speed dominate. Posting to bioRxiv creates a timestamped, public record of a result the moment it exists, which matters in a field where multiple labs often chase the same architecture within weeks of each other. Preprint-first posting offers three practical advantages:

Immediate community stress-testing of code, weights and benchmark claims, often faster than a journal’s reviewer pool can respond.
Priority establishment ahead of competing labs working on the same problem class.
Faster onward use: downstream researchers can build on and cite a preprint immediately rather than waiting through a multi-month review cycle.

Journals have adapted to this reality. Many now accept submissions that have already been posted to bioRxiv. scGPT’s own trajectory, a 2023 preprint that drew over 1,490 citations before its 2024 Nature Methods publication, shows how much scientific traffic a foundation model can carry while still formally unreviewed.

What are the research-integrity and attribution risks?

The lack of independent review before wide reuse is the core risk. A 2026 bioRxiv preprint on researcher perceptions found that scientists rely heavily on author reputation, rather than review status, as their main heuristic for judging a preprint’s credibility — a fragile substitute for structured peer review, particularly for tools other labs adopt wholesale.

Attribution is a related but distinct problem. When a foundation model like Chai-2 or ESM3 generates a candidate sequence that a human team then validates experimentally, contributor-credit questions arise: who conceived the method, who ran validation, and who is accountable for the claim. Both the International Committee of Medical Journal Editors and the Committee on Publication Ethics have stated that AI tools cannot be listed as authors, because they cannot take responsibility for the work’s accuracy or integrity.

Structured contributor-role frameworks help resolve this. CASRAI originated the CRediT contributor role taxonomy in 2014, and the standard is now stewarded by NISO as ANSI/NISO Z39.104-2022. Applying CRediT roles to preprint co-authorship — distinguishing methodology, software, validation and formal analysis — gives institutions a documented way to assign human accountability even when an AI foundation model contributed materially to the output. See the broader CRediT framework overview and CASRAI’s authorship resources for related guidance.
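As a rough illustration of how CRediT roles could be recorded and checked for a preprint, the Python sketch below models each human contributor with a set of roles and reports which accountability roles nobody has claimed. The 14 role names come from the CRediT taxonomy; the `Contributor` class, the `uncovered_roles` helper, the `REQUIRED_ROLES` policy and the example names are hypothetical and not part of any CASRAI or NISO tooling.

```python
from dataclasses import dataclass, field

# The 14 contributor roles defined by the CRediT taxonomy.
CREDIT_ROLES = frozenset({
    "Conceptualization", "Data curation", "Formal analysis",
    "Funding acquisition", "Investigation", "Methodology",
    "Project administration", "Resources", "Software", "Supervision",
    "Validation", "Visualization", "Writing – original draft",
    "Writing – review & editing",
})

# Assumed institutional policy: the four roles singled out in the text
# must each be held by at least one human contributor.
REQUIRED_ROLES = frozenset(
    {"Methodology", "Software", "Validation", "Formal analysis"}
)


@dataclass
class Contributor:
    """A human contributor and the CRediT roles they take responsibility for."""
    name: str
    roles: set = field(default_factory=set)

    def __post_init__(self):
        # Reject anything that is not a recognised CRediT role.
        unknown = set(self.roles) - CREDIT_ROLES
        if unknown:
            raise ValueError(f"Not CRediT roles: {sorted(unknown)}")


def uncovered_roles(contributors, required=REQUIRED_ROLES):
    """Return the required roles that no listed contributor has claimed."""
    covered = set()
    for person in contributors:
        covered |= set(person.roles)
    return sorted(set(required) - covered)


# Hypothetical preprint team; the AI model itself is never a contributor.
team = [
    Contributor("A. Researcher", {"Methodology", "Software"}),
    Contributor("B. Scientist", {"Validation", "Investigation"}),
]
print(uncovered_roles(team))  # ['Formal analysis']
```

A non-empty result signals that no named person has yet accepted responsibility for that part of the work, which is the gap a research-integrity office would want closed before the preprint is relied upon.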

Answer-first Q&A

Has the scGPT bioRxiv preprint been peer reviewed?

Yes. The original scGPT preprint was posted to bioRxiv on 1 May 2023 and was later peer reviewed and published in Nature Methods in 2024. By then the preprint had already drawn more than 1,490 citations while still formally unreviewed.

Why do AI foundation models for biology publish on bioRxiv before peer review?

Competitive pressure and pace drive it. Posting to bioRxiv establishes priority and lets the wider research community stress-test claims, code and weights immediately, rather than waiting the months or years a formal peer-review cycle can take in a fast-moving field.

Is AlphaFold-Multimer peer reviewed?

No confirmed journal record exists for AlphaFold-Multimer itself; DeepMind’s preprint has remained on bioRxiv since 4 October 2021. It is nonetheless used routinely across structural biology — a stark example of a foundational tool that never completed formal peer review.

Who owns bioRxiv?

bioRxiv is operated by openRxiv, an independent nonprofit that assumed ownership from Cold Spring Harbor Laboratory in March 2025. The transfer aimed to secure the preprint server’s long-term governance as its role in disseminating AI foundation model research has grown.

Implications for institutions and publishers

Research offices and publishers now need explicit policy on how preprinted AI foundation models are cited, credited and re-used before formal review completes. Institutional research-integrity offices should treat a bioRxiv-only model — such as AlphaFold-Multimer, EvolvePro or Chai-2 — as provisionally validated, not settled science, when it underpins funded work or clinical-adjacent claims.

Research administrators managing grant compliance and output tracking should build preprint-status checks into their reporting workflows; CASRAI’s research administration resources outline how contributor-role and output-tracking practices adapt to fast-moving, preprint-first fields. As more foundation models follow this path, the distinction between “published” and “peer reviewed” will matter more, not less, for research integrity.
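One way such a preprint-status check might look is sketched below in Python. The `ResearchOutput` record, its field names and the status labels are assumptions made for illustration, not an existing CASRAI schema; the sample entries reuse the status information from the model table earlier in this article. Treating any journal record as "peer reviewed" is a simplification that a real workflow would need to verify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResearchOutput:
    """Hypothetical record for one cited model or paper."""
    title: str
    preprint_server: Optional[str]  # e.g. "bioRxiv", or None
    journal: Optional[str]          # journal of record, or None


def review_status(output: ResearchOutput) -> str:
    """Classify an output by the strongest publication record it has."""
    if output.journal:
        # Simplifying assumption: a journal record implies peer review.
        return "peer reviewed"
    if output.preprint_server:
        return "preprint only"
    return "unknown"


def flag_preprint_only(outputs):
    """Return titles that a grant report should mark as not yet peer reviewed."""
    return [o.title for o in outputs if review_status(o) == "preprint only"]


# Sample entries taken from the model table in this article.
cited = [
    ResearchOutput("scGPT", "bioRxiv", "Nature Methods"),
    ResearchOutput("AlphaFold-Multimer", "bioRxiv", None),
    ResearchOutput("Geneformer", None, "Nature"),
]
print(flag_preprint_only(cited))  # ['AlphaFold-Multimer']
```

Keeping the classification in a single function means the rule for what counts as reviewed can be tightened later without changing how reports consume the result.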
