Tag: preprint

  • Preprints and peer review: how the version of record fits together

    A single piece of research now commonly exists in three or four forms at the same time: a preprint posted before review, an accepted manuscript that has passed review but not yet been typeset, and the final published version of record — sometimes with a later corrected or updated version on top. Readers, and even authors, routinely confuse them, and citing the wrong one can misrepresent what was actually validated. This article sets out what each version is and how peer review sits between them. It builds on the broader taxonomy in the research-outputs domain and pairs with the side-by-side explainer at preprint versus published article.

    The versions, in order

    The preprint

    A preprint is a complete research manuscript posted to a public server (arXiv, bioRxiv, medRxiv, SSRN, and many others) before, or in parallel with, formal peer review. Its defining features are speed and openness: it makes findings available immediately and citable via a persistent identifier, usually a DOI, without waiting for a journal’s review cycle. Its defining limitation is the other side of that coin: a preprint has not been through independent peer review, so its claims have not been externally vetted. A preprint is a legitimate, citable output, not a lesser draft, but it carries a different epistemic status from a reviewed article, and that status must be made clear wherever it is used.

    Peer review: the step between

    Between the preprint and the published article sits peer review: independent evaluation by qualified reviewers, organised by a journal editor, which may end in acceptance, rejection, or (most often) a request for revision. Peer review is not a guarantee of correctness; it is a quality-control and improvement process. What it changes is the manuscript’s standing: a reviewed and accepted article carries the journal’s editorial endorsement that the work met its standards, which a preprint does not. This is the key to the whole picture: the versions differ mainly in what has happened to them, and peer review is the event that separates the unreviewed preprint from the reviewed and accepted article.

    The accepted manuscript (postprint)

    Once peer review concludes and the journal accepts the paper, the author’s final reviewed-and-revised file is the accepted manuscript, often called the postprint or author-accepted manuscript (AAM). It contains the intellectual content that passed review but lacks the publisher’s copy-editing, typesetting, and final pagination. The accepted manuscript is the version most commonly self-archived in institutional repositories under green open access, frequently after an embargo. It is content-equivalent to the published article in its claims, but it is not the formatted, authoritative object that the scholarly record points to.

    The version of record

    The version of record (VoR) is the final, published, formally citable version: copy-edited, typeset, paginated, assigned its DOI, and lodged with the publisher as the authoritative instance of the work. It is the version the scholarly record points to, the one that carries any later corrections or retractions, and the one that should normally be cited. The concept of a version of record exists precisely so that, among several coexisting forms, there is one designated authoritative object that the record and its corrections attach to.

    How they fit together

    The clean way to hold this in mind is as a sequence of states of one work:

    1. Preprint — complete, public, citable, not peer-reviewed.
    2. (Peer review happens.)
    3. Accepted manuscript / postprint — peer-reviewed content, not yet publisher-formatted; the usual green-OA archive copy.
    4. Version of record — the final, formatted, authoritative, citable version.

    Crucially, these can all exist simultaneously and should link to one another. A well-managed preprint server displays a link from the preprint to the published version of record once it appears; the version of record, in turn, may acknowledge the preprint. Persistent identifiers are what make this linkage reliable: the preprint and the VoR each have their own DOI, and the relationship between them is recorded in metadata so that a reader arriving at one can find the other.
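
    As a rough illustration of the linkage described above, the sketch below records the preprint-to-published relationship on both records, so that a reader landing on either DOI can reach the other. The relation labels, the VersionRecord class, the helper names, and the DOIs are all hypothetical, chosen for illustration; real registries define their own controlled relation terms and metadata schemas.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative relation labels (assumption): not quoted from any registry's schema.
IS_PREPRINT_OF = "is-preprint-of"
HAS_PREPRINT = "has-preprint"
INVERSE = {IS_PREPRINT_OF: HAS_PREPRINT, HAS_PREPRINT: IS_PREPRINT_OF}


@dataclass
class VersionRecord:
    """Minimal metadata record for one version of a work (hypothetical shape)."""
    doi: str
    stage: str  # "preprint" or "version-of-record"
    relations: Dict[str, List[str]] = field(default_factory=dict)


def link(registry: Dict[str, VersionRecord], source: str, relation: str, target: str) -> None:
    """Record the relation on the source and its inverse on the target."""
    registry[source].relations.setdefault(relation, []).append(target)
    registry[target].relations.setdefault(INVERSE[relation], []).append(source)


def other_versions(registry: Dict[str, VersionRecord], doi: str) -> List[str]:
    """Return the DOIs reachable from the record a reader arrived at."""
    record = registry[doi]
    return sorted(t for targets in record.relations.values() for t in targets)


# Made-up DOIs, for illustration only.
registry = {
    "10.5555/preprint.0001": VersionRecord("10.5555/preprint.0001", "preprint"),
    "10.5555/journal.0042": VersionRecord("10.5555/journal.0042", "version-of-record"),
}
link(registry, "10.5555/preprint.0001", IS_PREPRINT_OF, "10.5555/journal.0042")
```

    After the single link call, other_versions returns the journal DOI when asked about the preprint and the preprint DOI when asked about the journal article. Storing the inverse relation at write time is one design choice among several; a system could equally derive it on read.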

    Which version to cite

    • Cite the version of record where it exists. It is the authoritative, corrected, formally published instance, and citing it ensures your reference points to what was validated and to any subsequent corrections.
    • Cite the preprint as a preprint when that is genuinely what you used — for example a result not yet published elsewhere — and label it clearly as a preprint, with its DOI, so a reader knows it has not been peer-reviewed.
    • Do not cite a preprint as though it were the published article. If a version of record now exists, prefer it; the preprint and the final version can differ in their conclusions after revision.
    • Check for a newer version. Preprints are often updated; the VoR may carry corrections. Cite the specific version you relied on, and prefer the most authoritative current one.
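
    The four rules above can be approximated as a small decision function. This is a simplified sketch, not a citation standard: the Stage and Version types, the choose_citation name, the label strings, and the identifiers are assumptions made for illustration, and the function deliberately ignores cases that need human judgement (for example, a result that appears only in the preprint and was dropped during revision).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple


class Stage(IntEnum):
    """Stages ordered from least to most authoritative."""
    PREPRINT = 1
    ACCEPTED_MANUSCRIPT = 2
    VERSION_OF_RECORD = 3


@dataclass(frozen=True)
class Version:
    identifier: str    # DOI or other persistent identifier (made up below)
    stage: Stage
    revision: int = 1  # preprint v1, v2, ... or a corrected version of record


def choose_citation(available: List[Version],
                    relied_on: Optional[Version] = None) -> Tuple[Version, str]:
    """Return the version to cite and the label it should be cited under."""
    if not available:
        raise ValueError("no versions to choose from")

    # Rules 1 and 3: if a version of record exists, prefer it (latest revision).
    records = [v for v in available if v.stage is Stage.VERSION_OF_RECORD]
    if records:
        return max(records, key=lambda v: v.revision), "published article"

    # Rules 2 and 4: otherwise cite the specific version relied on, or the
    # most authoritative and most recent one, and label it for what it is.
    if relied_on is not None:
        chosen = relied_on
    else:
        chosen = max(available, key=lambda v: (v.stage, v.revision))
    if chosen.stage is Stage.PREPRINT:
        return chosen, "preprint (not peer-reviewed)"
    return chosen, "accepted manuscript"


v1 = Version("10.5555/preprint.0001v1", Stage.PREPRINT, revision=1)
v2 = Version("10.5555/preprint.0001v2", Stage.PREPRINT, revision=2)
vor = Version("10.5555/journal.0042", Stage.VERSION_OF_RECORD)
```

    With these example values, choose_citation([v1, v2, vor], relied_on=v1) selects vor, while choose_citation([v1, v2], relied_on=v1) returns v1 labelled as a preprint.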

    A note on what preprints do and do not change

    Preprints have made research faster and more open, and they are now a first-class part of the scholarly record rather than a fringe practice. But they do not replace peer review or the version of record; they sit before them. The healthiest reading of the current landscape is not preprint versus journal but a pipeline in which the same work moves from open-but-unreviewed to reviewed-and-authoritative, with each stage clearly labelled and linked. Confusion arises only when the labels are dropped — when a preprint is presented, or cited, as if it had the standing of the version of record.

    Where shared vocabulary fits

    “Preprint”, “postprint”, “accepted manuscript”, and “version of record” are used inconsistently — and sometimes interchangeably — across servers, repositories, and citation styles, which is exactly how the wrong version ends up cited. A shared, federated vocabulary that defines these versions precisely and records the relationships between them is what lets a citation point unambiguously to the right object. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain.

  • Beyond the article: a modern taxonomy of research outputs

    For most of the history of the scholarly record, “research output” meant one thing: the peer-reviewed journal article, with the book a distant second. Everything else — the data, the code, the protocols, the negative results — was apparatus or supplement, uncounted and largely uncredited. That assumption has broken down, and a modern outputs taxonomy has to reflect a far wider range of things that researchers produce, each deserving its own place, its own identifier, and its own recognition in assessment. This article surveys that expanded taxonomy, drawing on the research-outputs domain.

    Why the article-only model failed

    The article-centric model failed for a simple reason: the article is no longer where much of the value lives. A reproducible computational study’s value is as much in its code and data as in its prose. A widely reused dataset can influence a field more than the paper that introduced it. A protocol followed by hundreds of labs is a contribution in its own right. Treating all of these as mere supplements to an article misallocates credit, loses the artefacts that actually get reused, and gives assessment a distorted picture of what a researcher contributed. The expansion of the outputs taxonomy is not taxonomic enthusiasm; it is a correction.

    The expanded output types

    A modern taxonomy organises a wide range of outputs. Several stand out as having reshaped the landscape.

    • The preprint — a manuscript posted to a public server before or during formal peer review — is now a first-class output, not a second-class draft. It establishes priority, accelerates dissemination, and carries its own DOI. The relationship between a preprint and its eventual published version is itself metadata worth recording.
    • The dataset — a collection of research data with a DataCite DOI — is the output whose recognition has changed most. Data citation is now expected practice, and a well-curated, documented dataset is a citable contribution that can be credited and assessed.
    • Research software — software produced for or as part of research, with a stable identifier such as a Software Heritage ID or a DataCite DOI — is increasingly recognised as a research output, with its own citation conventions and its own (imperfect) fit to contributorship taxonomies.
    • The trained model — an AI/ML model released as a research output, typically documented with a model card — is the newest major addition, reflecting the rise of machine-learning research that produces models and datasets rather than only papers.
    • The registered report — published in two stages, with the protocol peer-reviewed and accepted before data collection — is a structural innovation in how an output is produced, designed to guard against publication bias by committing to publish regardless of outcome.

    Beyond these, the taxonomy reaches further: protocols with DOIs (as minted on platforms like protocols.io), negative-results reports, systematic reviews and their living variants, policy briefs, standards contributions, patents, clinical-trial registrations, theses, conference papers, and the practice-based outputs of the arts. The breadth is the point: research produces many kinds of thing, and a taxonomy that names only one of them is misleading by omission.

    Two structural requirements: identifiers and relationships

    An outputs taxonomy is only useful if its entries can be reliably identified and related. Two requirements follow.

    The first is stable identifiers for every output type, not just articles. A dataset needs a DOI, software needs a SWHID (Software Heritage persistent identifier) or a DOI, a sample referenced by an output needs an IGSN, a project that produced the outputs needs a RAiD, and the people and institutions involved need ORCID iDs and ROR IDs respectively. Without identifiers, the expanded taxonomy is just a longer list of things that cannot be cited or counted reliably. With them, every output type becomes a first-class, citable, assessable entity.
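
    One way to make this requirement checkable is a lookup from entity kind to the identifier schemes the paragraph names, plus a pass that flags records with none of them. The sketch below assumes a hypothetical record shape (plain dictionaries with name, kind, and identifiers keys) and made-up identifier values; it checks only that a scheme is present, not that the identifier is well-formed or resolves.

```python
# Mapping taken from the paragraph above; the keys and record shape are hypothetical.
EXPECTED_SCHEMES = {
    "dataset": {"DOI"},
    "software": {"SWHID", "DOI"},
    "sample": {"IGSN"},
    "project": {"RAiD"},
    "person": {"ORCID"},
    "organisation": {"ROR"},
}


def missing_identifiers(records):
    """Return (name, kind) for each record with no identifier in an expected scheme."""
    problems = []
    for record in records:
        expected = EXPECTED_SCHEMES.get(record["kind"], set())
        present = set(record.get("identifiers", {}))
        # Kinds not listed in EXPECTED_SCHEMES are skipped rather than flagged.
        if expected and not (expected & present):
            problems.append((record["name"], record["kind"]))
    return problems


# Made-up records, for illustration only.
records = [
    {"name": "survey data", "kind": "dataset",
     "identifiers": {"DOI": "10.5555/data.0007"}},
    {"name": "analysis pipeline", "kind": "software", "identifiers": {}},
    {"name": "field campaign", "kind": "project"},
]
```

    Here missing_identifiers(records) flags the analysis pipeline and the field campaign, because neither carries an identifier in any expected scheme, and leaves the dataset alone.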

    The second is explicit relationships between outputs of different types, both parent-child and looser “related to” links. A registered report’s stage-1 protocol and stage-2 article are related; a preprint and its published version are related; a dataset and the software that processed it are related; a systematic review and the studies it synthesises are related. A taxonomy that captures these relationships lets automated systems and current research information system (CRIS) platforms reason over outputs, for example by grouping a project’s preprint, dataset, and software as facets of one contribution rather than three unconnected records.
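
    The grouping described above can be sketched as finding connected components in a graph whose nodes are outputs and whose edges are recorded relationships. The output identifiers, relation labels, and the group_related helper below are hypothetical; a real system would also need to decide which relation types should count as "same contribution" (a systematic review and the studies it synthesises, for instance, probably should not be merged into one group).

```python
from collections import defaultdict


def group_related(outputs, relations):
    """Group output ids into connected components of the relationship graph."""
    neighbours = defaultdict(set)
    for source, _relation, target in relations:
        # Treat every relation as undirected for grouping purposes.
        neighbours[source].add(target)
        neighbours[target].add(source)

    groups, seen = [], set()
    for start in outputs:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from this output.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Made-up records and relation labels, for illustration only.
outputs = ["preprint-1", "article-1", "dataset-1", "software-1", "review-9"]
relations = [
    ("preprint-1", "is-preprint-of", "article-1"),
    ("dataset-1", "is-supplement-to", "article-1"),
    ("software-1", "processes", "dataset-1"),
]
groups = group_related(outputs, relations)
```

    With this input, the preprint, article, dataset, and software end up in one group and the unrelated review in a group of its own.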

    Why this matters for assessment

    The expanded taxonomy connects directly to responsible research assessment. Narrative-CV formats explicitly invite researchers to describe contributions beyond publications — the datasets, the software, the open-science work. But for an assessor to take a dataset or a model seriously, it has to be a recognised, identifiable output type, not an undifferentiated “other.” A modern outputs taxonomy is the precondition for assessment that values what researchers actually produce. Naming a model, a dataset, or a protocol as a first-class output is what lets it be claimed on a CV and weighed by a panel.

    A caution against type proliferation

    A taxonomy can fail in two directions. The old failure was too few types: everything that was not an article was invisible. The opposite failure is too many: a sprawling list of hyper-specific types that no two systems classify the same way, so that exchange becomes impossible and the taxonomy collapses under its own weight. The discipline a good taxonomy needs is to enumerate only the types that genuinely behave differently (different identifiers, different lifecycles, different assessment treatment) and to use relationships rather than ever-finer types to capture the rest. The goal is a taxonomy that classifiers and CRIS platforms can apply consistently, which means stable, well-bounded types with clean relationships, not an open-ended catalogue.

    Where the dictionary fits

    Several stewards already maintain output-type vocabularies — COAR Resource Types, the Crossref and DataCite output types, the categories used by national assessment exercises. The need is not another competing list but an integrative, operational reference that defines each type clearly, federates to those stewards, and makes the relationships between types explicit. Providing that — so that a “dataset” or a “registered report” means the same thing across systems — is the convening role the CASRAI dictionary is designed for.

    What to do now

    For researchers: mint identifiers for all your outputs, not only your papers, and record the relationships between them. For institutions and CRIS owners: support the full range of output types as first-class records with clean relationships, federating your type list to an established vocabulary rather than inventing one. For assessment: recognise the expanded taxonomy, so that the dataset, the model, and the protocol can be claimed and weighed alongside the article.
