Tag: text and data mining

  • PMC Open Access Subset vs Plan S: Not the Same

    The PMC Open Access Subset and Plan S are not the same thing. The PMC Open Access Subset is a licensing classification inside PubMed Central (PMC) that flags which archived articles carry reuse-permitting licences for text mining and redistribution. Plan S is a funder mandate from cOAlition S that requires immediate open access publication of funded research. One is a repository filter; the other is a compliance requirement — and confusing them leads authors to think a PMC listing satisfies a funder’s open access policy when it may not.

    The difference between a repository classification and a funder mandate is the source of a persistent mix-up among authors preparing to comply with funder open access requirements.

    What Is the PMC Open Access Subset?

    The PMC Open Access Subset is maintained by the U.S. National Library of Medicine (NLM), part of the National Institutes of Health (NIH). It contains articles and preprints made available under machine-readable licences — Creative Commons or similar — that permit reuse beyond simple reading access.

    NLM groups the subset into three licence tiers:

    • Commercial Use Allowed — CC0, CC BY, CC BY-SA, CC BY-ND licences
    • Non-Commercial Use Only — CC BY-NC, CC BY-NC-SA, CC BY-NC-ND licences
    • Other — no machine-readable licence, no licence, or a custom licence, with restricted redistribution on the PMC Cloud Service
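    The three tiers above amount to a lookup from licence identifier to reuse group. The following is a minimal sketch, assuming licences arrive as short strings such as "CC BY-NC"; the function name `pmc_oa_tier` and the normalisation step are illustrative and are not part of any NLM tool.

```python
# Tier membership as listed in the article; everything else falls into "Other".
COMMERCIAL = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}
NON_COMMERCIAL = {"CC BY-NC", "CC BY-NC-SA", "CC BY-NC-ND"}


def pmc_oa_tier(licence):
    """Return the tier name for a licence string (hypothetical helper)."""
    if licence is None:
        return "Other"
    # Normalise case and collapse repeated whitespace before the lookup.
    key = " ".join(licence.strip().upper().split())
    if key in COMMERCIAL:
        return "Commercial Use Allowed"
    if key in NON_COMMERCIAL:
        return "Non-Commercial Use Only"
    return "Other"


if __name__ == "__main__":
    for example in ("CC BY", "cc by-nc-nd", None, "publisher custom licence"):
        print(example, "->", pmc_oa_tier(example))
```

    Anything that does not match a listed licence, including a missing licence, is treated as "Other", which mirrors the catch-all third tier.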

    As of the NIH’s most recent update, the subset spans well over 3.4 million journal articles and preprints, retrievable via the PMC FTP Service, Cloud Service, OAI-PMH Service, or BioC API. Not every article in PMC belongs to the Open Access Subset — many PMC-hosted articles remain under standard copyright and are excluded from bulk text-mining retrieval.

    This is a critical, frequently missed distinction: PMC itself (the archive) and the NIH Public Access Policy (which mandates deposit of NIH-funded manuscripts into PMC) are separate from the Open Access Subset (the licensing classification). An article can be freely readable in PMC under the Public Access Policy while still sitting outside the Open Access Subset, because it lacks a reuse-permitting licence.

    What Is Plan S?

    Plan S is a funder-driven open access initiative launched in September 2018 by cOAlition S, a coalition of national and international research funders including UKRI and Wellcome, and supported by the European Commission through its Horizon Europe programme. It requires that peer-reviewed publications arising from funded research be made immediately and fully open access, with no embargo period.

    Under Plan S principles, compliant publication routes include:

    • Publishing in a fully open access journal or platform
    • Publishing in a subscription journal while depositing the accepted manuscript in an open access repository immediately on publication (the “Rights Retention Strategy”)
    • Publishing on an open access platform or in a repository that meets cOAlition S technical requirements

    cOAlition S states that authors or their institutions should retain copyright, and that a Creative Commons Attribution (CC BY) licence is the preferred licence type. Compliance is assessed against funder-specific policy terms, not against any single repository’s inclusion criteria.

    PMC Open Access Subset vs Plan S: Key Differences

    The clearest way to separate these two is by function: a repository classification versus a funder policy. The table below sets this alongside a third commonly conflated mechanism — the United States’ federal public access requirement — since UK and international researchers frequently encounter all three in the same compliance conversation.

    | Feature | PMC Open Access Subset | Plan S | US federal public access mandate |
    |---|---|---|---|
    | Nature | Repository licensing classification | Funder policy mandate | Federal agency policy (via OSTP) |
    | Governing body | National Library of Medicine (NIH) | cOAlition S funders | Office of Science and Technology Policy (OSTP) |
    | What it governs | Reuse rights of archived articles | Where/how funded research is published | Timing of public access to federally funded research |
    | Embargo position | Not applicable: licence-based, not time-based | Zero embargo required from 2021 | Zero embargo required by 31 December 2025 (OSTP’s 2022 Nelson Memo) |
    | Geographic scope | Global archive, US-hosted | Primarily European and international funders | United States federal agencies |
    | Enforcement mechanism | None: it is a content filter, not a compliance check | Funder grant conditions | Agency public access plans |

    The overlap that causes confusion: research funded under Plan S can end up in the PMC Open Access Subset if it carries a qualifying licence, but Plan S compliance is judged by the funder against its own policy terms, not by whether NLM has classified the article into the subset.

    Does Plan S Compliance Require the PMC Open Access Subset?

    No. Plan S does not name the PMC Open Access Subset as a compliance route. cOAlition S funders accept publication in a compliant journal, an institutional or subject repository meeting technical requirements, or immediate deposit of the accepted manuscript under an approved licence. PMC is one possible repository destination for biomedical research, but Plan S compliance is assessed by licence terms and embargo length, not by NLM’s internal subset classification.

    Authors publishing biomedical research funded by a cOAlition S member should check the funder’s own open access policy and, separately, confirm whether their institution or publisher will additionally deposit the manuscript into PMC. These are two distinct actions that happen to intersect for US-relevant biomedical literature, not one unified process.

    Common Questions

    What is PMC open access?

    PMC open access refers to the PMC Open Access Subset, the portion of PubMed Central archived under licences — typically Creative Commons — that permit reuse, including text mining and redistribution. It is not a funder policy; it is a licensing classification applied to specific articles already deposited in PMC.

    Are PMC and PubMed the same?

    No. PubMed is a database of citations and abstracts, while PMC (PubMed Central) is a full-text archive of biomedical journal articles. Both are maintained by the National Library of Medicine, but PubMed indexes metadata, whereas PMC stores the complete article text, of which only a subset carries reuse licences.

    Is PMC free to use?

    Yes, reading PMC articles is free. However, reuse rights differ by article: NLM states that PMC provides long-term preservation and free reading access, but text mining or redistribution beyond fair use requires the article to carry a qualifying licence within the Open Access Subset — free-to-read is not the same as free-to-reuse.

    Implications for Authors and Institutions

    For authors, the practical takeaway is clear: satisfying a funder’s Plan S obligation and appearing in the PMC Open Access Subset are two separate matters, and only the first is a compliance check. Achieving one does not automatically achieve the other. Institutional research administration teams tracking funder compliance should verify licence type, embargo length, and deposit location independently for each requirement, rather than treating “it’s in PMC” as proof of open access mandate compliance.
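    The three independent checks named above (licence type, embargo length, deposit location) can be sketched as separate tests that never consult PMC subset membership. This is an illustration only: the record fields, the function name, and the example policy values are assumptions, and real funder policies contain more conditions than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleRecord:
    licence: Optional[str]      # e.g. "CC BY"; None if no licence is attached
    embargo_months: int         # 0 means immediately open
    deposit_location: str       # where the open copy lives (illustrative label)
    in_pmc_oa_subset: bool      # recorded, but deliberately unused below


def check_funder_requirements(article, required_licence, accepted_locations):
    """Evaluate each requirement on its own (hypothetical helper)."""
    return {
        "licence_ok": article.licence == required_licence,
        "embargo_ok": article.embargo_months == 0,
        "location_ok": article.deposit_location in accepted_locations,
    }


if __name__ == "__main__":
    # In the PMC Open Access Subset (non-commercial tier), yet not CC BY.
    article = ArticleRecord("CC BY-NC", 0, "PMC", in_pmc_oa_subset=True)
    result = check_funder_requirements(article, "CC BY", {"PMC", "institutional repository"})
    print(result)
    print("all requirements met:", all(result.values()))
```

    The example article sits in the subset but fails the licence check, which is the situation “it’s in PMC” would hide.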

    For publishers and repository managers, the distinction matters for metadata accuracy: an article’s PMC Open Access Subset licence tag should be checked and communicated separately from any funder compliance statement attached to the same article.

    Looking ahead, the gap between these mechanisms is narrowing. The US federal government’s move toward zero-embargo public access by the end of 2025, alongside Plan S’s established zero-embargo requirement since 2021, signals convergence on immediate access as the global norm — even though the underlying legal and technical mechanisms (funder mandate versus repository licence versus agency policy) remain distinct and will continue to require separate verification.

  • bioRxiv License Update: What Changed for Authors and Reuse

    bioRxiv’s licence update, live via the platform’s Author Area since January 2026, lets authors request a change to a less-restrictive Creative Commons licence on a preprint already posted — without submitting a new version. The change can only move in one direction, towards more permissive reuse, and it exists chiefly to help authors bring older preprints into line with funder mandates that require CC BY.

    bioRxiv is a free preprint server for the life sciences, operated by the non-profit organisation openRxiv, which also runs the companion server medRxiv for health-sciences preprints.


    What is the bioRxiv licence update, and why was it introduced?

    The bioRxiv licence update is a self-service feature that lets a preprint’s corresponding or submitting author switch its Creative Commons licence to a less restrictive option after posting, without triggering a full revision. openRxiv documented the mechanics in a step-by-step guide published on 7 January 2026, and followed up with a policy explainer on 20 May 2026 setting out the rationale.

    The trigger is compliance drift. openRxiv’s own explainer states that a growing number of funders “require their grantees to apply specific licenses to their preprints, typically CC BY,” but that “many authors are unaware of this” and post under a more restrictive option by default. Before this update, the only remedy was to submit an entirely new version of the preprint and re-select a licence, a heavier process that also generates a fresh revision record. This is distinct from an earlier, smaller change in January 2025, when bioRxiv and medRxiv reordered their licence-selection menus to place CC BY at the top of the list; the 2026 update is the first mechanism that lets authors retroactively fix the licence on preprints they have already posted.

    How do authors request a licence change?

    The workflow runs entirely through the bioRxiv submission system’s Author Area and does not require re-uploading a manuscript. It applies only to the most recent version of a preprint, and only to preprints posted within the past two years.

    • Log into the Author Area from the bioRxiv submit page.
    • Locate the “Request License Update” box on the right-hand side of the page.
    • Select “Update license choice on previously posted papers.”
    • Choose the eligible preprint by its manuscript ID (only papers where the requester was corresponding or submitting author are listed).
    • Select a new, less restrictive licence and submit the request; a confirmation email follows.

    One constraint applies strictly: the feature is unavailable if an incomplete revision is already in the submission system, or if a previous licence request is still pending. Journal publication, by contrast, is no barrier: a request can be made even after the preprint has been formally published in a journal, since the licence sits on the preprint record independently of the journal’s own copyright terms.
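    The eligibility conditions described in this section can be gathered into one pre-flight check. The sketch below is not bioRxiv’s implementation: the field names are hypothetical, and it assumes the two-year window is measured from the posting date of the version in question.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PreprintStatus:
    posted_on: date
    is_latest_version: bool
    requester_is_corresponding_or_submitting: bool
    has_incomplete_revision: bool
    has_pending_licence_request: bool


def two_years_before(day: date) -> date:
    """Calendar date two years earlier; 29 February falls back to 28 February."""
    try:
        return day.replace(year=day.year - 2)
    except ValueError:
        return day.replace(year=day.year - 2, day=28)


def licence_update_blockers(p: PreprintStatus, today: date) -> list[str]:
    """Return the reasons a request would be unavailable (empty list = eligible)."""
    blockers = []
    if p.posted_on < two_years_before(today):
        blockers.append("posted more than two years ago")
    if not p.is_latest_version:
        blockers.append("not the most recent version")
    if not p.requester_is_corresponding_or_submitting:
        blockers.append("requester is not corresponding or submitting author")
    if p.has_incomplete_revision:
        blockers.append("incomplete revision in the submission system")
    if p.has_pending_licence_request:
        blockers.append("previous licence request still pending")
    return blockers


if __name__ == "__main__":
    status = PreprintStatus(date(2025, 3, 1), True, True, False, True)
    print(licence_update_blockers(status, today=date(2026, 6, 1)))
```

    Journal publication is deliberately absent from the checks, since it does not affect eligibility.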

    Licence options compared: what actually changed

    bioRxiv preprints have long offered a choice of Creative Commons licences plus a “no licence” (all rights reserved) default, and a CC0 public-domain option for US federal employees such as NIH intramural researchers. What changed in 2026 is not the menu of options — it is that authors can now move an already-posted preprint from a more restrictive option to a less restrictive one after the fact.

    | Licence | Commercial reuse | Attribution required | Text-and-data mining / AI training | Typical funder fit |
    |---|---|---|---|---|
    | CC BY | Permitted | Yes | Unrestricted, including commercial use | HHMI, Gates Foundation, most cOAlition S funders |
    | CC BY-ND | Permitted (no derivatives) | Yes | Mining permitted; no adapted/derivative outputs distributed | Rarely funder-compliant |
    | CC BY-NC | Not permitted | Yes | Restricted to non-commercial use | Non-compliant with CC BY mandates |
    | CC BY-NC-ND | Not permitted | Yes | Most restrictive; non-commercial, no derivatives | Rarely funder-compliant |
    | CC0 | Permitted (public domain) | No | Unrestricted | US federal/NIH intramural authors only |
    | No licence selected | Not permitted without separate permission | N/A | Reuse requires author permission | Non-compliant with most funder mandates |

    Because Creative Commons licences are irrevocable once attached to a public copy of a work, the update only runs in the permissive direction. An author can move from CC BY-NC to CC BY; the system rejects a request to move from CC BY to a more restrictive licence, since existing downloaded and archived copies would remain under the original, broader terms regardless.
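    One way to reason about the one-directional rule is to model each licence as the set of restrictions it imposes and to allow a change only when the requested set is a strict subset of the current one. This is a sketch under that assumption, not bioRxiv’s actual logic; the "ARR" (all rights reserved) marker and the function name are invented for illustration, and the CC0 restriction to US federal employees is not modelled.

```python
# Restrictions each option imposes on reusers (illustrative model only).
RESTRICTIONS = {
    "No licence": frozenset({"BY", "NC", "ND", "ARR"}),
    "CC BY-NC-ND": frozenset({"BY", "NC", "ND"}),
    "CC BY-NC": frozenset({"BY", "NC"}),
    "CC BY-ND": frozenset({"BY", "ND"}),
    "CC BY": frozenset({"BY"}),
    "CC0": frozenset(),
}


def is_less_restrictive(current: str, requested: str) -> bool:
    """True only if `requested` drops restrictions and adds none."""
    # For frozensets, `<` tests for a strict (proper) subset.
    return RESTRICTIONS[requested] < RESTRICTIONS[current]


if __name__ == "__main__":
    print(is_less_restrictive("CC BY-NC", "CC BY"))     # True
    print(is_less_restrictive("CC BY", "CC BY-NC"))     # False
    print(is_less_restrictive("CC BY-NC", "CC BY-ND"))  # False
```

    Under this model, CC BY-NC and CC BY-ND are not comparable, so a move between them is refused in both directions. Whether the platform treats that pair the same way is not stated in the source.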

    What this means for CC-BY reuse, text-and-data mining, and AI training

    bioRxiv’s baseline terms of use already permit text-and-data mining of posted content, which is the legal hook that has made preprint corpora attractive training data for machine-learning systems. The licence attached to an individual preprint then determines the scope of onward reuse beyond that baseline — and this is where the 2026 update has practical bite.

    Under CC BY, any party — including a commercial AI developer — may reproduce, adapt, and redistribute the work, provided the original authors are credited. Under CC BY-NC or CC BY-NC-ND, commercial reuse (which covers most AI model training conducted by for-profit developers) is not licensed, regardless of the platform-level text-mining consent. That gap is precisely what several funders have moved to close: the Howard Hughes Medical Institute’s preprint requirement, effective 1 January 2026, and the Bill & Melinda Gates Foundation’s preprint mandate, in force since 1 January 2025, both require grantee preprints to carry CC BY. The licence-update feature exists to let authors already out of step with those mandates fix a specific preprint without a full resubmission.

    For institutions and research-integrity offices, the practical implication is that a preprint’s licence — not merely its posting on an open server — is the operative variable for downstream reuse and AI-training permissions. Auditing grantee preprints for licence compliance, not just for the fact of preprint deposit, is now a distinct compliance step.
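    An audit of this kind reduces to comparing each preprint’s recorded licence with the licence the funder requires, rather than checking only that a preprint exists. A minimal sketch, assuming an office already holds a list of preprint records; the dictionary keys and the identifiers are placeholders, not real bioRxiv fields or DOIs.

```python
def audit_preprint_licences(preprints, required_licence="CC BY"):
    """Return IDs of deposited preprints whose licence differs from the requirement."""
    return [
        p["id"]
        for p in preprints
        # A missing licence key counts as non-compliant.
        if p.get("licence") != required_licence
    ]


if __name__ == "__main__":
    records = [
        {"id": "example-001", "licence": "CC BY"},
        {"id": "example-002", "licence": "CC BY-NC-ND"},
        {"id": "example-003"},  # posted with no licence selected
    ]
    print(audit_preprint_licences(records))
```

    Every record in the list has been deposited, yet only the first would pass a CC BY requirement.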

    Common Questions

    How do I update the licence on a bioRxiv preprint?

    Authors can request a licence update from the Author Area of the bioRxiv submission system, using the “Request License Update” box, without submitting a full revision. The change applies only to preprints posted in the past two years and only to the most recent version, moving to a less restrictive licence.

    What are the licence options for bioRxiv?

    bioRxiv authors can choose CC BY, CC BY-ND, CC BY-NC, CC BY-NC-ND, or leave the preprint with no licence (all rights reserved). A CC0 public-domain option is also available specifically for US federal employees, such as NIH intramural researchers.

    Does bioRxiv count as published?

    No. A bioRxiv preprint is not peer reviewed and does not constitute formal journal publication; it is a publicly posted manuscript with its own DOI. Authors remain free to submit the same work to a journal afterward, and the preprint record persists independently of that later publication.

    Who maintains bioRxiv?

    bioRxiv is operated by openRxiv, a non-profit organisation dedicated to advancing science communication, which also runs the companion health-sciences server medRxiv. openRxiv is supported by institutions including Cold Spring Harbor Laboratory, the Chan Zuckerberg Initiative, and the Sergey Brin Family Foundation.

    Implications for institutions, funders, and authors

    Research-administration offices tracking open-access compliance should treat the licence update as a remediation tool, not a substitute for correct licence selection at submission. It closes a specific gap — preprints posted before an author understood their funder’s CC BY requirement — but it does not apply to preprints older than two years, to superseded versions, or where a revision is already mid-process.

    For anyone advising authors on authorship rights and responsibilities, the clearest guidance is to check funder licensing terms before first posting, since fixing a mismatched licence later depends on the preprint still being within the two-year eligibility window. Related open-research terminology, including licensing and reuse definitions, is tracked in the CASRAI open-research dictionary.

    Expect other preprint servers to face similar pressure as CC BY mandates spread across research funders. The direction of travel — author-initiated, platform-mediated licence correction rather than manuscript resubmission — is a practical template other repositories are likely to adopt as funder compliance checks tighten.