  • ORCID Database: Inside the Public Data File

    The ORCID database’s annual public data file is a bulk, machine-readable snapshot of every public record in the ORCID registry, released once a year under a CC0 public-domain dedication. It is not the same thing as ORCID’s live summary statistics page: the data file is a static, downloadable dataset built for large-scale analysis of PID adoption, while the statistics page is a running counter of registrations. The two answer different questions for anyone studying how persistent identifiers are being taken up across research.

    ORCID (Open Researcher and Contributor ID) is a non-profit registry that issues a free, unique 16-character identifier so that individual researchers and contributors can be distinguished from others with similar or identical names. The ORCID database that underpins this registry is what the annual public data file exports in bulk form — and that export is the subject of this analysis.

    What is the ORCID public data file?

    The ORCID public data file is a full export of every field that ORCID record-holders have marked as publicly visible, packaged as a single downloadable dataset rather than served record-by-record through the API. ORCID has released one of these files annually since 2013, hosting each release on the Figshare repository with a persistent DOI so that the exact version used in a study can always be cited.

    Access requires no ORCID membership and no API credentials. Anyone — a bibliometrician, a funder’s policy team, a university library, or an independent developer — can download the file directly. This “no gatekeeping” design is deliberate: ORCID’s registry exists to resolve author-name ambiguity across the whole scholarly ecosystem, and the organisation has treated bulk openness of public data as part of that public-interest mandate since the file’s first release.

    What does the annual dataset actually contain?

    Since the 2018 release, the public data file has been split into two components rather than one monolithic archive. This structural change reflects the growing size and complexity of individual records as ORCID’s activity metadata schema expanded.

    • Summaries file: one compact record per ORCID iD, covering biography, employment, education and other profile-level fields.
    • Activity files: separate, more granular files carrying the full public detail of works, funding, peer review and other activities linked to each iD.

    Both components are distributed as XML, the format ORCID’s underlying registry schema is built on; community-maintained conversion tools exist for teams that prefer JSON for downstream processing. Because works metadata in the schema can also carry contributor-role tags, the dataset increasingly includes role-level authorship detail as well as bare authorship claims. That makes it useful for tracking how granular contribution reporting is spreading, which a simple co-authorship list cannot show.

    As of August 2022, the ORCID Statistics page reported 14,727,479 live iDs and 1,258 member organisations. Registration volumes on that scale are what make the annual file a meaningful basis for adoption-trend research rather than a curiosity dataset.

    How does it differ from ORCID’s summary statistics?

    ORCID’s public-facing statistics page shows a live, aggregate count (total registrations, year-on-year growth, member numbers) that is updated continuously as the registry changes. The public data file differs on each of those points: it is a frozen, record-level snapshot taken at a fixed point in time, distributed once a year, and never updated after release.

    Attribute         | Public data file                           | Summary statistics
    ------------------+--------------------------------------------+-------------------------------------
    Granularity       | Every public field of every public record  | Aggregate totals only
    Update frequency  | Annual (fixed snapshot)                    | Continuous / real time
    Format            | Bulk XML archive, downloaded once          | Web page / lightweight API counters
    Licence           | CC0 1.0 public domain dedication           | Published figures, not a dataset
    Typical user      | Researchers, funders, PID analysts         | General public, journalists, members

    This distinction matters for anyone citing ORCID in research administration literature: a claim about “how many researchers have ORCID iDs today” belongs to the statistics page, while a claim about “what fraction of ORCID works records carry funder identifiers” or “how affiliation self-reporting has changed by country” can only be answered from the bulk file itself.

    What can researchers do with the open dataset?

    Because the file is CC0-licensed and covers the full registry rather than a sample, it supports analysis no API query against individual records could replicate at scale. Typical uses include:

    • Measuring PID adoption trends by country, discipline or institution type over successive annual releases
    • Cross-linking ORCID iDs to DataCite and Crossref DOI metadata to study identifier coverage across the publication-funding-repository chain
    • Auditing how completely researchers populate employment and affiliation fields, which underpins institutional-attribution accuracy in research information systems
    • Building reproducible, citable PID-landscape studies, since each annual file carries its own Figshare DOI
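
    The first use in the list, tracking adoption across successive releases, reduces to comparing per-country counts from two snapshots. A minimal sketch, assuming those counts have already been extracted from each release; the function name `growth_by_country`, the country codes and every number below are invented purely to show the calculation.

```python
def growth_by_country(earlier, later):
    """Percentage change in record counts between two releases, per country."""
    growth = {}
    for country, before in earlier.items():
        after = later.get(country, 0)
        # Skip countries with no records in the earlier release, where a
        # percentage change is undefined.
        if before > 0:
            growth[country] = round(100 * (after - before) / before, 1)
    return growth


# Invented counts, not taken from any ORCID release.
release_a = {"AA": 1000, "BB": 400}
release_b = {"AA": 1250, "BB": 460}

print(growth_by_country(release_a, release_b))  # {'AA': 25.0, 'BB': 15.0}
```

    Countries that appear only in the later release are left out here; a fuller analysis would report them separately as new entrants.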

    Since October 2015, DataCite and Crossref have used ORCID’s auto-update mechanism to write newly registered DOI metadata directly into linked ORCID records. As a result, the annual file increasingly reflects publication and dataset activity that researchers never entered manually, a provenance detail that matters when interpreting completeness metrics derived from the file.

    Answer-first Q&A

    What is the ORCID database?

    The ORCID database is the registry of unique 16-character identifiers, and associated public profile data, that ORCID Inc. maintains to distinguish individual researchers and contributors. It underlies both the live registry website and the annual public data file that exports the registry’s public content in bulk.

    Is ORCID iD public?

    An ORCID iD itself is always public once created, but the surrounding profile data is not automatically so. Record-holders set visibility settings field-by-field, and only fields marked public are exported into the annual data file or returned by the public API.

    Is ORCID free to use?

    Yes. Registering for and using an ORCID iD is free for individual researchers, and the public data file itself is free to download under a CC0 dedication. ORCID’s revenue instead comes from paid membership fees charged to institutions, publishers and funders that integrate with the registry.

    How do you find an ORCID iD?

    Individuals can search the ORCID registry directly by name at the public ORCID website, or look up a specific record via its 16-character identifier. Institutions and developers needing bulk lookups instead query the public API or work from the annual data file rather than searching one iD at a time.
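
    Before running bulk lookups, it can help to discard malformed identifiers. ORCID's identifier documentation describes the final character as a check character computed with ISO 7064 MOD 11-2, a detail that comes from that documentation rather than from the text above. The sketch below follows that scheme; the function names are hypothetical, and it checks only that an iD is well formed, not that the record exists.

```python
import re

# Four hyphen-separated groups; the last character may be a digit or "X".
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits):
    """Compute the check character from the first 15 digits (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """True if the string has the hyphenated 16-character form and a valid check character."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


print(is_well_formed_orcid("0000-0002-1825-0097"))  # True
print(is_well_formed_orcid("0000-0002-1825-0098"))  # False
```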

    Implications for institutions and PID researchers

    For research administrators and institutional leaders, the annual public data file is the only reliable way to benchmark ORCID adoption across a whole sector rather than a single institution’s membership dashboard. Funders assessing whether a mandate for ORCID iDs has actually changed researcher behaviour need a full-registry snapshot, not a live counter that only reports totals.

    For developers and PID researchers, the file’s annual cadence and DOI-stamped releases mean every study can specify exactly which snapshot it used. That is a reproducibility property that live API queries, which return continuously changing data, cannot offer. As ORCID’s works metadata increasingly captures structured contributor-role information, future editions of the public data file are likely to become a primary source for studying how granular authorship attribution is spreading across disciplines, alongside identifier adoption itself.
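
    One way to make the snapshot choice explicit in a study is to store the release DOI next to a checksum of the downloaded file. A minimal sketch, with loud assumptions: the function name `snapshot_provenance` is hypothetical, the DOI is a placeholder rather than a real release identifier, and the demo hashes a small stand-in file instead of an actual archive.

```python
import hashlib
import json
import os
import tempfile


def snapshot_provenance(path, doi, chunk_size=1 << 20):
    """Describe a downloaded snapshot by its DOI and SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte archive is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "file": os.path.basename(path),
        "doi": doi,
        "sha256": digest.hexdigest(),
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "snapshot.tar.gz")
    with open(sample, "wb") as handle:
        handle.write(b"stand-in bytes, not a real ORCID archive")
    # Placeholder DOI; substitute the DOI of the release actually used.
    record = snapshot_provenance(sample, doi="10.0000/placeholder")

print(json.dumps(record, indent=2))
```

    Publishing a record like this alongside the analysis code lets a reader confirm they are working from the same bytes, not just the same release year.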