  • Data Provenance: Tracking Research Data to Publication

    Research funders increasingly ask not just whether a dataset is open, but where it came from. Data provenance is the discipline of documenting a dataset’s origin, custody, and every transformation it undergoes between collection and publication — a distinct concern from data lineage, which maps only the technical pathway data takes through systems. As data management plans, repository deposits, and AI-training-data audits come under closer scrutiny, provenance metadata is becoming the connective tissue between “collected” and “citable.”

    What Is Data Provenance?

    Data provenance is the historical record of a dataset’s origin, custody, and processing history — who created or collected it, under what conditions, and what happened to it before it reached its published form. It functions as a chain of custody: not a single field in a metadata record, but a continuous trail spanning collection instruments, transformation scripts, quality checks, and every hand the data passed through.

    This differs from anonymisation or privacy-preserving techniques, which govern what can be disclosed about a dataset’s contents. Provenance governs what can be verified about a dataset’s history — a governance question, not a disclosure-control one.

    Data Provenance vs Data Lineage

    The two terms are frequently used interchangeably, but the ELIXIR Research Data Management Kit (RDMkit) draws a useful distinction: lineage traces the technical movement of data between systems — extract, transform, load, output — while provenance adds the contextual and authorship layer: who authorised each step, why it happened, and under what licence or methodology.

    • Data lineage answers: which pipeline stages did this data pass through, and in what order?
    • Data provenance answers: who is accountable for each stage, and can that history be trusted and cited?

    In practice, a well-built pipeline produces both: lineage as the operational map, provenance as the governance record layered on top of it.
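
    The layering described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stage names, the `StepProvenance` fields and the people named are all hypothetical.

```python
from dataclasses import dataclass

# Lineage: the ordered pipeline stages a dataset passed through.
# Stage names are hypothetical.
lineage = ["extract", "transform", "load", "output"]


@dataclass(frozen=True)
class StepProvenance:
    """Governance context for one lineage stage (illustrative fields)."""
    authorised_by: str  # who signed off on the step
    reason: str         # why the step happened
    licence: str        # licence or methodology that applied


# Provenance: accountability layered on top of the lineage stages.
provenance = {
    "extract": StepProvenance("A. Researcher", "survey export", "CC-BY-4.0"),
    "transform": StepProvenance("B. Curator", "unit normalisation", "CC-BY-4.0"),
    "load": StepProvenance("B. Curator", "repository staging", "CC-BY-4.0"),
}


def unaccounted_stages(stages, records):
    """Return lineage stages that have no provenance record, in pipeline order."""
    return [stage for stage in stages if stage not in records]


missing = unaccounted_stages(lineage, provenance)
print(missing)  # ['output']
```

    Here the lineage is complete, but the "output" stage has no accountable party recorded, which is the kind of gap a provenance review would flag.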

    Provenance Standards: W3C PROV, RDA and RO-Crate

    Provenance only becomes machine-actionable, and therefore auditable at scale, once it is captured against a shared model rather than free text. The W3C PROV family supplies the reference model: PROV-DM defines the data model of “entities,” “activities,” and “agents”; PROV-O expresses that model as an OWL ontology; and PROV-N gives it a human-readable notation, so provenance graphs can be exchanged between systems. The Research Data Alliance (RDA) has convened interest groups aligning disciplinary metadata practices with PROV-DM, and repository-facing specifications build on top of it.
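
    To make the entity/activity/agent model concrete, the sketch below records one cleaning step and prints simplified PROV-N-style statements. It is an assumption-laden illustration: `ProvDocument` is a hypothetical stand-in rather than a real PROV library, the `ex:` identifiers are invented, and the output omits the document wrapper and prefix declarations that full PROV-N requires.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """Tiny in-memory stand-in for a PROV graph (not a real PROV library)."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def used(self, activity, entity):
        """An activity consumed an existing entity."""
        self.activities.add(activity)
        self.entities.add(entity)
        self.relations.append(("used", activity, entity))

    def was_generated_by(self, entity, activity):
        """An entity was produced by an activity."""
        self.entities.add(entity)
        self.activities.add(activity)
        self.relations.append(("wasGeneratedBy", entity, activity))

    def was_associated_with(self, activity, agent):
        """An agent bore responsibility for an activity."""
        self.activities.add(activity)
        self.agents.add(agent)
        self.relations.append(("wasAssociatedWith", activity, agent))

    def was_attributed_to(self, entity, agent):
        """An entity is ascribed to an agent."""
        self.entities.add(entity)
        self.agents.add(agent)
        self.relations.append(("wasAttributedTo", entity, agent))

    def statements(self):
        """Render the graph as simplified PROV-N-style lines."""
        lines = [f"entity({e})" for e in sorted(self.entities)]
        lines += [f"activity({a})" for a in sorted(self.activities)]
        lines += [f"agent({g})" for g in sorted(self.agents)]
        lines += [f"{rel}({s}, {o})" for rel, s, o in self.relations]
        return lines


doc = ProvDocument()
doc.used("ex:cleaning", "ex:raw_survey")
doc.was_generated_by("ex:clean_survey", "ex:cleaning")
doc.was_associated_with("ex:cleaning", "ex:curator")
doc.was_attributed_to("ex:clean_survey", "ex:curator")

statements = doc.statements()
print("\n".join(statements))
```

    In a production setting an established PROV library or an RDF toolkit would normally replace this hand-rolled class; the point is only that each custody step reduces to a small set of typed relations.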

    Standard / Framework | Steward | What It Captures
    PROV-DM / PROV-O / PROV-N | W3C | Formal graph model of entities, activities and agents; RDF/OWL-serialisable provenance
    RO-Crate | Research Object community (schema.org-based) | Packages a dataset with its licence, workflow-run history and provenance in one archive
    ISO 19115-2 | ISO | Lineage extension for geographic and imagery metadata
    DataCite Metadata Schema | DataCite | Related-identifier relationship types (IsDerivedFrom, IsSourceOf) linking a dataset DOI to its origin and outputs
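
    As an illustration of the DataCite row, the sketch below builds related-identifier entries linking a dataset to its source and to a derived output. The property names `relatedIdentifier`, `relatedIdentifierType` and `relationType` follow the DataCite Metadata Schema; the DOIs are hypothetical, and the surrounding record layout is simplified for the example rather than taken from any particular repository API.

```python
import json

# Only the two relation types named in the table are accepted in this sketch;
# the DataCite schema defines many more.
ALLOWED_RELATIONS = {"IsDerivedFrom", "IsSourceOf"}


def related_identifier(doi, relation_type):
    """Build one DataCite-style relatedIdentifier entry for a DOI."""
    if relation_type not in ALLOWED_RELATIONS:
        raise ValueError(f"unsupported relation type in this sketch: {relation_type}")
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


record = {
    "doi": "10.1234/cleaned-survey",  # hypothetical DOI of the deposited dataset
    "relatedIdentifiers": [
        # The deposited dataset was derived from a raw collection...
        related_identifier("10.1234/raw-survey", "IsDerivedFrom"),
        # ...and is itself the source of a later output.
        related_identifier("10.1234/summary-tables", "IsSourceOf"),
    ],
}

print(json.dumps(record, indent=2))
```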

    Discipline-specific profiles then sit on top of these: FAIRsharing and RDA’s standards directory catalogue hundreds of provenance and metadata schemas so groups do not reinvent the model for each field.

    Building a Custody Chain from Collection to Publication

    A defensible provenance record follows the dataset through five stages, each logged with enough detail that a third party could reconstruct the history without contacting the original team.

    • Collection: instrument or method, collector identity (an ORCID iD is the practical anchor), date, and location captured at source.
    • Transformation: every cleaning, normalisation, aggregation or filtering step logged with the tool and version used.
    • Review: who validated the data, what checks were applied, and what was flagged or excluded.
    • Deposit: registration in a repository with a persistent identifier (a DataCite or Crossref DOI) and a ROR identifier for the responsible institution.
    • Citation and reuse: downstream citations captured so the provenance trail extends forward into the published research output that relies on it.
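
    The five stages above can be sketched as a simple custody log with two checks: which stages are still missing, and whether events were recorded in stage order. The `CustodyEvent` fields and the sample entries are assumptions made for illustration; the ORCID iD shown is the sample iD that ORCID uses in its own documentation.

```python
from dataclasses import dataclass

STAGES = ["collection", "transformation", "review", "deposit", "citation"]


@dataclass(frozen=True)
class CustodyEvent:
    """One logged step in the custody chain (illustrative fields)."""
    stage: str
    actor: str   # for example an ORCID iD
    detail: str  # instrument, tool and version, checks applied, DOI, ...
    date: str    # ISO 8601 date

    def __post_init__(self):
        if self.stage not in STAGES:
            raise ValueError(f"unknown custody stage: {self.stage}")


def missing_stages(events):
    """Return the stages that have no logged event yet, in chain order."""
    seen = {event.stage for event in events}
    return [stage for stage in STAGES if stage not in seen]


def in_stage_order(events):
    """True if the log never steps back to an earlier stage."""
    positions = [STAGES.index(event.stage) for event in events]
    return positions == sorted(positions)


SAMPLE_ORCID = "0000-0002-1825-0097"

events = [
    CustodyEvent("collection", SAMPLE_ORCID, "field sensor, site A", "2024-03-01"),
    CustodyEvent("transformation", SAMPLE_ORCID, "cleaning script v1.2: dropped null rows", "2024-03-05"),
    CustodyEvent("transformation", SAMPLE_ORCID, "cleaning script v1.2: daily means", "2024-03-06"),
    CustodyEvent("review", SAMPLE_ORCID, "range checks; 3 rows excluded", "2024-03-10"),
]

print(missing_stages(events))  # ['deposit', 'citation']
print(in_stage_order(events))  # True
```

    A log like this lets a third party see at a glance that the dataset has been collected, transformed and reviewed but not yet deposited or cited.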

    Contributor-role taxonomies help name accountability at each stage. The CRediT taxonomy, which CASRAI originated in 2014 and NISO now stewards as ANSI/NISO Z39.104-2022, includes a “Data Curation” role, for example. It gives institutions a controlled vocabulary for naming who performed which custody step, complementing PROV-O’s more technical entity/activity/agent model. Research administrators building data management plans can pair the two: CRediT roles for human accountability, PROV-DM for machine-actionable history.
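
    One way to pair the two vocabularies is to attach a CRediT role to each activity-agent association. The sketch below assumes a hypothetical `annotate` helper and invented `ex:` identifiers; the role set is only a subset of the CRediT taxonomy, and the choice of role for each activity is an illustrative judgement rather than anything the standard prescribes.

```python
# A subset of CRediT role names; the full taxonomy defines fourteen roles.
CREDIT_ROLES = {
    "Data curation",
    "Investigation",
    "Methodology",
    "Software",
    "Validation",
}


def annotate(activity_id, agent_id, credit_role):
    """Link a PROV-style activity and agent with a CRediT role name."""
    if credit_role not in CREDIT_ROLES:
        raise ValueError(f"not a CRediT role known to this sketch: {credit_role}")
    return {
        "activity": activity_id,      # machine-actionable history (PROV side)
        "agent": agent_id,
        "credit_role": credit_role,   # human accountability (CRediT side)
    }


associations = [
    annotate("ex:field_collection", "ex:researcher", "Investigation"),
    annotate("ex:cleaning", "ex:curator", "Data curation"),
    annotate("ex:quality_review", "ex:reviewer", "Validation"),
]

for association in associations:
    print(association)
```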

    Common Questions About Data Provenance

    What is data provenance?

    Data provenance is the documented history of a dataset’s origin and custody — who collected it, under what method, and what transformations it underwent before use. It functions as a chain of custody, supporting authenticity checks, quality auditing, and reproducibility of any research output that relies on the data.

    What is data provenance vs lineage?

    Data lineage maps the technical route data takes between systems — extraction, transformation, loading. Data provenance adds the accountability layer: who authorised each step, why it occurred, and under what licence. Lineage is the operational map; provenance is the governance record built on top of it.

    What are the two classes of data provenance?

    Provenance literature typically distinguishes retrospective provenance, which records what actually happened to a dataset (the steps executed, the inputs used, and the outputs produced), from prospective provenance, which records the plan or workflow specification describing how the data is expected to move and transform before any of it happens.
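
    The two classes become useful together when the plan is compared with the trace. The following minimal sketch treats the prospective record as the list of planned steps and the retrospective record as the list of steps that actually ran; the step names are hypothetical.

```python
# Prospective provenance: the workflow as specified before execution.
planned = ["collect", "clean", "aggregate", "deposit"]

# Retrospective provenance: the steps recorded as they actually ran.
executed = ["collect", "clean", "deduplicate", "deposit"]


def compare(planned_steps, executed_steps):
    """Report planned steps that never ran and executed steps that were not planned."""
    skipped = [step for step in planned_steps if step not in executed_steps]
    unplanned = [step for step in executed_steps if step not in planned_steps]
    return {"skipped": skipped, "unplanned": unplanned}


deviations = compare(planned, executed)
print(deviations)  # {'skipped': ['aggregate'], 'unplanned': ['deduplicate']}
```

    Any deviation is a prompt for documentation: either the plan was revised, or the executed step needs its own justification in the record.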

    What does provenance mean?

    Outside data contexts, provenance refers to the documented history of ownership or origin of an object — the term used to authenticate artworks and manuscripts. Applied to research data, the same principle holds: a verifiable record of origin that supports trust, exactly as a chain of custody supports evidentiary trust in other domains.

    Why Provenance Completes FAIR: Implications for Institutions

    The FAIR data principles (Findable, Accessible, Interoperable, Reusable) are frequently treated as a checklist for open deposit, but the Reusable facet explicitly requires more than a licence tag. Principle R1.2 states that “(meta)data are associated with detailed provenance” — a sub-principle that is easy to satisfy nominally and hard to satisfy meaningfully. A dataset can be technically Findable and Accessible while its provenance metadata is a single free-text sentence, which leaves reproducibility unverifiable in practice.
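
    The difference between nominal and meaningful provenance can be approximated with a simple check on a metadata record. This is a rough heuristic under stated assumptions, not a FAIR assessment tool: the `provenance` field and the required keys (`activity`, `agent`, `date`) are hypothetical names chosen for the example.

```python
# Keys each structured provenance step must carry in this sketch (hypothetical).
REQUIRED_KEYS = {"activity", "agent", "date"}


def provenance_detail(metadata):
    """Classify a record's provenance as 'none', 'free-text', 'incomplete' or 'structured'."""
    prov = metadata.get("provenance")
    if not prov:
        return "none"
    if isinstance(prov, str):
        return "free-text"
    if isinstance(prov, list) and all(
        isinstance(step, dict) and REQUIRED_KEYS.issubset(step) for step in prov
    ):
        return "structured"
    return "incomplete"


nominal = {"title": "Survey 2024", "provenance": "Collected and cleaned by the team."}
detailed = {
    "title": "Survey 2024",
    "provenance": [
        {"activity": "collection", "agent": "ex:researcher", "date": "2024-03-01"},
        {"activity": "cleaning", "agent": "ex:curator", "date": "2024-03-05"},
    ],
}

print(provenance_detail(nominal))   # free-text
print(provenance_detail(detailed))  # structured
```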

    This gap matters more as scrutiny of dataset origin intensifies elsewhere. The Data Provenance Initiative, a multi-institution audit led from the MIT Media Lab, traced more than 1,800 AI training datasets and reported licence omission rates above 70% and licence miscategorisation rates above 50% on popular dataset hosting sites. That is a warning sign for any field, including research data management, that treats provenance as an afterthought rather than a captured-at-source discipline.

    For institutions building or refreshing data management plans under UKRI or Horizon Europe funding requirements, the practical implication is straightforward: provenance capture belongs at collection time, encoded against PROV-DM or an equivalent model, not reconstructed retrospectively when a journal, repository, or auditor asks for it. Research administrators, repository managers, and publishers who build custody-chain logging into their research administration workflows now will find FAIR compliance — and reproducibility review — considerably less costly later.
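
    Capturing provenance as each step runs, rather than reconstructing it later, can be sketched with a small decorator. Everything here is an assumption made for illustration: the `record_step` decorator, the in-memory `PROVENANCE_LOG` and the logged fields are hypothetical, and a real deployment would write to durable storage in a PROV-DM-aligned format.

```python
import functools
import platform
from datetime import datetime, timezone

PROVENANCE_LOG = []  # in-memory only; stands in for durable storage


def record_step(agent):
    """Decorator that logs who ran a transformation, when, and with what effect."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            started = datetime.now(timezone.utc)
            result = func(data, *args, **kwargs)
            PROVENANCE_LOG.append({
                "activity": func.__name__,
                "agent": agent,
                "python_version": platform.python_version(),
                "started_at": started.isoformat(),
                "ended_at": datetime.now(timezone.utc).isoformat(),
                "rows_in": len(data),
                "rows_out": len(result),
            })
            return result
        return wrapper
    return decorator


@record_step(agent="0000-0002-1825-0097")  # ORCID's documented sample iD
def drop_missing(rows):
    """Remove missing readings from a list of values."""
    return [row for row in rows if row is not None]


cleaned = drop_missing([1.2, None, 3.4])
print(cleaned)                         # [1.2, 3.4]
print(PROVENANCE_LOG[-1]["activity"])  # drop_missing
```

    Because the record is written as a side effect of running the step, the log cannot drift out of sync with what the pipeline actually did.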