{"id":2888,"date":"2026-07-02T23:24:32","date_gmt":"2026-07-02T23:24:32","guid":{"rendered":"https:\/\/casrai.org\/wp\/data-provenance-chain\/"},"modified":"2026-07-02T23:24:32","modified_gmt":"2026-07-02T23:24:32","slug":"data-provenance-chain","status":"publish","type":"post","link":"https:\/\/casrai.org\/wp\/data-provenance-chain\/","title":{"rendered":"Data Provenance: Tracking Research Data to Publication"},"content":{"rendered":"<ul>\n<li><a href=\"#what-is-data-provenance\">What Is Data Provenance?<\/a><\/li>\n<li><a href=\"#data-provenance-vs-data-lineage\">Data Provenance vs Data Lineage<\/a><\/li>\n<li><a href=\"#provenance-standards\">Provenance Standards: W3C PROV, RDA and RO-Crate<\/a><\/li>\n<li><a href=\"#custody-chain\">Building a Custody Chain from Collection to Publication<\/a><\/li>\n<li><a href=\"#faq\">Common Questions About Data Provenance<\/a><\/li>\n<li><a href=\"#implications\">Why Provenance Completes FAIR: Implications for Institutions<\/a><\/li>\n<\/ul>\n<p>Research funders increasingly ask not just whether a dataset is open, but where it came from. <strong>Data provenance<\/strong> is the discipline of documenting a dataset&#8217;s origin, custody, and every transformation it undergoes between collection and publication \u2014 a distinct concern from data lineage, which maps only the technical pathway data takes through systems. As data management plans, repository deposits, and AI-training-data audits come under closer scrutiny, provenance metadata is becoming the connective tissue between &#8220;collected&#8221; and &#8220;citable.&#8221;<\/p>\n<h2 id=\"what-is-data-provenance\">What Is Data Provenance?<\/h2>\n<p>Data provenance is the historical record of a dataset&#8217;s origin, custody, and processing history \u2014 who created or collected it, under what conditions, and what happened to it before it reached its published form. 
It functions as a chain of custody: not a single field in a metadata record, but a continuous trail spanning collection instruments, transformation scripts, quality checks, and every hand the data passed through.<\/p>\n<p>This differs from anonymisation or privacy-preserving techniques, which govern what can be disclosed about a dataset&#8217;s contents. Provenance governs what can be verified about a dataset&#8217;s history \u2014 a governance question, not a disclosure-control one.<\/p>\n<h2 id=\"data-provenance-vs-data-lineage\">Data Provenance vs Data Lineage<\/h2>\n<p>The two terms are frequently used interchangeably, but the <a href=\"https:\/\/rdmkit.elixir-europe.org\/data_provenance\">ELIXIR Research Data Management Kit (RDMkit)<\/a> draws a useful distinction: lineage traces the technical movement of data between systems \u2014 extract, transform, load, output \u2014 while provenance adds the contextual and authorship layer: who authorised each step, why it happened, and under what licence or methodology.<\/p>\n<ul>\n<li><strong>Data lineage<\/strong> answers: which pipeline stages did this data pass through, and in what order?<\/li>\n<li><strong>Data provenance<\/strong> answers: who is accountable for each stage, and can that history be trusted and cited?<\/li>\n<\/ul>\n<p>In practice, a well-built pipeline produces both: lineage as the operational map, provenance as the governance record layered on top of it.<\/p>\n<h2 id=\"provenance-standards\">Provenance Standards: W3C PROV, RDA and RO-Crate<\/h2>\n<p>Provenance only becomes machine-actionable \u2014 and therefore auditable at scale \u2014 once it is captured against a shared model rather than free text. 
The <a href=\"https:\/\/www.w3.org\/TR\/prov-overview\/\">W3C PROV family<\/a> (PROV-DM, PROV-O, PROV-N) is the reference data model, formally recommending how to describe &#8220;entities,&#8221; &#8220;activities,&#8221; and &#8220;agents&#8221; so provenance graphs can be exchanged between systems. The Research Data Alliance (RDA) has convened interest groups aligning disciplinary metadata practices with PROV-DM, and repository-facing specifications build on top of it.<\/p>\n<table>\n<tr>\n<th>Standard \/ Framework<\/th>\n<th>Steward<\/th>\n<th>What It Captures<\/th>\n<\/tr>\n<tr>\n<td>PROV-DM \/ PROV-O \/ PROV-N<\/td>\n<td>W3C<\/td>\n<td>Formal graph model of entities, activities and agents; RDF\/OWL-serialisable provenance<\/td>\n<\/tr>\n<tr>\n<td>RO-Crate<\/td>\n<td>Research Object community (schema.org-based)<\/td>\n<td>Packages a dataset with its licence, workflow-run history and provenance in one archive<\/td>\n<\/tr>\n<tr>\n<td>ISO 19115-2<\/td>\n<td>ISO<\/td>\n<td>Lineage extension for geographic and imagery metadata<\/td>\n<\/tr>\n<tr>\n<td>DataCite Metadata Schema<\/td>\n<td>DataCite<\/td>\n<td>Related-identifier relationship types (IsDerivedFrom, IsSourceOf) linking a dataset DOI to its origin and outputs<\/td>\n<\/tr>\n<\/table>\n<p>Discipline-specific profiles then sit on top of these: FAIRsharing and RDA&#8217;s standards directory catalogue hundreds of provenance and metadata schemas so groups do not reinvent the model for each field.<\/p>\n<h2 id=\"custody-chain\">Building a Custody Chain from Collection to Publication<\/h2>\n<p>A defensible provenance record follows the dataset through five stages, each logged with enough detail that a third party could reconstruct the history without contacting the original team.<\/p>\n<ul>\n<li><strong>Collection:<\/strong> instrument or method, collector identity (an ORCID iD is the practical anchor), date, and location captured at source.<\/li>\n<li><strong>Transformation:<\/strong> every cleaning, 
normalisation, aggregation or filtering step logged with the tool and version used.<\/li>\n<li><strong>Review:<\/strong> who validated the data, what checks were applied, and what was flagged or excluded.<\/li>\n<li><strong>Deposit:<\/strong> registration in a repository with a persistent identifier (a DataCite or Crossref DOI) and a ROR identifier for the responsible institution.<\/li>\n<li><strong>Citation and reuse:<\/strong> downstream citations captured so the provenance trail extends forward into the published research output that relies on it.<\/li>\n<\/ul>\n<p>Contributor-role taxonomies help name accountability at each stage. CRediT, a taxonomy CASRAI originated in 2014 and which is now stewarded by NISO as ANSI\/NISO Z39.104-2022, gives institutions a controlled vocabulary (its &#8220;Data Curation&#8221; role, for example) for naming who performed which custody step, complementing PROV-O&#8217;s more technical entity\/activity\/agent model. Research administrators building data management plans can pair the two: <a href=\"\/credit\/roles\/\">CRediT roles<\/a> for human accountability, PROV-DM for machine-actionable history.<\/p>\n<h2 id=\"faq\">Common Questions About Data Provenance<\/h2>\n<h3 id=\"what-is-a-data-provenance\">What is data provenance?<\/h3>\n<p><strong>Data provenance<\/strong> is the documented history of a dataset&#8217;s origin and custody \u2014 who collected it, under what method, and what transformations it underwent before use. It functions as a <strong>chain of custody<\/strong>, supporting authenticity checks, quality auditing, and reproducibility of any research output that relies on the data.<\/p>\n<h3 id=\"provenance-vs-lineage\">What is data provenance vs lineage?<\/h3>\n<p>Data lineage maps the technical route data takes between systems \u2014 extraction, transformation, loading. 
<strong>Data provenance<\/strong> adds the accountability layer: who authorised each step, why it occurred, and under what licence. Lineage is the operational map; <strong>provenance<\/strong> is the governance record built on top of it.<\/p>\n<h3 id=\"two-classes-of-provenance\">What are the two classes of data provenance?<\/h3>\n<p>Provenance literature typically distinguishes <strong>backward (retrospective) provenance<\/strong>, which reconstructs a dataset&#8217;s origin and history after the fact, from <strong>forward (prospective) provenance<\/strong>, which records how data is expected to move and transform in a defined future workflow before it happens.<\/p>\n<h3 id=\"what-does-provenance-mean\">What does provenance mean?<\/h3>\n<p>Outside data contexts, <strong>provenance<\/strong> refers to the documented history of ownership or origin of an object \u2014 the term used to authenticate artworks and manuscripts. Applied to research data, the same principle holds: a verifiable record of origin that supports trust, exactly as a chain of custody supports evidentiary trust in other domains.<\/p>\n<h2 id=\"implications\">Why Provenance Completes FAIR: Implications for Institutions<\/h2>\n<p>The FAIR data principles (Findable, Accessible, Interoperable, Reusable) are frequently treated as a checklist for open deposit, but the Reusable facet explicitly requires more than a licence tag. Principle R1.2 states that &#8220;(meta)data are associated with detailed provenance&#8221; \u2014 a sub-principle that is easy to satisfy nominally and hard to satisfy meaningfully. A dataset can be technically Findable and Accessible while its provenance metadata is a single free-text sentence, which leaves reproducibility unverifiable in practice.<\/p>\n<p>This gap matters more as scrutiny of dataset origin intensifies elsewhere. 
MIT Media Lab&#8217;s audit of over 1,800 AI training datasets found licence omission or miscategorisation in more than two-thirds of cases \u2014 a warning sign for any field, including research data management, that treats provenance as an afterthought rather than a captured-at-source discipline.<\/p>\n<p>For institutions building or refreshing data management plans under UKRI or Horizon Europe funding requirements, the practical implication is straightforward: provenance capture belongs at collection time, encoded against PROV-DM or an equivalent model, not reconstructed retrospectively when a journal, repository, or auditor asks for it. Research administrators, repository managers, and publishers who build custody-chain logging into their <a href=\"\/research-administration\/\">research administration<\/a> workflows now will find FAIR compliance \u2014 and reproducibility review \u2014 considerably less costly later.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data provenance documents a dataset&#8217;s custody chain from collection to publication, closing the gap FAIR leaves for 
reproducibility.<\/p>\n","protected":false},"author":15,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_casrai_contributor_statement":"","_casrai_contributors_json":"","_article_doi":"","_article_license":[],"_article_funding":[],"_casrai_article_id":"","_casrai_registry_status":"","_casrai_registry_date":"","footnotes":""},"categories":[263],"tags":[181,1948,1947,1340,1950,353,1550,1949],"credit_role":[],"dictionary_domain":[],"class_list":["post-2888","post","type-post","status-publish","format-standard","hentry","category-analysis","tag-data-governance","tag-data-lineage","tag-data-provenance","tag-fair-data-principles","tag-provenance-standards","tag-research-data-management","tag-ro-crate","tag-w3c-prov"],"_links":{"self":[{"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/posts\/2888","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/comments?post=2888"}],"version-history":[{"count":0,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/posts\/2888\/revisions"}],"wp:attachment":[{"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/media?parent=2888"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/categories?post=2888"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/tags?post=2888"},{"taxonomy":"credit_role","embeddable":true,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/credit_role?post=2888"},{"taxonomy":"dictionary_domain","embeddable":true,"href":"https:\/\/casrai.org\/wp\/wp-json\/wp\/v2\/dictionary_domain?post=2888"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}"
,"templated":true}]}}