Data lineage

Data lineage is the documented record of how data flows through an organisation: where it originates, the transformations it undergoes and where it is ultimately consumed. That record is what makes data traceable and auditable, and therefore easier to trust.

What lineage captures

Data lineage documents the journey of data: the systems and datasets it originates in, the processing and transformations applied at each stage, and the reports, models or applications that consume it. It can be recorded at different levels of detail, from a high-level map of system-to-system flows down to column-level transformations. The result is a traceable chain that answers where a given value came from and what happened to it along the way.
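The traceable chain described above can be pictured as a small graph of flows. The following is a minimal sketch, not a reference implementation: the `LineageEdge` type, the `trace_to_source` helper and all dataset and column names are hypothetical, chosen only to illustrate column-level lineage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One documented hop: data moves from source to target via a transformation."""
    source: str          # upstream dataset or column
    target: str          # downstream dataset or column
    transformation: str  # what happened to the data on this hop


# Hypothetical column-level flows, from an operational system to a report.
EDGES = [
    LineageEdge("crm.orders.amount", "staging.orders.amount_eur",
                "convert currency to EUR"),
    LineageEdge("staging.orders.amount_eur", "warehouse.sales.monthly_total",
                "sum by month"),
    LineageEdge("warehouse.sales.monthly_total", "report.revenue_dashboard.total",
                "format for display"),
]


def trace_to_source(target, edges):
    """Return the edges that lead to `target`, ordered from origin to target."""
    by_target = {}
    for edge in edges:
        by_target.setdefault(edge.target, []).append(edge)

    chain, seen = [], set()

    def visit(node):
        for edge in by_target.get(node, []):
            if edge not in seen:
                seen.add(edge)
                visit(edge.source)  # walk upstream first...
                chain.append(edge)  # ...so origins appear first in the result

    visit(target)
    return chain
```

Calling `trace_to_source("report.revenue_dashboard.total", EDGES)` returns the three hops in order, answering both where the reported value came from and what happened to it along the way. A system-level map would use the same structure with coarser names, such as `crm` and `warehouse`.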

Why lineage matters

Lineage is foundational to trust and accountability. When a figure in a report looks wrong, lineage lets you trace it back to its source and find where the error entered. Before a system is changed, lineage supports impact analysis by showing which downstream reports and processes the change would affect. For regulatory and audit purposes it provides an evidence trail. And as organisations build analytics and AI on their data, lineage records the provenance of inputs, which is increasingly required to demonstrate responsible and compliant use.
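Impact analysis amounts to walking the lineage graph in the downstream direction. The sketch below assumes lineage is already available as a simple mapping from each asset to the assets it feeds; the `DOWNSTREAM` mapping, the asset names and the `impacted_by` helper are all hypothetical.

```python
from collections import deque

# Hypothetical flows: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "crm.orders": ["staging.orders"],
    "staging.orders": ["warehouse.sales", "warehouse.customers"],
    "warehouse.sales": ["report.revenue", "model.forecast"],
    "warehouse.customers": ["report.churn"],
}


def impacted_by(asset, downstream):
    """Return every asset reachable downstream of `asset`, sorted by name."""
    affected = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in affected:  # also guards against cycles
                affected.add(child)
                queue.append(child)
    return sorted(affected)
```

Here `impacted_by("staging.orders", DOWNSTREAM)` lists both warehouse tables, both reports and the forecast model, which is the set of things to check before changing that staging table. Root-cause investigation is the same walk in the opposite direction.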

Lineage and provenance

Data lineage is closely related to data provenance: lineage emphasises the flow and transformations within and between systems, while provenance stresses the origin and authenticity of data. In research and AI contexts the two converge — both are about being able to demonstrate where data came from and how trustworthy it is. Maintaining accurate lineage usually depends on capturing operational metadata automatically, since manually documented flows quickly fall out of date as pipelines change.
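One way to picture automatic capture is to have each pipeline step emit a lineage record as a side effect of running, so the record cannot drift from what actually executed. The sketch below is a simplification under stated assumptions: the `records_lineage` decorator, the in-memory `LINEAGE_LOG` and the dataset names are hypothetical, and the inputs and output are still declared by hand, whereas production tooling typically derives them from query logs or job metadata.

```python
import functools
from datetime import datetime, timezone

# Stand-in for a metadata store; a real system would persist these records.
LINEAGE_LOG = []


def records_lineage(inputs, output):
    """Decorator that appends a lineage record each time the step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Record only after the step succeeds, so the log reflects real runs.
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@records_lineage(inputs=["staging.orders"], output="warehouse.sales")
def build_sales(rows):
    """Hypothetical transformation: keep only orders with a positive amount."""
    return [row for row in rows if row["amount"] > 0]
```

Each call to `build_sales` adds one timestamped entry to `LINEAGE_LOG`, so a renamed or removed step stops producing records instead of lingering in a stale diagram.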

Key facts

At a glance

Definition: the documented flow of data from source to consumption
Captures: origin, transformations and downstream usage
Granularity: from system-level maps to column-level detail
Key uses: trust, audit, impact analysis, AI data provenance
Related to: data provenance (origin and authenticity)
Best captured: automatically, as part of operational metadata

Common misconceptions

What people often get wrong

Often heard: Data lineage is only needed for regulatory compliance.

Actually: Compliance is one use, but lineage also enables impact analysis before changes, root-cause investigation of data errors and provenance for analytics and AI — all of which build everyday trust in data.

Often heard: Lineage can be documented once in a diagram and left alone.

Actually: Data pipelines change constantly. Manually maintained lineage quickly becomes inaccurate, so lineage is best captured automatically from operational metadata and kept continuously up to date.

Often heard: Lineage and provenance mean exactly the same thing.

Actually: They overlap but differ in emphasis. Lineage focuses on the flow and transformation of data between systems; provenance stresses the origin and authenticity of the data itself.
