Tag: FAIR data

Citing Data Properly: The Joint Declaration of Data Citation Principles
For decades, the data underpinning a study lived in a footnote, an appendix, or nowhere visible at all. A reader who wanted to inspect, reuse, or build on those data had little to go on. As research has become more data-intensive, that omission has grown harder to justify. The Joint Declaration of Data Citation Principles, published through FORCE11 in 2014, was a deliberate attempt to fix it by treating datasets as legitimate, citable research outputs in their own right.

Why data citation matters

Citing data is not merely good manners. It serves the same purposes as citing the literature: it credits the people who produced the work, it lets readers verify claims, and it builds a traceable record of how knowledge accumulates. When a dataset is cited formally, the citation can be counted, indexed, and linked, which means the often-considerable labour of collecting, cleaning, and documenting data becomes visible and rewardable. This connects directly to broader efforts in FAIR data, where the goal is for data to be findable, accessible, interoperable, and reusable.

The eight principles

The Declaration is built around eight principles that, taken together, describe what responsible data citation looks like:
- Importance. Data should be considered legitimate, citable products of research, deserving the same status as publications.
- Credit and attribution. Citations should give scholarly credit and normative, legal attribution to everyone who contributed to the data.
- Evidence. Where a claim rests on data, the corresponding data should be cited.
- Unique identification. Citations should include a persistent, machine-actionable, globally unique identifier.
- Access. Citations should make it possible to reach the data themselves and their associated metadata and documentation.
- Persistence. Identifiers and metadata should persist even beyond the lifespan of the data they describe.
- Specificity and verifiability. Citations should allow a precise version and subset of the data to be identified.
- Interoperability and flexibility. Citation methods should work across communities while accommodating disciplinary differences.
These principles are intentionally technology-neutral. They do not mandate a single repository or identifier scheme; they describe outcomes that any sound practice should achieve.

How to cite a dataset in practice

A well-formed data citation looks much like a reference to an article, but with a few additions. At minimum it should carry the creator or creators, the year of publication, the title of the dataset, the publisher or repository, a version where one exists, and a persistent identifier. For research datasets that identifier is most often a DOI, typically registered through DataCite and resolvable to a landing page that describes the dataset and points to the files. A typical reference takes the shape: Creator(s) (Year): Title. Version. Publisher. Dataset. DOI.
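The reference shape described above can be sketched in code. What follows is a minimal illustration, not a DataCite tool: the `DatasetRecord` class and `format_data_citation` helper are hypothetical names, and the creators, title, repository and DOI (under the placeholder prefix 10.1234) are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical container for the citation elements listed above."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str
    version: Optional[str] = None


def format_data_citation(record: DatasetRecord) -> str:
    """Render: Creator(s) (Year): Title. Version. Publisher. Dataset. DOI."""
    creators = "; ".join(record.creators)
    parts = [f"{creators} ({record.year}): {record.title}."]
    if record.version:  # the version is included only where one exists
        parts.append(f"Version {record.version}.")
    parts.append(f"{record.publisher}.")
    parts.append("Dataset.")
    parts.append(f"https://doi.org/{record.doi}")
    return " ".join(parts)


example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2024,
    title="Example river temperature readings",
    publisher="Example Repository",
    doi="10.1234/example.5678",  # placeholder DOI, not a real dataset
    version="2.1",
)
citation = format_data_citation(example)
print(citation)
```

Keeping the version as an explicit field, rather than folding it into the title, makes it harder to omit when a dataset has several releases.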

Two details repay attention. First, versioning is not optional for datasets that change over time. Citing the specific version used means a future reader can reproduce exactly what was analysed, rather than a later, possibly different, release. Second, the identifier should appear in the reference list, not merely in the running text. Burying a dataset DOI in a sentence keeps it out of the indexing and counting systems that make citation meaningful in the first place.

DataCite DOIs and the reference list

DataCite was established precisely to assign DOIs to research data and to maintain the metadata that makes those DOIs useful. When a repository mints a DataCite DOI for a dataset, it registers structured metadata describing the creators, title, publication year, resource type, and related identifiers. That metadata is what allows discovery services and reference managers to handle data citations the way they handle article citations. Placing the DOI in the reference list, formatted to the relevant style, lets indexing infrastructure pick it up and attribute it correctly.

Data availability statements close the loop

Many publishers now require a data availability statement, a short passage telling readers where the underlying data can be found and under what conditions. Done well, the statement names the repository and gives the persistent identifier, linking the prose of the article to the formal citation in the reference list. Done poorly, it says only that data are available on request, which research has repeatedly shown to be an unreliable route to access. A good availability statement and a properly formatted data citation are two halves of the same commitment: that the evidence behind a study can actually be found and reused.

Bringing it together

The Joint Declaration did not invent the idea that data deserve credit, but it gave the community a shared, citable reference point. The practical implications are modest and achievable: assign a persistent identifier, capture the version, put the citation in the reference list, and write a data availability statement that points to it. Standards bodies and metadata schemas, including the work catalogued in the CASRAI data dictionary and contributor frameworks such as CRediT, give the surrounding vocabulary to describe who did what. The principles themselves are a reminder that data are not a by-product of research but, increasingly, one of its most valuable outputs.
June 21, 2026

Big Data and the Vs of Data Explained for Research

Big data refers to datasets so large, fast-moving or varied that traditional database tools cannot capture, store or analyse them within a reasonable time. It is defined less by an exact size threshold than by a set of characteristics, usually summarised as the “Vs”, and by the distributed computing methods needed to process it. In research, big data spans genomics, sensor networks, clinical records, social media and large-scale simulations.

The defining Vs of big data

The concept began with three Vs and has since expanded. The table below sets out the five most widely cited.

Characteristic	Meaning	Research example
Volume	The sheer quantity of data, from terabytes to petabytes and beyond	Whole-genome sequencing across cohorts
Velocity	The speed at which data is generated and must be processed	Real-time readings from environmental sensors
Variety	The mix of formats: structured, semi-structured and unstructured	Combining tables, images, text and audio
Veracity	The trustworthiness, accuracy and completeness of the data	Cleaning noisy or missing clinical records
Value	The usefulness of insights that can be extracted	Identifying disease risk factors at scale

Volume, velocity and variety were the original three, capturing the scale, speed and heterogeneity that overwhelm conventional tools. Veracity was added to stress that more data is not automatically better data; noise, bias and gaps must be managed. Value reminds us that the point of all this effort is actionable insight, not collection for its own sake.

Distributed processing: how big data is handled

No single machine can hold or analyse a petabyte efficiently, so big data relies on distributed processing: spreading storage and computation across clusters of many machines that work in parallel. The foundational pattern was MapReduce, which splits a task into pieces, processes them across nodes, then combines the results. Frameworks such as Apache Hadoop and, later, Apache Spark made this approach mainstream, with Spark adding in-memory processing for far greater speed. Cloud platforms now offer this elasticity on demand, letting researchers scale resources to the dataset rather than the other way round.
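The split, process, combine pattern can be shown with a toy word count. This is a single-process sketch of the idea only: the list of chunks stands in for data spread across nodes, the function names are illustrative, and it uses neither Hadoop nor Spark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key so each key can be reduced alone."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk stands in for the share of the data held by one node.
chunks = ["big data big", "data moves fast", "big clusters"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because the map step looks at one chunk at a time and the reduce step at one key at a time, both could in principle run on separate machines, which is what a real framework arranges.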

Big data in research, and its pitfalls

Used well, big data lets researchers detect patterns invisible at small scale, model complex systems and test hypotheses across enormous samples. But scale brings risks. Large datasets can be biased or unrepresentative despite their size, and the volume can lull analysts into ignoring how the data was collected. Crucially, big data does not suspend statistical thinking: with millions of observations, almost any difference becomes statistically significant, which is exactly why effect size matters more than ever, and why a small p-value on its own means little. Big data also fuels machine learning, where larger samples help guard against the overfitting that plagues models trained on too little.
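The point about significance and effect size can be made concrete with a small calculation. The sketch below assumes a two-sample z-test with equal group sizes and a known common standard deviation; the mean difference and standard deviation are illustrative values, not real data.

```python
import math


def two_sample_z_test(mean_diff: float, sd: float, n_per_group: int):
    """Return (z, two-sided p-value) for two equal-sized groups with common sd."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def cohens_d(mean_diff: float, sd: float) -> float:
    """Standardised effect size: difference in means divided by the sd."""
    return mean_diff / sd


mean_diff, sd = 0.01, 1.0  # illustrative values, not real data
for n in (100, 1_000_000):
    z, p = two_sample_z_test(mean_diff, sd, n)
    print(f"n={n:>9,}  z={z:6.2f}  p={p:.3g}  d={cohens_d(mean_diff, sd):.2f}")
```

The effect size is the same tiny 0.01 standard deviations in both runs; only the sample size changes, and with it the p-value moves from clearly non-significant to vanishingly small.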

Big data and FAIR principles

The promise of big data depends on the data being usable, and that is where the FAIR principles (that data should be Findable, Accessible, Interoperable and Reusable) become essential. Findability requires rich metadata and persistent identifiers. Interoperability requires shared vocabularies, the kind standardised in the CASRAI dictionary, so that varied sources can be combined meaningfully. Reusability requires clear provenance and licensing. Without these foundations, a large dataset is merely a large liability. Our broader work on standards and metadata, including our guidance for authors and our reproducibility coverage, sets out how to make big research data dependable rather than just big.

Frequently asked questions

How big does data have to be to count as big data?

There is no fixed size. Big data is defined by characteristics, the Vs, rather than a threshold. The practical test is whether traditional tools struggle to store or process it within a useful timeframe.

What are the original three Vs?

Volume, velocity and variety: the scale of the data, the speed at which it arrives, and the diversity of its formats. Veracity and value were added later to address quality and usefulness.

Why is veracity important?

Because size does not guarantee quality. Large datasets can contain errors, bias, duplicates and missing values. Veracity emphasises assessing and improving trustworthiness before drawing conclusions.

How does big data relate to FAIR data?

FAIR principles make big data usable by ensuring it is Findable, Accessible, Interoperable and Reusable. Persistent identifiers and shared vocabularies, such as those in the CASRAI dictionary, let varied large datasets be combined and reused reliably.

June 20, 2026

Digital sustainability: the environmental cost of data storage and preservation

The instinct in modern research is to keep everything. Storage is cheap, deletion feels risky, and the principles of openness and reproducibility seem to counsel retaining as much as possible for as long as possible. But this instinct conceals a real and growing cost. Storing data, running computations and preserving digital material for the long term all consume energy, and energy carries a carbon footprint. The cloud is not a weightless abstraction; it is data centres drawing power and demanding cooling, somewhere, continuously. As research becomes ever more data-intensive, the environmental cost of its digital life — storage, computation, preservation — can no longer be treated as invisible. Digital sustainability is the discipline of taking that cost seriously, and it is the subject of this article, which draws on the sustainable-research domain of the CASRAI Dictionary.

The hidden cost of keeping everything

The first thing digital sustainability asks us to see is that “keep it just in case” is not a cost-free default. Every dataset retained indefinitely occupies storage that must be powered, cooled, maintained, migrated to new media over time, and backed up — and the aggregate of countless such decisions across the research system is substantial. There is a real tension here with the open-data ideal. The drive to make data findable and reusable is valuable, but it can shade into digital hoarding: keeping vast quantities of low-value data on the vague principle that more is always better, without asking whether a dataset is worth its ongoing cost. The FAIR principles call for data to be findable and reusable — not for everything to be kept forever regardless of value. Distinguishing data worth preserving from data that need not be is itself an act of stewardship, not a betrayal of openness.

Appraisal and data minimisation

The practices that respond to this are appraisal and data minimisation. Appraisal — long established in the archival and records-management traditions — is the disciplined process of deciding what to keep, for how long, and what may responsibly be discarded, based on enduring value rather than reflex. Data minimisation, familiar also from data protection, is the principle of collecting and retaining only what is genuinely needed. Applied to research, these practices mean making conscious decisions: which raw data must be preserved to support published results and which intermediate files can be regenerated if ever needed; which datasets have lasting reuse value and which were transient. This is not an argument for carelessly deleting valuable data — the cost of losing irreplaceable data far exceeds the cost of storing it. It is an argument for deciding, deliberately and well, rather than defaulting to indiscriminate retention. Good appraisal keeps what matters and lets go of what does not, serving both sustainability and the long-term usability of the record.

Green software and computation

Storage is only part of the picture; computation has its own footprint. The green software movement, advanced by organisations such as the Green Software Foundation, aims to reduce the environmental impact of software itself. A central concept is Software Carbon Intensity (SCI), a specification for measuring the carbon emissions associated with running software, so that the impact can be quantified, compared and reduced rather than guessed at. For research, the principles translate into practical questions: is a computation more resource-intensive than it needs to be; is it run repeatedly when results could be cached; is the workload run where and when the energy is cleaner? Efficient, well-considered computation is not only cheaper and faster but less carbon-intensive, and measuring impact, as SCI encourages, is the precondition for managing it.
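As a rough illustration of what measuring impact involves, the SCI specification expresses its score as SCI = ((E × I) + M) per R, where E is the energy consumed, I the carbon intensity of that energy, M the embodied emissions and R the functional unit. The sketch below applies that formula; the function name, parameter names and all numbers are assumptions for illustration, not measurements.

```python
def software_carbon_intensity(
    energy_kwh: float,
    grid_intensity_g_per_kwh: float,
    embodied_g: float,
    functional_units: float,
) -> float:
    """SCI = ((E * I) + M) per R, in grams CO2-equivalent per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * grid_intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units


# Illustrative numbers only: the same analysis job run on two different grids.
job = dict(energy_kwh=12.0, embodied_g=500.0, functional_units=100.0)
sci_high = software_carbon_intensity(grid_intensity_g_per_kwh=400.0, **job)
sci_low = software_carbon_intensity(grid_intensity_g_per_kwh=50.0, **job)
print(sci_high, sci_low)
```

Holding the job constant and changing only the grid intensity shows why where and when a workload runs can matter as much as how efficient the code is.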

Preservation that lasts: OAIS

Sustainability is not only about using less; it is also about preserving well, so that what is kept genuinely endures and the energy spent keeping it is not wasted. The reference model for long-term digital preservation is OAIS — the Open Archival Information System reference model — which provides a framework for what a trustworthy digital archive must do to preserve information over the long term and keep it accessible and understandable to future users. OAIS matters to digital sustainability in two ways. First, preservation is itself an ongoing activity with an environmental cost, and doing it according to a sound model means that cost buys real durability rather than slow decay. Second, preserving fewer things well — properly described, in sustainable formats, in a trustworthy archive — is far better, environmentally and intellectually, than preserving many things badly, where data accumulates and yet quietly becomes unusable through neglect. Good preservation and disciplined appraisal are two sides of the same sustainable practice.

Sustainability and FAIR, properly understood

None of this is in conflict with FAIR or with open research, properly understood. FAIR is about good stewardship — making the data that is worth keeping findable, accessible, interoperable and reusable — not about hoarding. A sustainable approach is, in fact, a more honest expression of FAIR: it concentrates effort on the data that genuinely merits it, rather than spreading thin attention and real resources across everything indiscriminately. Sustainability and good data stewardship point in the same direction: keep what matters, describe it well, preserve it properly, and let go of what does not earn its keep.

A consistent vocabulary for digital sustainability

For sustainable practice to be applied consistently — across repositories, institutions and funders — the concepts involved, such as retention periods, appraisal decisions, preservation levels and format requirements, must be described in ways that mean the same thing everywhere. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that decisions about what to keep, how to preserve it and for how long are understood the same way wherever they are recorded. And because appraising, curating and preserving data well is genuine, skilled work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. The most sustainable digital research is not the research that stores the least, but the research that decides most carefully what is worth keeping — and then keeps it well.

June 19, 2026

Genomic Data-Sharing Standards: GA4GH and Responsible Access Explained

Genomic data sharing is the responsible exchange of genetic data between researchers and repositories using common standards for file formats, metadata, consent and access control. Because genetic data is sensitive and richly structured, sharing it usefully depends on agreed technical standards and clear governance rather than ad-hoc file transfers.

This article describes how genetic and genomic data is shared from a data-standards and governance perspective. It is not clinical genetics advice; the focus throughout is notation, metadata, interoperability and access frameworks.

The Global Alliance for Genomics and Health

The Global Alliance for Genomics and Health (GA4GH) is an international standards organisation that develops frameworks and technical specifications to enable responsible genomic data sharing. Its work spans both governance — such as consent and data-access policy frameworks — and technical interoperability standards that allow systems to exchange genomic data and query it consistently.

The value of a shared standards body is that institutions in different countries can align on common interfaces and metadata conventions, so a dataset described and stored according to GA4GH-aligned conventions can be discovered and accessed by authorised researchers elsewhere. Controlled vocabularies underpinning these descriptions are the kind of structured terms recorded in the CASRAI dictionary.

FAIR principles in a genomics context

Genomic data sharing is closely aligned with the FAIR principles: data should be findable, accessible, interoperable and reusable. In genomics, “accessible” does not mean open to everyone; it means accessible under clearly defined and machine-readable conditions, which often include authorisation and consent checks.

FAIR principle	Genomics interpretation
Findable	Datasets carry persistent identifiers and rich, searchable metadata
Accessible	Access is defined by clear, often controlled, machine-readable conditions
Interoperable	Standard formats and shared vocabularies allow systems to exchange data
Reusable	Consent terms, provenance and licensing are documented for re-analysis

Consent, controlled access and data archives

Much genetic data is held in controlled-access archives rather than fully open repositories. Under this model, descriptive metadata may be openly browsable while the underlying genetic data is released only to researchers whose project and credentials have been reviewed and approved by a data-access committee.

Consent is the cornerstone of this governance. The terms under which data was originally collected determine how it may later be shared and reused, so consent metadata must travel with the data. This makes documented provenance — who collected the data, under what consent, and with what permitted uses — an essential part of responsible sharing.

File and metadata formats

Interoperability in genomics rests on standardised file formats for sequence reads and variants, paired with structured metadata describing the sample, the experiment and the access conditions. Consistent formats let independent groups validate, re-align and re-analyse data, supporting the goals discussed across our reproducibility coverage. Persistent identifiers tie datasets to their originating studies and contributors, as explained in our note on persistent identifiers in 2026.

The same emphasis on stable identifiers and structured notation appears when recording protein information; see our companion guide on amino acids and protein data notation. For broader context, browse our data-infrastructure news and the guidance for authors on describing datasets.

Frequently asked questions

What is GA4GH?

The Global Alliance for Genomics and Health is an international standards organisation that develops governance frameworks and technical specifications to enable responsible genomic data sharing across institutions and borders.

Does sharing genomic data mean making it openly available to everyone?

No. Responsible sharing usually means controlled access: descriptive metadata may be browsable, but the underlying genetic data is released only to authorised researchers whose projects and credentials have been reviewed and approved.

How do FAIR principles apply to genetics data?

FAIR principles require genetic data to be findable through persistent identifiers and metadata, accessible under clearly defined conditions, interoperable through standard formats, and reusable with documented consent, provenance and licensing.

Why does consent metadata matter for data sharing?

Consent determines the permitted uses of data. Because those terms govern future reuse, consent and provenance information must accompany the data so that downstream researchers only use it within the agreed conditions.

June 18, 2026

Amino Acids: Notation, Protein Data and How Sequences Are Recorded

Amino acids are small organic molecules that join together in chains to build proteins, and the 20 standard amino acids form the common alphabet used to write and share protein sequence data. Each amino acid carries a standard one-letter and three-letter abbreviation, giving researchers an unambiguous notation for recording sequences in databases, publications and data-exchange formats.

From a data-infrastructure perspective, amino acids matter less as chemistry and more as a controlled vocabulary: a fixed set of symbols that lets sequence information move reliably between laboratories, repositories and software tools without loss of meaning.

The 20 standard amino acids and their notation

Proteins are built from 20 standard amino acids, each of which has a residue name, a three-letter code and a single-letter code. The single-letter codes are the backbone of compact sequence notation, allowing a protein of several hundred residues to be written as one continuous string of letters.

Amino acid	Three-letter	One-letter
Alanine	Ala	A
Glycine	Gly	G
Leucine	Leu	L
Serine	Ser	S
Tryptophan	Trp	W
Tyrosine	Tyr	Y

The full set of 20 comprises alanine, arginine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine and valine; the table above shows a selection. Standardised codes mean that a sequence recorded in one system is read identically in another, which is the foundation of interoperable protein data. Consistent notation of this kind is exactly the type of controlled term documented in the CASRAI dictionary.
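Because the alphabet is fixed, converting between the two notations is a simple lookup. The sketch below maps the standard one-letter codes to their three-letter equivalents and back; the function names and the hyphen-separated output are illustrative choices, not a repository convention.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(sequence: str) -> str:
    """Expand a one-letter sequence, e.g. 'GAS' -> 'Gly-Ala-Ser'."""
    try:
        return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())
    except KeyError as err:
        raise ValueError(f"Not a standard amino-acid code: {err.args[0]!r}") from None


def to_one_letter(residues: str) -> str:
    """Collapse a hyphen-separated three-letter sequence to one letter each."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues.split("-"))


print(to_three_letter("GASWY"))
print(to_one_letter("Gly-Ala-Ser-Trp-Tyr"))
```

Rejecting any symbol outside the 20 standard codes, rather than passing it through, is what keeps the notation a controlled vocabulary in practice.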

From sequence to structure: how protein data is recorded

Protein data exists at two complementary levels. Sequence data describes the linear order of amino-acid residues, while structure data describes the three-dimensional arrangement of atoms once the chain has folded. Both layers need stable identifiers and agreed formats so that records remain findable and reusable over time.

Sequence records are commonly written in FASTA format, a plain-text convention in which a header line beginning with ">" carries an identifier and the following lines hold the one-letter residue string. Structure records use atomic-coordinate formats such as the legacy PDB format and PDBx/mmCIF, capturing the position of each atom rather than only the residue order.
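The FASTA convention is simple enough to read with a few lines of code. The following is a minimal sketch, assuming the identifier is the first whitespace-delimited token after the ">" marker; the `parse_fasta` helper, the identifiers and the residue strings are made up for illustration, and production work would normally rely on an established library.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}.

    The identifier is taken as the first whitespace-delimited token of the
    header line; sequence lines are concatenated until the next header.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header line has no identifier")
            current_id = header[0]
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("Sequence data found before any header line")
        else:
            records[current_id] += line
    return records


# Made-up identifiers and residues, for illustration only.
example = """>example_protein_1 hypothetical demo record
MKTAYIAKQR
QISFVKSHFS
>example_protein_2
GASWY
"""
sequences = parse_fasta(example)
print(sequences)
```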

UniProt and the Protein Data Bank

Two long-standing resources anchor protein data sharing. UniProt is a comprehensive, curated repository of protein sequence and functional information, assigning persistent accession identifiers to protein entries. The Protein Data Bank (PDB) is the established archive for experimentally determined three-dimensional structures of proteins and other biological macromolecules.

Resource	Primary content	Identifier role
UniProt	Protein sequences and functional annotation	Stable accession per protein entry
Protein Data Bank (PDB)	3D structural coordinates	Stable entry identifier per structure

Both resources illustrate good practice for research data infrastructure: persistent identifiers, structured metadata and open access to underlying records. Linking a sequence accession to a structure entry creates a navigable web of evidence, much as persistent identifiers connect outputs across the wider scholarly record described in our overview of persistent identifiers in 2026.

Why standard notation supports reproducibility

Because the amino-acid alphabet is fixed and the abbreviations are standardised, protein data aligns naturally with the FAIR principles — findable, accessible, interoperable and reusable. A sequence written in standard one-letter notation can be searched, aligned and compared across repositories without manual reconciliation, and a structure deposited with rich metadata can be revisited by independent researchers. This connects protein data to the broader agenda covered in our reproducibility news, and to the related question of how genomic data is shared responsibly in genomic data-sharing standards explained. For practical guidance on citing and describing such resources, see our guidance for authors.

Frequently asked questions

How many standard amino acids are there?

There are 20 standard amino acids that serve as the common building blocks of proteins. Each has an agreed three-letter and one-letter abbreviation, forming a fixed alphabet for recording and sharing sequence data.

What is the difference between one-letter and three-letter amino-acid codes?

Three-letter codes such as Ala or Gly are readable abbreviations often used in text and structural records, while one-letter codes such as A or G create compact sequence strings ideal for databases and alignment software. Both refer to the same residues.

What do UniProt and the PDB store?

UniProt stores curated protein sequence and functional information with stable accession identifiers, while the Protein Data Bank stores experimentally determined three-dimensional structures with their own persistent entry identifiers. Together they cover the sequence and structure layers of protein data.

How do amino-acid standards support FAIR data?

A fixed notation and well-described repository records make protein data findable, accessible, interoperable and reusable. Standard codes remove ambiguity, so sequences and structures can be exchanged and compared across systems without loss of meaning.

June 18, 2026

FAIR Principles for Research Data Explained

FAIR data refers to research data managed according to four guiding principles — Findable, Accessible, Interoperable and Reusable — designed to maximise the value of data for both humans and machines. The principles were set out by Mark Wilkinson and colleagues in a landmark 2016 paper in Scientific Data and have since been adopted widely by funders, publishers and research institutions as a benchmark for good data stewardship. FAIR describes how data should be described, shared and preserved so that it can be discovered and reused long after a project ends.

A common misconception is that FAIR means “open”. It does not. FAIR is about good management and clear conditions of use; data can be FAIR while access remains controlled, which matters for sensitive or personal data.

What each principle means

The four principles work together; the order spells the acronym rather than prescribing a strict sequence. Each rests heavily on metadata and persistent identifiers.

Principle	Core idea	Key enablers
Findable	Data and metadata are easy to locate by humans and machines	Persistent identifiers (e.g. DOIs), rich metadata, indexing
Accessible	Once found, data can be retrieved by a clear, open protocol	Standard protocols; metadata stays available even if data are restricted
Interoperable	Data can be combined and used with other data and systems	Shared vocabularies, standard formats, controlled terminologies
Reusable	Data are richly described and licensed for reuse	Clear licences, provenance, community standards and metadata

Findable requires that data and metadata carry globally unique, persistent identifiers and are described well enough to be indexed and searched. Accessible means the data can be retrieved using a standardised, open communication protocol, with authentication where needed — and, importantly, that metadata remain accessible even when the underlying data are not. Interoperable calls for data to use shared, standard formats and vocabularies so they can be integrated with other datasets and processed by different systems. Reusable requires rich description, clear provenance and an explicit usage licence so others can confidently build on the data.

The role of persistent identifiers and metadata

Two enablers run through all four principles: persistent identifiers and metadata. A persistent identifier — such as a DOI for a dataset or an ORCID for a researcher — provides a stable, resolvable reference that does not break when URLs change, underpinning findability and provenance. Metadata — structured information describing what the data are, how they were produced, and under what terms they may be used — is what makes data discoverable, interpretable and reusable. Crucially, FAIR treats metadata as valuable in its own right: rich, standardised metadata can remain open and findable even when the dataset itself is access-controlled. This is precisely the kind of standardised description that shared vocabularies, such as the CASRAI dictionary, and broader data infrastructure are built to support.
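One way to make these enablers tangible is a simple completeness check on a metadata record. The sketch below is not an official FAIR assessment: the field names, their grouping by principle and the example record (including its placeholder DOI) are all hypothetical.

```python
from typing import Dict, List

# Hypothetical field names, grouped by the principle they mainly support.
REQUIRED_FIELDS = {
    "Findable": ["identifier", "title", "description"],
    "Accessible": ["access_protocol", "access_conditions"],
    "Interoperable": ["format", "vocabulary"],
    "Reusable": ["licence", "provenance"],
}


def missing_fair_fields(metadata: Dict[str, str]) -> Dict[str, List[str]]:
    """Return, per principle, the expected fields that are absent or empty."""
    gaps: Dict[str, List[str]] = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not metadata.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


record = {
    "identifier": "https://doi.org/10.1234/example.5678",  # placeholder DOI
    "title": "Example cohort survey",
    "description": "Illustrative record for a controlled-access dataset.",
    "access_protocol": "https",
    "access_conditions": "Controlled: application to a data-access committee",
    "format": "text/csv",
    "vocabulary": "",  # left empty to show a gap being reported
    "provenance": "Collected 2023 by the example study team",
}
gaps = missing_fair_fields(record)
print(gaps)
```

Note that the record describes a controlled-access dataset yet passes the Findable and Accessible checks: the metadata is complete and open even though the data are not.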

FAIR versus open

FAIR and open are related but distinct. Open data is data anyone can freely access, use and redistribute. FAIR data is well-managed, well-described data with clear access conditions, which may or may not be open. The phrase often paired with FAIR in funder guidance, notably the European Commission's, is “as open as possible, as closed as necessary”, and it captures the balance: maximise reuse while respecting legitimate constraints such as privacy, consent, commercial sensitivity or Indigenous data rights. A dataset of patient records can be made FAIR (richly described, identified, governed and licensed) without being openly downloadable. Conversely, dumping a file online makes it open but not necessarily FAIR if it lacks identifiers, metadata or a licence.

For researchers, adopting FAIR practice means assigning identifiers, writing good metadata, using standard formats and stating licences from the outset rather than at the end of a project. Guidance on preparing and describing data is available in our resources for authors, and FAIR data underpins the reproducibility goals discussed across our research-outputs coverage.

Frequently asked questions

What does FAIR stand for?

FAIR stands for Findable, Accessible, Interoperable and Reusable. The four principles, published by Wilkinson and colleagues in 2016, describe how research data and metadata should be managed so they can be discovered, retrieved, combined and reused effectively by both humans and machines.

Does FAIR mean the same as open data?

No. Open data can be freely accessed and reused by anyone, whereas FAIR data is well-described and well-managed with clear access conditions that may be restricted. The guiding phrase is “as open as possible, as closed as necessary”, so sensitive data can still be FAIR.

Why are persistent identifiers important for FAIR data?

Persistent identifiers such as DOIs and ORCIDs provide stable, resolvable references that do not break when web addresses change. They underpin findability and provenance, letting data, researchers and outputs be reliably located and credited over the long term.

Can data be FAIR without being publicly downloadable?

Yes. FAIR requires clear access protocols and rich metadata, not unrestricted access. Metadata can remain findable and accessible even when the underlying dataset is controlled, so sensitive datasets can be made FAIR while access stays appropriately governed.

June 18, 2026

Te Mana Raraunga: Māori data sovereignty as a regional model for Indigenous data governance
The global conversation about Indigenous data governance has, in recent years, found a powerful shared language in the CARE Principles for Indigenous Data Governance — Collective benefit, Authority to control, Responsibility and Ethics. CARE provides an internationally recognised frame, articulated by the Global Indigenous Data Alliance (GIDA), that positions Indigenous peoples’ rights and interests at the centre of how data about them is governed. But principles at the global level are realised in particular places, by particular peoples, grounded in particular relationships and legal traditions. One of the most developed expressions of Indigenous data sovereignty anywhere is Te Mana Raraunga, the Māori Data Sovereignty Network in Aotearoa New Zealand, whose work shows how a regional model can give CARE concrete meaning while standing on foundations all its own. This article examines that model, drawing on the Indigenous data and CARE domain of the CASRAI Dictionary.

What Te Mana Raraunga is

Te Mana Raraunga is a network advocating for Māori rights and interests in data — for Māori data sovereignty. The phrase te mana raraunga itself speaks to the authority and integrity that attach to data, and the network exists to assert that Māori, as the people to whom much of that data relates, have legitimate rights of governance over it. Its concerns span how Māori data is collected, who controls it, how it is used, and whether its use serves Māori aspirations or merely extracts from Māori communities. The network has been instrumental in defining what Māori data sovereignty means in practice and in pressing institutions to recognise and respect it. It represents not an abstract ideal but an organised, articulate movement with a developed body of principles.

Grounded in Te Tiriti o Waitangi

What distinguishes the Māori model, and makes it more than a local application of a global frame, is its foundation in Te Tiriti o Waitangi — the Treaty of Waitangi, the founding constitutional document of Aotearoa New Zealand. Te Tiriti establishes a relationship between Māori and the Crown and affirms Māori authority over their own affairs and treasured things. Māori data sovereignty draws directly on this: if Māori hold authority over their taonga (treasures) and their own domains, then data about Māori — their people, lands, language and knowledge — falls within the scope of that authority. This gives Māori data sovereignty a distinctive constitutional grounding that an appeal to general principle does not: the argument is not only ethical but rests on a recognised relationship and a foundational agreement, which lends the model particular force.

How the regional model complements CARE

The relationship between Te Mana Raraunga and the global CARE frame is complementary, not competitive, and understanding how illuminates Indigenous data governance more broadly.
- CARE provides the shared language. Collective benefit, authority to control, responsibility and ethics give a vocabulary recognisable across borders and useful for engaging international infrastructures and institutions.
- The regional model provides the grounding. Te Tiriti gives Māori data sovereignty a specific constitutional foundation and a specific people whose authority is being asserted, turning general principle into concrete, situated claim.
- Each strengthens the other. The global frame lends regional movements international recognition and solidarity; regional models like the Māori one give the global principles tested, real-world expression and demonstrate that they can be operationalised.
This is why Indigenous data sovereignty is best understood as a family of grounded movements connected by shared principles, rather than a single uniform doctrine.

Distinct from other Indigenous data frameworks

It is important not to flatten the diversity of Indigenous data governance into one model. The Māori approach is distinct, for example, from Canada’s widely cited OCAP principles — Ownership, Control, Access and Possession — developed by and for First Nations in a different constitutional and historical context. OCAP and Māori data sovereignty share a commitment to Indigenous authority over Indigenous data, but they arise from different peoples, different legal foundations and different histories, and they are expressed differently. Recognising this matters: Indigenous peoples are not interchangeable, and good practice does not lift a framework from one context and impose it on another. The right model is the one grounded in the rights, relationships and aspirations of the specific people concerned. The global CARE frame accommodates this diversity precisely because it sets principles rather than prescribing a single mechanism.

CARE alongside FAIR

Indigenous data sovereignty also reshapes how the familiar FAIR principles — Findable, Accessible, Interoperable, Reusable — are understood. FAIR is concerned with the technical qualities that make data useful and reusable, but it is largely silent on questions of power, consent and benefit. CARE, and grounded models like Te Mana Raraunga, supply exactly what FAIR leaves out: who decides, who benefits, and on whose authority data is used. The two are meant to operate together — data can be FAIR and CARE at once — but where they pull in different directions, the Māori model is clear that authority and collective benefit are not negotiable conveniences. Making data maximally open is not a virtue if it overrides the rights of the people the data concerns.

A consistent vocabulary for grounded governance

For Indigenous data governance to be respected across the systems that hold and share research data, the terms involved — consent conditions, governance authority, access and benefit arrangements, provenance — must be described in ways that carry their meaning faithfully wherever the data travels. That consistency is part of what the CASRAI Dictionary works towards: a shared vocabulary so that a governance condition asserted by a community is not quietly lost when its data moves between systems. And because stewarding Indigenous data and partnering with the communities it concerns is genuine, recognisable work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. Te Mana Raraunga shows what Indigenous data sovereignty looks like when it is grounded in a people’s own authority; the global principles show how such grounded models can speak to one another and to the world.

June 16, 2026

Licensing research data: CC-BY, CC0 and when to use each
You can deposit a dataset in a trusted repository, describe it with rich metadata, and give it a DOI — and still leave it effectively unusable, because you forgot the one line that tells a reuser what they are allowed to do with it. A dataset without a clear licence is data nobody can confidently build on: a careful researcher, unsure of the terms, will simply not reuse it. Licensing is therefore not a legal afterthought but the part of the data-infrastructure domain that determines whether a deposit delivers the “R” in FAIR at all. This guide explains the main choices — principally CC0 and CC BY — and when each fits.

Why a licence is the reusability switch

The FAIR principles ask that data be Findable, Accessible, Interoperable, and Reusable — and reusability rests explicitly on data being “released with a clear and accessible data usage licence”. Without a licence, default copyright and database rights leave the legal status ambiguous, and ambiguity is fatal to reuse: a would-be user cannot tell whether combining your data with theirs, redistributing it, or building a tool on it is permitted. An explicit, standard, machine-readable licence resolves that uncertainty in advance, for everyone, without anyone having to ask. That is why “attach an explicit licence” is the step that turns a findable dataset into a reusable one.

The two main choices for data

CC0 — the public-domain dedication

CC0 is a Creative Commons tool by which the rights-holder waives, to the fullest extent the law allows, all copyright and related rights in the work — placing it as close to the public domain as possible. For data, CC0 means a reuser can use, combine, modify, and redistribute the data with no conditions at all, including no obligation to attribute. This is widely recommended as the default for research data, and for a specific reason: data are routinely aggregated from many sources, and attribution requirements that stack up across hundreds of datasets (“attribution stacking”) can become legally and practically unworkable. CC0 removes that friction entirely and maximises interoperability. Several major data repositories and infrastructures apply CC0 by default for exactly this reason.

Importantly, CC0 waives legal requirements, not scholarly norms. Citing the data you use remains an academic and ethical expectation regardless of the licence — CC0 simply means that expectation is enforced by the norms of good scholarship rather than by copyright law.

CC BY — attribution required

CC BY permits the same broad reuse — use, adaptation, redistribution, including commercially — but on the single condition that the original creator is credited. For data, CC BY is appropriate where attribution matters enough to be a legal condition, or where a funder or institution requires it. It is the most permissive of the conditional Creative Commons licences and is the default for many open-access publications. The trade-off relative to CC0 is precisely the attribution clause: it guarantees credit, but it reintroduces the attribution-stacking problem when many datasets are combined.

Choosing between them
- Prefer CC0 for data intended for the widest possible aggregation and reuse, especially where the data will be merged with many other sources. It maximises interoperability and removes legal friction; rely on citation norms for credit.
- Choose CC BY where attribution must be a legal condition, where a funder or repository mandates it, or where the dataset is a discrete, citable product whose creators need enforceable credit.
- Be cautious with more restrictive clauses. Non-commercial (NC) and No-Derivatives (ND) terms substantially limit reuse and can render data incompatible with other open data; they are generally discouraged for research data unless a specific ethical or legal constraint demands them.

Data are not software: a critical caveat

Creative Commons licences are designed for content (text, images, and data), and Creative Commons itself advises against using them for software. Software has needs that CC licences do not address: patent grants, the distinction between source and compiled code, and copyleft mechanics. For code, use a recognised software licence instead: a permissive one such as MIT, BSD, or Apache 2.0, or a copyleft one such as the GPL. If your deposit bundles a dataset and the code that processes it, license each part appropriately: a CC licence (or CC0) for the data, an OSI-approved software licence for the code. Conflating the two is one of the most common licensing mistakes in research deposits.

A practical checklist
1. Confirm you have the right to license the data. Check funder terms, any data-sharing agreements, third-party data within your dataset, and, for personal or sensitive data, consent and governance constraints. A licence cannot grant rights you do not hold.
2. Default to CC0 for data unless there is a positive reason to require attribution; choose CC BY where there is.
3. License software separately with an OSI-approved licence; never put code under a Creative Commons licence.
4. State the licence explicitly in the deposit metadata and in any data availability statement, using the standard licence identifier so it is machine-readable.
5. Cite the data you reuse regardless of its licence — the scholarly norm holds even when the law does not require it.
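Steps 2 to 4 of the checklist can be sketched as a small helper that picks a licence for each part of a deposit and returns it as a standard identifier. This is only an illustration under stated assumptions: `DepositPart` and `choose_licence` are hypothetical names, the default software licence is an arbitrary choice, and the strings used are the SPDX identifiers for CC0 1.0, CC BY 4.0 and MIT.

```python
from dataclasses import dataclass

# SPDX identifiers for the licences discussed in this guide.
CC0 = "CC0-1.0"
CC_BY = "CC-BY-4.0"


@dataclass
class DepositPart:
    name: str
    kind: str  # "data" or "software"
    attribution_required: bool = False  # e.g. mandated by a funder


def choose_licence(part: DepositPart, software_licence: str = "MIT") -> str:
    """Apply the checklist: CC0 by default for data, CC BY when attribution
    must be a legal condition, and a software licence for code."""
    if part.kind == "software":
        return software_licence
    if part.kind == "data":
        return CC_BY if part.attribution_required else CC0
    raise ValueError(f"unknown kind: {part.kind!r}")


# Hypothetical deposit bundling a dataset with the script that cleans it.
deposit = [
    DepositPart("survey.csv", "data"),
    DepositPart("funded_panel.csv", "data", attribution_required=True),
    DepositPart("clean.py", "software"),
]
licences = {part.name: choose_licence(part) for part in deposit}
print(licences)
```

The helper does not cover step 1: whether you hold the rights to license the data at all is a question no function can answer for you.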

How this connects to contribution and credit

Licensing answers “what may be done with this output?”; it is a sibling of the question “who made it?”, which the CRediT taxonomy answers. A dataset’s intellectual work is recorded on the associated paper through roles such as Data curation and Investigation, while the licence governs downstream reuse of the artefact itself. Used together — a clear licence on the data and clear contribution roles on the people — they ensure both the dataset and its creators are properly accounted for.

Where shared vocabulary fits

“CC0”, “CC BY”, “public domain”, “attribution”, and “reuse” are interpreted differently across repositories and funders, which undermines the very interoperability that licensing is meant to enable. A shared, federated vocabulary that defines these terms precisely — pointing back to Creative Commons for the licences and to the FAIR principles for the reusability requirement — is what lets a licence chosen for one repository be understood correctly in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

June 15, 2026

FAIR data in practice: making research data findable and reusable
“FAIR data” is one of the most cited phrases in modern research data management, and one of the most frequently misunderstood. It is invoked in funder policies, journal requirements, and data management plans, often as a synonym for “put it online”, which is not what it means. Understanding what Findable, Accessible, Interoperable and Reusable actually require in practice is what turns FAIR from a slogan into a set of concrete actions. It is a foundational concern of the data-infrastructure domain. The background is set out in the companion explainer on what FAIR data is; this article is about doing it.

Where FAIR came from, and what it is for

The FAIR Guiding Principles for scientific data management and stewardship were published in 2016 by a broad group of researchers, publishers, and funders, and have since been adopted across the research landscape. Their purpose is specific and worth holding onto: FAIR is about making data usable by machines as well as people. A dataset that a human can eventually make sense of after emailing the author is not FAIR; a dataset that automated systems can find, access, combine, and reuse with minimal human intervention is. The principles describe four properties a dataset and, crucially, its metadata should have.

The four principles, in practice

Findable. Data cannot be reused if it cannot be found. In practice this means the dataset is deposited somewhere with a search index, is described by rich metadata, and — the linchpin — is assigned a globally unique, persistent identifier, typically a DOI. The metadata should be indexed and searchable, and should itself record the identifier. Findability is the property a hard drive or a personal website fundamentally cannot provide; a trusted repository is what supplies it.

Accessible. Once found, the data — or at least its metadata — must be retrievable through a standard, open protocol. Accessibility is the principle most often misread, so it is worth being precise: it does not mean the data must be open to all. It means the conditions of access are explicit and the retrieval mechanism is standardised. Sensitive data may be available only under controlled access, with an authentication and authorisation procedure — and that is still FAIR, provided the rules are clear and the metadata remains accessible even when the data themselves are not. The principle also asks that metadata persist even after the data are no longer available, so that a record of what existed survives.

Interoperable. Data are interoperable when they can be combined with other data and processed by tools without bespoke translation. In practice this means using standard, open file formats rather than proprietary ones; using shared vocabularies, ontologies, and standards to describe variables, so that “sex” or “temperature” mean the same thing across datasets; and including qualified references to other data and metadata, so a dataset declares its relationships rather than leaving them implicit. Interoperability is what lets datasets be aggregated and analysed at scale.
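To make the shared-vocabulary point concrete, here is a minimal sketch in which two labs record temperature under different column names and units, and a mapping brings both to one agreed term. The column names, the mapping table and the `harmonise` function are all hypothetical; a real project would take its terms from a community vocabulary or ontology rather than a hand-written dictionary.

```python
# Hypothetical mapping: local column name -> (shared term, unit as recorded).
LOCAL_TO_SHARED = {
    "temp_c": ("temperature", "celsius"),
    "T_kelvin": ("temperature", "kelvin"),
}


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature reading to degrees Celsius."""
    if unit == "celsius":
        return value
    if unit == "kelvin":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit!r}")


def harmonise(row: dict) -> dict:
    """Rename mapped columns to the shared term and convert their units.
    Columns with no mapping are passed through unchanged."""
    out = {}
    for column, value in row.items():
        if column not in LOCAL_TO_SHARED:
            out[column] = value
            continue
        term, unit = LOCAL_TO_SHARED[column]
        out[f"{term}_celsius"] = round(to_celsius(value, unit), 2)
    return out


# Rows from two imagined labs end up with the same column and unit.
lab_a = harmonise({"site": "A", "temp_c": 20.0})
lab_b = harmonise({"site": "B", "T_kelvin": 293.15})
print(lab_a, lab_b)
```

Once both rows carry the same term and unit, they can be aggregated without anyone having to ask what each lab meant.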

Reusable. The ultimate goal. For data to be genuinely reusable, three things are needed: a clear, accessible licence that states what may be done with the data; rich provenance describing where the data came from and how they were processed; and documentation that meets the relevant community standards, so a new user can understand the data well enough to use them correctly. A dataset with no licence is, in practice, not reusable — a cautious researcher will not build on data whose terms are unstated.

FAIR is not the same as open

The single most important clarification is this: FAIR is not a synonym for open. The principles are deliberately silent on whether data must be free to all; they are about how data are described, identified, and licensed, not about removing all access controls. This is precisely what makes FAIR workable for sensitive data (clinical, personal, commercially confidential, or culturally protected) that cannot ethically or legally be made fully open. Such data can be made findable through open metadata and a DOI, accessible under a documented controlled-access procedure, interoperable through standards, and reusable under an explicit licence. The watchword usually attached to FAIR, which comes from European Commission guidance rather than from the principles themselves, is “as open as possible, as closed as necessary.” Conflating FAIR with open leads people either to over-share data they should protect, or to dismiss FAIR as impossible for their field; both are mistakes.

A practical path to FAIR
1. Deposit in a trusted repository — a generalist repository such as Zenodo, Figshare, or Dryad, or a discipline-specific one — rather than a lab server or “available on request.” This delivers findability and a persistent identifier in one step.
2. Write rich metadata. Describe what the data are, how and when they were collected, and what each variable means, using community standards and vocabularies where they exist. The metadata is what machines read; thin metadata is the most common FAIR failure.
3. Use open, standard formats in preference to proprietary ones, so the data can be opened and combined without specialist software.
4. Apply an explicit licence. State clearly what may be done with the data; without this, the dataset is not reusable however well it scores on the other principles.
5. Record provenance and version. Document the data’s origin and processing, and pin versions so that a citation can identify exactly what was used.
6. Set access deliberately. Open where you can, controlled where you must — and keep the metadata accessible either way.
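The six steps above can be turned into a rough pre-deposit check. The sketch below assumes a metadata record held as a plain dictionary; the field names and the `fair_gaps` function are illustrative and do not correspond to any repository's schema. It tests only whether each item is present, not whether it is any good.

```python
# Illustrative field names mapped to the problem reported when one is missing.
CHECKS = {
    "identifier": "no persistent identifier (step 1)",
    "description": "no descriptive metadata (step 2)",
    "file_format": "no file format stated (step 3)",
    "licence": "no explicit licence (step 4)",
    "provenance": "no provenance recorded (step 5)",
    "version": "no version pinned (step 5)",
    "access_conditions": "access conditions not stated (step 6)",
}


def fair_gaps(record: dict) -> list[str]:
    """List the checklist items that are missing or empty in a record."""
    gaps = []
    for field_name, problem in CHECKS.items():
        value = record.get(field_name)
        if value is None or not str(value).strip():
            gaps.append(problem)
    return gaps


# Hypothetical draft record with only the first two steps done.
draft = {
    "identifier": "10.1234/example-dataset",
    "description": "Hourly river temperature readings, 2020 to 2022.",
    "licence": "",
}
for problem in fair_gaps(draft):
    print(problem)
```

An empty list means only that nothing is missing. Thin metadata can pass a presence check and still fail a human reader, so the check complements review rather than replacing it.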

Crediting the work behind FAIR data

Making data FAIR is itself substantial, skilled labour — curating, documenting, standardising, and stewarding a dataset is real intellectual work that too often goes unrecognised. Contributor-role metadata can record it: the CRediT taxonomy includes a dedicated Data curation role, covering the management activities to annotate, scrub, and maintain data for use and reuse. Recording that role on the associated output ensures that the person who did the unglamorous work of making data reusable is credited for it, rather than that effort vanishing into a methods section.

Where shared vocabulary fits

“FAIR”, “findable”, “accessible”, “interoperable”, “reusable”, “metadata”, and “provenance” are used loosely — and “FAIR” is routinely conflated with “open” — which undermines the very interoperability the principles call for. A shared, federated vocabulary that defines these terms precisely is what lets a FAIR claim made in one community be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

June 15, 2026

Electronic lab notebooks and structured record-keeping across the research lifecycle

When we picture the scholarly record, we tend to think of its end products: the published paper, the deposited dataset, the citation. But each of those is the visible tip of a much larger body of work — the active, day-to-day conduct of research, where experiments are designed and run, samples processed, instruments operated and observations recorded. For generations this working phase was captured, if at all, in the paper laboratory notebook: a bound book on a bench, legible only to its author, locked in a drawer, and disconnected from everything else. An immense amount of crucial information about how research is actually done remained invisible to the wider record. The electronic lab notebook and the structured record-keeping practices around it are changing that. This article looks at how, drawing on the research-lifecycle domain of the CASRAI Dictionary.

What an electronic lab notebook is

An electronic lab notebook, or ELN, is software that replaces the paper notebook as the place where researchers record their day-to-day work: experiments, protocols, observations, results and the reasoning behind decisions. At its simplest, an ELN offers obvious practical advantages over paper — it is searchable, backed up, shareable, and resistant to the coffee stains and illegible handwriting that have plagued laboratory science forever. But its deeper significance is that it makes the working record digital and therefore connectable. A paper notebook is an island; an electronic one can be linked to the protocols it follows, the instruments and samples it references, the data files it produces and the people who did the work. The ELN is the point at which the active phase of research enters the connected world that the rest of the record already inhabits.

Capturing the active phase as connected metadata

This is the central idea: the ELN lets the active phase of research be captured as connected metadata rather than disappearing into a drawer. When work is recorded electronically and linked properly, a rich web of relationships can be built around it — this experiment used that protocol; it was performed by these people on that instrument; it consumed these samples and produced these data files; it belongs to this project and contributes to that publication. The working phase stops being a black box between the start of a project and its outputs, and becomes a documented, navigable part of the record. This matters for reproducibility, because others can see exactly how a result was produced; for collaboration, because the record is shared rather than siloed; and for integrity, because the chain from question to result is visible rather than reconstructed after the fact.

FAIR principles for the working record

The same FAIR principles — Findable, Accessible, Interoperable, Reusable — that govern published data apply, with equal force, to the records created during the active phase. An ELN that captures structured, well-described records makes the working record findable and reusable in a way a paper notebook never could be. The principle is that good data management should not begin at the moment of deposit, when a project ends, but should run through the entire lifecycle, starting at the bench. If records are created in a structured, connected form from the outset, preparing data for deposit becomes a matter of harvesting and tidying what already exists, rather than reconstructing it. Good record-keeping during the active phase is, in this sense, the foundation of good data management overall.

Provenance: the PROV standard

A particular strength of structured electronic record-keeping is its capacity to capture provenance — the record of how something came to be: what data was used, what processes acted on it, what agents (people, software, instruments) were involved, and in what order. Provenance is the basis of trust in a result, because it lets others trace exactly how that result was produced and verify each step. The PROV standard provides a formal, machine-readable model for expressing provenance — describing the entities, activities and agents in a process and the relationships between them — so that the chain of how a result was produced can be recorded consistently and understood across systems. An ELN that captures provenance in line with such a standard turns the working record into something far more powerful than a diary: a verifiable account of how knowledge was made.
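The entity, activity and agent model can be sketched in a few lines. The example below uses plain Python rather than any PROV library, and the class `ProvRecord`, its methods and the file names are hypothetical. The three relations it stores borrow their names from PROV (`used`, `wasGeneratedBy`, `wasAssociatedWith`), but this is a simplification for illustration, not a conforming implementation of the standard.

```python
from dataclasses import dataclass, field


@dataclass
class ProvRecord:
    used: dict = field(default_factory=dict)                 # activity -> input entities
    was_generated_by: dict = field(default_factory=dict)     # entity -> activity
    was_associated_with: dict = field(default_factory=dict)  # activity -> agents

    def record_activity(self, activity, inputs, outputs, agents):
        """Store one processing step and the relations around it."""
        self.used[activity] = list(inputs)
        self.was_associated_with[activity] = list(agents)
        for entity in outputs:
            self.was_generated_by[entity] = activity

    def lineage(self, entity):
        """Trace back from an entity, returning (entity, activity, agents)
        for each recorded step, most recent first."""
        steps, seen, frontier = [], set(), [entity]
        while frontier:
            current = frontier.pop(0)
            if current in seen:
                continue
            seen.add(current)
            activity = self.was_generated_by.get(current)
            if activity is None:
                continue  # a raw input with no recorded origin
            steps.append((current, activity, self.was_associated_with[activity]))
            frontier.extend(self.used[activity])
        return steps


# Hypothetical two-step workflow recorded at the bench.
prov = ProvRecord()
prov.record_activity("clean", ["raw.csv"], ["clean.csv"], ["alice"])
prov.record_activity("analyse", ["clean.csv"], ["result.json"], ["bob"])
for step in prov.lineage("result.json"):
    print(step)
```

Asking for the lineage of the final result walks back through each activity to the raw input, which is the trace a reader needs to verify how the result was produced.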

Identifying the work itself: activity identifiers

If the active phase is to be connected to the rest of the research landscape, the work itself needs to be identifiable. Persistent identifiers have transformed how we refer to outputs and people; the same logic is now being applied to research activities. RAiD (the Research Activity Identifier) is a persistent identifier for research projects and activities, providing a stable handle for the work itself — not just its eventual outputs. With an activity identifier, the records captured in an ELN, the data produced, the people involved and the resulting publications can all be tied to a single, persistent identity for the project. The whole arc of a piece of research — from the work as it happens to the products it yields — can then be traced as a connected whole rather than a set of disconnected fragments.

A consistent vocabulary across the lifecycle

For records created at the bench to connect with everything downstream — data repositories, CRIS platforms, publications — the elements they contain must mean the same thing everywhere: what a protocol, a sample, an instrument or an activity denotes. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the record captured in an electronic lab notebook is understood identically wherever it flows. And because the work recorded there — investigation, data curation, methodology — is genuine contribution, it can be described in the same framework used for every output, the CRediT taxonomy and its full set of contribution roles. The electronic lab notebook brings the most hands-on phase of research into the connected record; structured record-keeping, provenance and activity identifiers let that phase take its rightful place in the story of how knowledge is made.

June 15, 2026
