Tag: FAIR data

  • Citing Data Properly: The Joint Declaration of Data Citation Principles

    For decades, the data underpinning a study lived in a footnote, an appendix, or nowhere visible at all. A reader who wanted to inspect, reuse, or build on those data had little to go on. As research has become more data-intensive, that omission has grown harder to justify. The Joint Declaration of Data Citation Principles, published through FORCE11 in 2014, was a deliberate attempt to fix it by treating datasets as legitimate, citable research outputs in their own right.

    Why data citation matters

    Citing data is not merely good manners. It serves the same purposes as citing the literature: it credits the people who produced the work, it lets readers verify claims, and it builds a traceable record of how knowledge accumulates. When a dataset is cited formally, the citation can be counted, indexed, and linked, which means the often-considerable labour of collecting, cleaning, and documenting data becomes visible and rewardable. This connects directly to broader efforts in FAIR data, where the goal is for data to be findable, accessible, interoperable, and reusable.

    The eight principles

    The Declaration is built around eight principles that, taken together, describe what responsible data citation looks like:

    • Importance. Data should be considered legitimate, citable products of research, deserving the same status as publications.
    • Credit and attribution. Citations should give scholarly credit and normative and legal attribution to everyone who contributed to the data.
    • Evidence. Where a claim rests on data, the corresponding data should be cited.
    • Unique identification. Citations should include a persistent, machine-actionable, globally unique identifier.
    • Access. Citations should make it possible to reach the data themselves and their associated metadata and documentation.
    • Persistence. Identifiers and metadata should persist even beyond the lifespan of the data they describe.
    • Specificity and verifiability. Citations should allow a precise version and subset of the data to be identified.
    • Interoperability and flexibility. Citation methods should work across communities while accommodating disciplinary differences.

    These principles are intentionally technology-neutral. They do not mandate a single repository or identifier scheme; they describe outcomes that any sound practice should achieve.

    How to cite a dataset in practice

    A well-formed data citation looks much like a reference to an article, but with a few additions. At minimum it should carry the creator or creators, the year of publication, the title of the dataset, the publisher or repository, a version where one exists, and a persistent identifier. In most cases that identifier is a DataCite DOI, resolvable to a landing page that describes the dataset and points to the files. A typical reference takes the shape: Creator(s) (Year): Title. Version. Publisher. Dataset. DOI.
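
    The reference shape above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the DatasetRecord class, the format_data_citation function and every value in the example (including the DOI 10.1234/example) are hypothetical placeholders, and a real reference list would follow the punctuation of the relevant citation style.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical container for the citation elements listed in the text."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str
    version: Optional[str] = None


def format_data_citation(record: DatasetRecord) -> str:
    """Render 'Creator(s) (Year): Title. Version. Publisher. Dataset. DOI.'"""
    creators = "; ".join(record.creators)
    parts = [f"{creators} ({record.year}): {record.title}."]
    if record.version:  # a version is included only where one exists
        parts.append(f"Version {record.version}.")
    parts.append(f"{record.publisher}.")
    parts.append("Dataset.")
    # Express the DOI as a resolvable https://doi.org/ link.
    parts.append(f"https://doi.org/{record.doi}")
    return " ".join(parts)


# Placeholder values only; 10.1234/example is not a real dataset DOI.
example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2024,
    title="Example survey responses",
    publisher="Example Repository",
    doi="10.1234/example",
    version="1.2",
)
print(format_data_citation(example))
```

    Keeping the version as an optional field mirrors the "where one exists" wording: an unversioned dataset still yields a complete reference, while a versioned one always records which release was used.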

    Two details repay attention. First, versioning is not optional for datasets that change over time. Citing the specific version used means a future reader can reproduce exactly what was analysed, rather than a later, possibly different, release. Second, the identifier should appear in the reference list, not merely in the running text. Burying a dataset DOI in a sentence keeps it out of the indexing and counting systems that make citation meaningful in the first place.

    DataCite DOIs and the reference list

    DataCite was established precisely to assign DOIs to research data and to maintain the metadata that makes those DOIs useful. When a repository mints a DataCite DOI for a dataset, it registers structured metadata describing the creators, title, publication year, resource type, and related identifiers. That metadata is what allows discovery services and reference managers to handle data citations the way they handle article citations. Placing the DOI in the reference list, formatted to the relevant style, lets indexing infrastructure pick it up and attribute it correctly.
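
    To make the idea of structured metadata concrete, the sketch below checks a record for the elements named above before a DOI would be requested. The field names are illustrative assumptions chosen to mirror the prose; they are not the property names of the DataCite Metadata Schema, and missing_fields is a hypothetical helper.

```python
# Field names are illustrative, chosen to mirror the elements named in the text;
# they are not the property names of the DataCite Metadata Schema.
REQUIRED_FIELDS = ("creators", "title", "publication_year", "resource_type")


def missing_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


record = {
    "creators": ["Doe, J."],
    "title": "Example survey responses",
    "publication_year": 2024,
    "resource_type": "Dataset",
    "related_identifiers": [],  # optional here; links to articles or other datasets
}

incomplete = {"title": "Untitled upload"}

print(missing_fields(record))      # []
print(missing_fields(incomplete))  # ['creators', 'publication_year', 'resource_type']
```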

    Data availability statements close the loop

    Many publishers now require a data availability statement, a short passage telling readers where the underlying data can be found and under what conditions. Done well, the statement names the repository and gives the persistent identifier, linking the prose of the article to the formal citation in the reference list. Done poorly, it says only that data are available on request, which research has repeatedly shown to be an unreliable route to access. A good availability statement and a properly formatted data citation are two halves of the same commitment: that the evidence behind a study can actually be found and reused.

    Bringing it together

    The Joint Declaration did not invent the idea that data deserve credit, but it gave the community a shared, citable reference point. The practical implications are modest and achievable: assign a persistent identifier, capture the version, put the citation in the reference list, and write a data availability statement that points to it. Standards bodies and metadata schemas, including the work catalogued in the CASRAI data dictionary and contributor frameworks such as CRediT, supply the surrounding vocabulary for describing who did what. The principles themselves are a reminder that data are not a by-product of research but, increasingly, one of its most valuable outputs.

  • Big Data and the Vs of Data Explained for Research

    Big data refers to datasets so large, fast-moving or varied that traditional database tools cannot capture, store or analyse them within a reasonable time. It is defined less by an exact size threshold than by a set of characteristics, usually summarised as the “Vs”, and by the distributed computing methods needed to process it. In research, big data spans genomics, sensor networks, clinical records, social media and large-scale simulations.

    The defining Vs of big data

    The concept began with three Vs and has since expanded. The table below sets out the five most widely cited.

    Characteristic | Meaning | Research example
    Volume | The sheer quantity of data, from terabytes to petabytes and beyond | Whole-genome sequencing across cohorts
    Velocity | The speed at which data is generated and must be processed | Real-time readings from environmental sensors
    Variety | The mix of formats: structured, semi-structured and unstructured | Combining tables, images, text and audio
    Veracity | The trustworthiness, accuracy and completeness of the data | Cleaning noisy or missing clinical records
    Value | The usefulness of insights that can be extracted | Identifying disease risk factors at scale

    Volume, velocity and variety were the original three, capturing the scale, speed and heterogeneity that overwhelm conventional tools. Veracity was added to stress that more data is not automatically better data; noise, bias and gaps must be managed. Value reminds us that the point of all this effort is actionable insight, not collection for its own sake.

    Distributed processing: how big data is handled

    No single machine can hold or analyse a petabyte efficiently, so big data relies on distributed processing: spreading storage and computation across clusters of many machines that work in parallel. The foundational pattern was MapReduce, which splits a task into pieces, processes them across nodes, then combines the results. Frameworks such as Apache Hadoop and, later, Apache Spark made this approach mainstream, with Spark adding in-memory processing for far greater speed. Cloud platforms now offer this elasticity on demand, letting researchers scale resources to the dataset rather than the other way round.
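
    The split, process and combine pattern described above can be illustrated on a single machine. The sketch below is a toy word count in plain Python, not Hadoop or Spark code: the function names are hypothetical, and the built-in map() stands in for work that a cluster would spread across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks: list[str]) -> dict[str, int]:
    # On a real cluster each chunk would be mapped on a different node;
    # here the built-in map() stands in for that parallel step.
    mapped = chain.from_iterable(map(map_phase, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


chunks = ["big data big clusters", "data moves fast", "big results"]
print(word_count(chunks))
```

    Because each chunk is mapped independently and each key is reduced independently, both phases can run in parallel, which is what lets the pattern scale beyond one machine.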

    Big data in research, and its pitfalls

    Used well, big data lets researchers detect patterns invisible at small scale, model complex systems and test hypotheses across enormous samples. But scale brings risks. Large datasets can be biased or unrepresentative despite their size, and the volume can lull analysts into ignoring how the data was collected. Crucially, big data does not suspend statistical thinking: with millions of observations, almost any non-zero difference, however trivial, becomes statistically significant, which is exactly why effect size matters more than ever, and why a small p-value on its own means little. Big data also fuels machine learning, where larger samples help guard against the overfitting that plagues models trained on too little data.
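
    The point about significance and effect size can be checked with a short calculation. The sketch below assumes a two-sided z-test with equal group sizes and a known, shared standard deviation; the function name and all numbers are illustrative, not drawn from any study.

```python
import math


def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """Two-sided z-test for a difference in means, assuming equal known SD
    and equal group sizes. Returns (p_value, cohens_d)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Cohen's d: the difference expressed in standard-deviation units.
    cohens_d = (mean_a - mean_b) / sd
    return p_value, cohens_d


for n in (100, 10_000, 1_000_000):
    p, d = two_sample_z(mean_a=100.1, mean_b=100.0, sd=10.0, n_per_group=n)
    print(f"n per group = {n:>9,}  d = {d:.3f}  p = {p:.3g}")
```

    The effect size stays fixed at about one hundredth of a standard deviation in every row; only the p-value changes, falling far below conventional thresholds once each group reaches a million observations.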

    Big data and FAIR principles

    The promise of big data depends on the data being usable, and that is where the FAIR principles, that data should be Findable, Accessible, Interoperable and Reusable, become essential. Findability requires rich metadata and persistent identifiers. Interoperability requires shared vocabularies, the kind standardised in the CASRAI dictionary, so that varied sources can be combined meaningfully. Reusability requires clear provenance and licensing. Without these foundations, a large dataset is merely a large liability. Our broader work on standards and metadata, including our guidance for authors and our reproducibility coverage, sets out how to make big research data dependable rather than just big.

    Frequently asked questions

    How big does data have to be to count as big data?

    There is no fixed size. Big data is defined by characteristics, the Vs, rather than a threshold. The practical test is whether traditional tools struggle to store or process it within a useful timeframe.

    What are the original three Vs?

    Volume, velocity and variety: the scale of the data, the speed at which it arrives, and the diversity of its formats. Veracity and value were added later to address quality and usefulness.

    Why is veracity important?

    Because size does not guarantee quality. Large datasets can contain errors, bias, duplicates and missing values. Veracity emphasises assessing and improving trustworthiness before drawing conclusions.

    How does big data relate to FAIR data?

    FAIR principles make big data usable by ensuring it is Findable, Accessible, Interoperable and Reusable. Shared vocabularies and persistent identifiers, such as those in the CASRAI dictionary, let varied large datasets be combined and reused reliably.

  • Digital sustainability: the environmental cost of data storage and preservation

    The instinct in modern research is to keep everything. Storage is cheap, deletion feels risky, and the principles of openness and reproducibility seem to counsel retaining as much as possible for as long as possible. But this instinct conceals a real and growing cost. Storing data, running computations and preserving digital material for the long term all consume energy, and energy carries a carbon footprint. The cloud is not a weightless abstraction; it is data centres drawing power and demanding cooling, somewhere, continuously. As research becomes ever more data-intensive, the environmental cost of its digital life — storage, computation, preservation — can no longer be treated as invisible. Digital sustainability is the discipline of taking that cost seriously, and it is the subject of this article, which draws on the sustainable-research domain of the CASRAI Dictionary.

    The hidden cost of keeping everything

    The first thing digital sustainability asks us to see is that “keep it just in case” is not a cost-free default. Every dataset retained indefinitely occupies storage that must be powered, cooled, maintained, migrated to new media over time, and backed up — and the aggregate of countless such decisions across the research system is substantial. There is a real tension here with the open-data ideal. The drive to make data findable and reusable is valuable, but it can shade into digital hoarding: keeping vast quantities of low-value data on the vague principle that more is always better, without asking whether a dataset is worth its ongoing cost. The FAIR principles call for data to be findable and reusable — not for everything to be kept forever regardless of value. Distinguishing data worth preserving from data that need not be is itself an act of stewardship, not a betrayal of openness.

    Appraisal and data minimisation

    The practices that respond to this are appraisal and data minimisation. Appraisal — long established in the archival and records-management traditions — is the disciplined process of deciding what to keep, for how long, and what may responsibly be discarded, based on enduring value rather than reflex. Data minimisation, familiar also from data protection, is the principle of collecting and retaining only what is genuinely needed. Applied to research, these practices mean making conscious decisions: which raw data must be preserved to support published results and which intermediate files can be regenerated if ever needed; which datasets have lasting reuse value and which were transient. This is not an argument for carelessly deleting valuable data — the cost of losing irreplaceable data far exceeds the cost of storing it. It is an argument for deciding, deliberately and well, rather than defaulting to indiscriminate retention. Good appraisal keeps what matters and lets go of what does not, serving both sustainability and the long-term usability of the record.

    Green software and computation

    Storage is only part of the picture; computation has its own footprint. The green software movement, advanced by organisations such as the Green Software Foundation, aims to reduce the environmental impact of software itself. A central concept is Software Carbon Intensity (SCI), a specification for measuring the carbon emissions associated with running software, so that the impact can be quantified, compared and reduced rather than guessed at. For research, the principles translate into practical questions: is a computation less efficient than it could be; is it run repeatedly when results could be cached; is the workload run where and when the energy is cleaner? Efficient, well-considered computation is not only cheaper and faster but less carbon-intensive, and measuring impact, as SCI encourages, is the precondition for managing it.
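
    As a rough illustration of what measuring impact involves, the SCI specification expresses a score as ((E × I) + M) per R: energy consumed (E) multiplied by the carbon intensity of the electricity used (I), plus embodied hardware emissions (M), divided by a functional unit (R) such as the number of analyses run. The sketch below applies that formula; the function name, parameter names and all figures are assumptions made for the example, not measurements.

```python
def software_carbon_intensity(energy_kwh, grid_intensity_g_per_kwh,
                              embodied_g, functional_units):
    """SCI = ((E * I) + M) per R, in gCO2eq per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * grid_intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units


# Illustrative numbers only, not measurements: the same job of 1,000 analyses
# run on a carbon-intensive grid and on a cleaner one.
same_job_dirty_grid = software_carbon_intensity(12.0, 450.0, 600.0, 1000)
same_job_clean_grid = software_carbon_intensity(12.0, 50.0, 600.0, 1000)
print(same_job_dirty_grid)  # 6.0
print(same_job_clean_grid)  # 1.2
```

    Holding the workload constant and changing only the grid intensity shows why the question of where and when a computation runs matters as much as how efficient the code is.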

    Preservation that lasts: OAIS

    Sustainability is not only about using less; it is also about preserving well, so that what is kept genuinely endures and the energy spent keeping it is not wasted. The reference model for long-term digital preservation is OAIS — the Open Archival Information System reference model — which provides a framework for what a trustworthy digital archive must do to preserve information over the long term and keep it accessible and understandable to future users. OAIS matters to digital sustainability in two ways. First, preservation is itself an ongoing activity with an environmental cost, and doing it according to a sound model means that cost buys real durability rather than slow decay. Second, preserving fewer things well — properly described, in sustainable formats, in a trustworthy archive — is far better, environmentally and intellectually, than preserving many things badly, where data accumulates and yet quietly becomes unusable through neglect. Good preservation and disciplined appraisal are two sides of the same sustainable practice.

    Sustainability and FAIR, properly understood

    None of this is in conflict with FAIR or with open research, properly understood. FAIR is about good stewardship — making the data that is worth keeping findable, accessible, interoperable and reusable — not about hoarding. A sustainable approach is, in fact, a more honest expression of FAIR: it concentrates effort on the data that genuinely merits it, rather than spreading thin attention and real resources across everything indiscriminately. Sustainability and good data stewardship point in the same direction: keep what matters, describe it well, preserve it properly, and let go of what does not earn its keep.

    A consistent vocabulary for digital sustainability

    For sustainable practice to be applied consistently — across repositories, institutions and funders — the concepts involved, such as retention periods, appraisal decisions, preservation levels and format requirements, must be described in ways that mean the same thing everywhere. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that decisions about what to keep, how to preserve it and for how long are understood the same way wherever they are recorded. And because appraising, curating and preserving data well is genuine, skilled work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. The most sustainable digital research is not the research that stores the least, but the research that decides most carefully what is worth keeping — and then keeps it well.

  • Genomic Data-Sharing Standards: GA4GH and Responsible Access Explained

    Genomic data sharing is the responsible exchange of genetic data between researchers and repositories using common standards for file formats, metadata, consent and access control. Because genetic data is sensitive and richly structured, sharing it usefully depends on agreed technical standards and clear governance rather than ad-hoc file transfers.

    This article describes how genetic and genomic data is shared from a data-standards and governance perspective. It is not clinical genetics advice; the focus throughout is notation, metadata, interoperability and access frameworks.

    The Global Alliance for Genomics and Health

    The Global Alliance for Genomics and Health (GA4GH) is an international standards organisation that develops frameworks and technical specifications to enable responsible genomic data sharing. Its work spans both governance — such as consent and data-access policy frameworks — and technical interoperability standards that allow systems to exchange genomic data and query it consistently.

    The value of a shared standards body is that institutions in different countries can align on common interfaces and metadata conventions, so a dataset described and stored according to GA4GH-aligned conventions can be discovered and accessed by authorised researchers elsewhere. Controlled vocabularies underpinning these descriptions are the kind of structured terms recorded in the CASRAI dictionary.

    FAIR principles in a genomics context

    Genomic data sharing is closely aligned with the FAIR principles: data should be findable, accessible, interoperable and reusable. In genomics, “accessible” does not mean open to everyone; it means accessible under clearly defined and machine-readable conditions, which often include authorisation and consent checks.

    FAIR principle | Genomics interpretation
    Findable | Datasets carry persistent identifiers and rich, searchable metadata
    Accessible | Access is defined by clear, often controlled, machine-readable conditions
    Interoperable | Standard formats and shared vocabularies allow systems to exchange data
    Reusable | Consent terms, provenance and licensing are documented for re-analysis

    Consent, controlled access and data archives

    Much genetic data is held in controlled-access archives rather than fully open repositories. Under this model, descriptive metadata may be openly browsable while the underlying genetic data is released only to researchers whose project and credentials have been reviewed and approved by a data-access committee.

    Consent is the cornerstone of this governance. The terms under which data was originally collected determine how it may later be shared and reused, so consent metadata must travel with the data. This makes documented provenance — who collected the data, under what consent, and with what permitted uses — an essential part of responsible sharing.
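
    One way to picture consent metadata travelling with the data is as a small machine-readable record that is checked before release. The sketch below is purely illustrative: the ConsentRecord and AccessRequest classes, their fields and the use labels are hypothetical and do not represent a GA4GH specification or any archive's actual workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical provenance record that travels with a dataset."""
    collected_by: str
    permitted_uses: frozenset[str]
    requires_committee_approval: bool = True


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical request submitted to a data-access committee."""
    researcher: str
    intended_use: str
    committee_approved: bool = False


def may_release(consent: ConsentRecord, request: AccessRequest) -> bool:
    """Release data only when the use is consented to and, where required,
    the request has been approved by the committee."""
    if request.intended_use not in consent.permitted_uses:
        return False
    if consent.requires_committee_approval and not request.committee_approved:
        return False
    return True


consent = ConsentRecord(
    collected_by="Example Cohort Study",
    permitted_uses=frozenset({"disease-specific research"}),
)
approved = AccessRequest("Researcher A", "disease-specific research", True)
out_of_scope = AccessRequest("Researcher B", "commercial use", True)
print(may_release(consent, approved))      # True
print(may_release(consent, out_of_scope))  # False
```

    The second request is refused even though a committee approved it, because the intended use falls outside the original consent; the consent terms, not the approval alone, set the outer limit on reuse.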

    File and metadata formats

    Interoperability in genomics rests on standardised file formats for sequence reads and variants, paired with structured metadata describing the sample, the experiment and the access conditions. Consistent formats let independent groups validate, re-align and re-analyse data, supporting the goals discussed across our reproducibility coverage. Persistent identifiers tie datasets to their originating studies and contributors, as explained in our note on persistent identifiers in 2026.

    The same emphasis on stable identifiers and structured notation appears when recording protein information; see our companion guide on amino acids and protein data notation. For broader context, browse our data-infrastructure news and the guidance for authors on describing datasets.

    Frequently asked questions

    What is GA4GH?

    The Global Alliance for Genomics and Health is an international standards organisation that develops governance frameworks and technical specifications to enable responsible genomic data sharing across institutions and borders.

    Does sharing genomic data mean making it openly available to everyone?

    No. Responsible sharing usually means controlled access: descriptive metadata may be browsable, but the underlying genetic data is released only to authorised researchers whose projects and credentials have been reviewed and approved.

    How do FAIR principles apply to genetics data?

    FAIR principles require genetic data to be findable through persistent identifiers and metadata, accessible under clearly defined conditions, interoperable through standard formats, and reusable with documented consent, provenance and licensing.

    Why does consent metadata matter for data sharing?

    Consent determines the permitted uses of data. Because those terms govern future reuse, consent and provenance information must accompany the data so that downstream researchers only use it within the agreed conditions.

  • Amino Acids: Notation, Protein Data and How Sequences Are Recorded

    Amino acids are small organic molecules that join together in chains to build proteins, and the 20 standard amino acids form the common alphabet used to write and share protein sequence data. Each amino acid carries a standard one-letter and three-letter abbreviation, giving researchers an unambiguous notation for recording sequences in databases, publications and data-exchange formats.

    From a data-infrastructure perspective, amino acids matter less as chemistry and more as a controlled vocabulary: a fixed set of symbols that lets sequence information move reliably between laboratories, repositories and software tools without loss of meaning.

    The 20 standard amino acids and their notation

    Proteins are built from 20 standard amino acids, each of which has a residue name, a three-letter code and a single-letter code. The single-letter codes are the backbone of compact sequence notation, allowing a protein of several hundred residues to be written as one continuous string of letters.

    Amino acid | Three-letter | One-letter
    Alanine | Ala | A
    Glycine | Gly | G
    Leucine | Leu | L
    Serine | Ser | S
    Tryptophan | Trp | W
    Tyrosine | Tyr | Y

    The full set of 20 comprises alanine, arginine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine and valine. Standardised codes mean that a sequence recorded in one system is read identically in another, which is the foundation of interoperable protein data. Consistent notation of this kind is exactly the type of controlled term documented in the CASRAI dictionary.
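
    Because the alphabet is fixed, converting between the two notations is a simple lookup. The sketch below maps the three-letter codes of the 20 standard amino acids to their one-letter codes; the to_one_letter helper is a hypothetical name, and non-standard residues are deliberately rejected rather than guessed.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues: list[str]) -> str:
    """Convert a list of three-letter codes to a one-letter sequence string."""
    try:
        # capitalize() lets 'ALA', 'ala' and 'Ala' all match the table.
        return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)
    except KeyError as err:
        raise ValueError(f"Not a standard amino-acid code: {err.args[0]}") from None


print(to_one_letter(["Met", "Ala", "Gly", "Trp", "Tyr"]))  # MAGWY
```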

    From sequence to structure: how protein data is recorded

    Protein data exists at two complementary levels. Sequence data describes the linear order of amino-acid residues, while structure data describes the three-dimensional arrangement of atoms once the chain has folded. Both layers need stable identifiers and agreed formats so that records remain findable and reusable over time.

    Sequence records are commonly written in FASTA format, a plain-text convention in which a header line beginning with “>” carries an identifier and the following lines hold the one-letter residue string. Structure records use atomic-coordinate formats, such as the PDB and PDBx/mmCIF file formats, capturing the position of each atom rather than only the residue order.
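
    The FASTA convention described above is simple enough to read with a few lines of code. The sketch below is a minimal parser for illustration, assuming well-formed input in which each header line starts with “>”; parse_fasta and the example records are hypothetical, and production work would normally rely on an established bioinformatics library.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {identifier: sequence}."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is taken as the first whitespace-delimited token.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            # Sequence lines are joined, since one record may span many lines.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up identifiers and sequences, for illustration only.
example = """>example_protein_1 hypothetical header text
MAGWYLSA
GLSY
>example_protein_2
MSLLA
"""
print(parse_fasta(example))
```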

    UniProt and the Protein Data Bank

    Two long-standing resources anchor protein data sharing. UniProt is a comprehensive, curated repository of protein sequence and functional information, assigning persistent accession identifiers to protein entries. The Protein Data Bank (PDB) is the established archive for experimentally determined three-dimensional structures of proteins and other biological macromolecules.

    Resource | Primary content | Identifier role
    UniProt | Protein sequences and functional annotation | Stable accession per protein entry
    Protein Data Bank (PDB) | 3D structural coordinates | Stable entry identifier per structure

    Both resources illustrate good practice for research data infrastructure: persistent identifiers, structured metadata and open access to underlying records. Linking a sequence accession to a structure entry creates a navigable web of evidence, much as persistent identifiers connect outputs across the wider scholarly record described in our overview of persistent identifiers in 2026.

    Why standard notation supports reproducibility

    Because the amino-acid alphabet is fixed and the abbreviations are standardised, protein data aligns naturally with the FAIR principles — findable, accessible, interoperable and reusable. A sequence written in standard one-letter notation can be searched, aligned and compared across repositories without manual reconciliation, and a structure deposited with rich metadata can be revisited by independent researchers. This connects protein data to the broader agenda covered in our reproducibility news, and to the related question of how genomic data is shared responsibly in genomic data-sharing standards explained. For practical guidance on citing and describing such resources, see our guidance for authors.

    Frequently asked questions

    How many standard amino acids are there?

    There are 20 standard amino acids that serve as the common building blocks of proteins. Each has an agreed three-letter and one-letter abbreviation, forming a fixed alphabet for recording and sharing sequence data.

    What is the difference between one-letter and three-letter amino-acid codes?

    Three-letter codes such as Ala or Gly are readable abbreviations often used in text and structural records, while one-letter codes such as A or G create compact sequence strings ideal for databases and alignment software. Both refer to the same residues.

    What do UniProt and the PDB store?

    UniProt stores curated protein sequence and functional information with stable accession identifiers, while the Protein Data Bank stores experimentally determined three-dimensional structures with their own persistent entry identifiers. Together they cover the sequence and structure layers of protein data.

    How do amino-acid standards support FAIR data?

    A fixed notation and well-described repository records make protein data findable, accessible, interoperable and reusable. Standard codes remove ambiguity, so sequences and structures can be exchanged and compared across systems without loss of meaning.

  • Te Mana Raraunga: Māori data sovereignty as a regional model for Indigenous data governance

    The global conversation about Indigenous data governance has, in recent years, found a powerful shared language in the CARE Principles for Indigenous Data Governance — Collective benefit, Authority to control, Responsibility and Ethics. CARE provides an internationally recognised frame, articulated by the Global Indigenous Data Alliance (GIDA), that positions Indigenous peoples’ rights and interests at the centre of how data about them is governed. But principles at the global level are realised in particular places, by particular peoples, grounded in particular relationships and legal traditions. One of the most developed expressions of Indigenous data sovereignty anywhere is Te Mana Raraunga, the Māori Data Sovereignty Network in Aotearoa New Zealand, whose work shows how a regional model can give CARE concrete meaning while standing on foundations all its own. This article examines that model, drawing on the Indigenous data and CARE domain of the CASRAI Dictionary.

    What Te Mana Raraunga is

    Te Mana Raraunga is a network advocating for Māori rights and interests in data — for Māori data sovereignty. The phrase te mana raraunga itself speaks to the authority and integrity that attach to data, and the network exists to assert that Māori, as the people to whom much of that data relates, have legitimate rights of governance over it. Its concerns span how Māori data is collected, who controls it, how it is used, and whether its use serves Māori aspirations or merely extracts from Māori communities. The network has been instrumental in defining what Māori data sovereignty means in practice and in pressing institutions to recognise and respect it. It represents not an abstract ideal but an organised, articulate movement with a developed body of principles.

    Grounded in Te Tiriti o Waitangi

    What distinguishes the Māori model, and makes it more than a local application of a global frame, is its foundation in Te Tiriti o Waitangi — the Treaty of Waitangi, the founding constitutional document of Aotearoa New Zealand. Te Tiriti establishes a relationship between Māori and the Crown and affirms Māori authority over their own affairs and treasured things. Māori data sovereignty draws directly on this: if Māori hold authority over their taonga (treasures) and their own domains, then data about Māori — their people, lands, language and knowledge — falls within the scope of that authority. This gives Māori data sovereignty a distinctive constitutional grounding that an appeal to general principle does not: the argument is not only ethical but rests on a recognised relationship and a foundational agreement, which lends the model particular force.

    How the regional model complements CARE

    The relationship between Te Mana Raraunga and the global CARE frame is complementary, not competitive, and understanding how illuminates Indigenous data governance more broadly.

    • CARE provides the shared language. Collective benefit, authority to control, responsibility and ethics give a vocabulary recognisable across borders and useful for engaging international infrastructures and institutions.
    • The regional model provides the grounding. Te Tiriti gives Māori data sovereignty a specific constitutional foundation and a specific people whose authority is being asserted, turning general principle into concrete, situated claim.
    • Each strengthens the other. The global frame lends regional movements international recognition and solidarity; regional models like the Māori one give the global principles tested, real-world expression and demonstrate that they can be operationalised.

    This is why Indigenous data sovereignty is best understood as a family of grounded movements connected by shared principles, rather than a single uniform doctrine.

    Distinct from other Indigenous data frameworks

    It is important not to flatten the diversity of Indigenous data governance into one model. The Māori approach is distinct, for example, from Canada’s widely cited OCAP principles — Ownership, Control, Access and Possession — developed by and for First Nations in a different constitutional and historical context. OCAP and Māori data sovereignty share a commitment to Indigenous authority over Indigenous data, but they arise from different peoples, different legal foundations and different histories, and they are expressed differently. Recognising this matters: Indigenous peoples are not interchangeable, and good practice does not lift a framework from one context and impose it on another. The right model is the one grounded in the rights, relationships and aspirations of the specific people concerned. The global CARE frame accommodates this diversity precisely because it sets principles rather than prescribing a single mechanism.

    CARE alongside FAIR

    Indigenous data sovereignty also reshapes how the familiar FAIR principles — Findable, Accessible, Interoperable, Reusable — are understood. FAIR is concerned with the technical qualities that make data useful and reusable, but it is largely silent on questions of power, consent and benefit. CARE, and grounded models like Te Mana Raraunga, supply exactly what FAIR leaves out: who decides, who benefits, and on whose authority data is used. The two are meant to operate together — data can be FAIR and CARE at once — but where they pull in different directions, the Māori model is clear that authority and collective benefit are not negotiable conveniences. Making data maximally open is not a virtue if it overrides the rights of the people the data concerns.

    A consistent vocabulary for grounded governance

    For Indigenous data governance to be respected across the systems that hold and share research data, the terms involved — consent conditions, governance authority, access and benefit arrangements, provenance — must be described in ways that carry their meaning faithfully wherever the data travels. That consistency is part of what the CASRAI Dictionary works towards: a shared vocabulary so that a governance condition asserted by a community is not quietly lost when its data moves between systems. And because stewarding Indigenous data and partnering with the communities it concerns is genuine, recognisable work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. Te Mana Raraunga shows what Indigenous data sovereignty looks like when it is grounded in a people’s own authority; the global principles show how such grounded models can speak to one another and to the world.

  • Licensing research data: CC-BY, CC0 and when to use each

    You can deposit a dataset in a trusted repository, describe it with rich metadata, and give it a DOI — and still leave it effectively unusable, because you forgot the one line that tells a reuser what they are allowed to do with it. A dataset without a clear licence is data nobody can confidently build on: a careful researcher, unsure of the terms, will simply not reuse it. Licensing is therefore not a legal afterthought but the part of the data-infrastructure domain that determines whether a deposit delivers the “R” in FAIR at all. This guide explains the main choices — principally CC0 and CC BY — and when each fits.

    Why a licence is the reusability switch

    The FAIR principles ask that data be Findable, Accessible, Interoperable, and Reusable — and reusability rests explicitly on data being “released with a clear and accessible data usage licence”. Without a licence, default copyright and database rights leave the legal status ambiguous, and ambiguity is fatal to reuse: a would-be user cannot tell whether combining your data with theirs, redistributing it, or building a tool on it is permitted. An explicit, standard, machine-readable licence resolves that uncertainty in advance, for everyone, without anyone having to ask. That is why “attach an explicit licence” is the step that turns a findable dataset into a reusable one.

    The two main choices for data

    CC0 — the public-domain dedication

    CC0 is a Creative Commons tool by which the rights-holder waives, to the fullest extent the law allows, all copyright and related rights in the work — placing it as close to the public domain as possible. For data, CC0 means a reuser can use, combine, modify, and redistribute the data with no conditions at all, including no obligation to attribute. This is widely recommended as the default for research data, and for a specific reason: data are routinely aggregated from many sources, and attribution requirements that stack up across hundreds of datasets (“attribution stacking”) can become legally and practically unworkable. CC0 removes that friction entirely and maximises interoperability. Several major data repositories and infrastructures apply CC0 by default for exactly this reason.

    Importantly, CC0 waives legal requirements, not scholarly norms. Citing the data you use remains an academic and ethical expectation regardless of the licence — CC0 simply means that expectation is enforced by the norms of good scholarship rather than by copyright law.

    CC BY — attribution required

    CC BY permits the same broad reuse — use, adaptation, redistribution, including commercially — but on the single condition that the original creator is credited. For data, CC BY is appropriate where attribution matters enough to be a legal condition, or where a funder or institution requires it. It is the most permissive of the conditional Creative Commons licences and is the default for many open-access publications. The trade-off relative to CC0 is precisely the attribution clause: it guarantees credit, but it reintroduces the attribution-stacking problem when many datasets are combined.
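    The attribution-stacking trade-off can be made concrete with a rough sketch. Everything here is illustrative: the 200 sources and their team names are invented, the function name `attribution_obligations` is hypothetical, and real CC BY attribution involves more than a creator name (for instance a licence notice and a link to the source), so this understates the burden rather than models it exactly.

```python
def attribution_obligations(sources):
    """Return the creators a merged dataset is legally required to credit.

    Simplification: only CC BY sources create a legal attribution duty here;
    CC0 sources are still cited by scholarly norm, but not by law.
    """
    return [s["creator"] for s in sources if s["licence"] == "CC-BY-4.0"]


# Invented example: 200 source datasets, every second one under CC BY.
sources = [
    {"creator": f"Team {i}", "licence": "CC-BY-4.0" if i % 2 == 0 else "CC0-1.0"}
    for i in range(1, 201)
]

required = attribution_obligations(sources)
print(len(required))  # number of credits the merged dataset must carry
```

    Each further aggregation inherits the whole list, which is why the burden grows with every merge.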

    Choosing between them

    • Prefer CC0 for data intended for the widest possible aggregation and reuse, especially where the data will be merged with many other sources. It maximises interoperability and removes legal friction; rely on citation norms for credit.
    • Choose CC BY where attribution must be a legal condition, where a funder or repository mandates it, or where the dataset is a discrete, citable product whose creators need enforceable credit.
    • Be cautious with more restrictive clauses. Non-commercial (NC) and No-Derivatives (ND) terms substantially limit reuse and can render data incompatible with other open data; they are generally discouraged for research data unless a specific ethical or legal constraint demands them.
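    The guidance above can be sketched as a small decision helper. This is an illustration under stated assumptions, not a substitute for checking funder and repository terms: the function name `suggest_data_licence` and its parameters are hypothetical, and the returned strings assume the 1.0 and 4.0 versions of the licences, which the guidance itself does not specify.

```python
from typing import Optional


def suggest_data_licence(
    attribution_must_be_legal: bool = False,
    mandated_by_funder_or_repository: bool = False,
    restriction_requested: bool = False,
) -> Optional[str]:
    """Suggest a licence identifier for a dataset, or None if review is needed."""
    if restriction_requested:
        # NC or ND terms are discouraged for research data; no automatic
        # suggestion is made, so the case goes to a human for review.
        return None
    if attribution_must_be_legal or mandated_by_funder_or_repository:
        return "CC-BY-4.0"
    # Default: widest aggregation and reuse, with credit left to citation norms.
    return "CC0-1.0"


print(suggest_data_licence())
print(suggest_data_licence(attribution_must_be_legal=True))
```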

    Data are not software: a critical caveat

    Creative Commons licences are designed for content (text, images, and data), and Creative Commons itself advises against using them for software. Software has needs that CC licences do not address: patent grants, the distinction between source and compiled code, and copyleft mechanics. For code, use a recognised software licence instead: a permissive one such as MIT, BSD, or Apache 2.0, or a copyleft one such as the GPL. If your deposit bundles a dataset and the code that processes it, license each part appropriately: a CC licence (or CC0) for the data, an OSI-approved software licence for the code. Conflating the two is one of the most common licensing mistakes in research deposits.

    A practical checklist

    1. Confirm you have the right to license the data. Check funder terms, any data-sharing agreements, third-party data within your dataset, and, for personal or sensitive data, consent and governance constraints. A licence cannot grant rights you do not hold.
    2. Default to CC0 for data unless there is a positive reason to require attribution; choose CC BY where there is.
    3. License software separately with an OSI-approved licence; never put code under a Creative Commons licence.
    4. State the licence explicitly in the deposit metadata and in any data availability statement, using a standard licence identifier (for example an SPDX identifier such as CC0-1.0 or CC-BY-4.0) so it is machine-readable.
    5. Cite the data you reuse regardless of its licence — the scholarly norm holds even when the law does not require it.
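    Steps 3 and 4 of the checklist can be sketched as a deposit record that licenses data and code separately. This is a minimal illustration, not any repository's actual schema: the function `build_deposit_metadata` and the field names are hypothetical, and the licence versions are assumed. The identifiers are SPDX short identifiers, and the URLs are the Creative Commons deed pages.

```python
import json

# Data licences: SPDX identifier mapped to the Creative Commons deed URL.
DATA_LICENCES = {
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
}
# A few OSI-approved software licences, as SPDX identifiers.
CODE_LICENCES = {"MIT", "BSD-3-Clause", "Apache-2.0", "GPL-3.0-or-later"}


def build_deposit_metadata(title, data_licence, code_licence=None):
    """Build a hypothetical deposit record with separate data and code licences."""
    if data_licence not in DATA_LICENCES:
        raise ValueError(f"unrecognised data licence: {data_licence}")
    record = {
        "title": title,
        "data": {
            "licence_id": data_licence,
            "licence_url": DATA_LICENCES[data_licence],
        },
    }
    if code_licence is not None:
        # Reject Creative Commons identifiers for code, per step 3.
        if code_licence not in CODE_LICENCES:
            raise ValueError(f"use a software licence for code, got: {code_licence}")
        record["code"] = {"licence_id": code_licence}
    return record


record = build_deposit_metadata("Example survey data", "CC0-1.0", code_licence="MIT")
print(json.dumps(record, indent=2))
```

    Keeping the two licence fields separate means a harvester can read the terms for the dataset without having to guess whether they also cover the scripts.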

    How this connects to contribution and credit

    Licensing answers “what may be done with this output?”; it is a sibling of the question “who made it?”, which the CRediT taxonomy answers. A dataset’s intellectual work is recorded on the associated paper through roles such as Data curation and Investigation, while the licence governs downstream reuse of the artefact itself. Used together — a clear licence on the data and clear contribution roles on the people — they ensure both the dataset and its creators are properly accounted for.

    Where shared vocabulary fits

    “CC0”, “CC BY”, “public domain”, “attribution”, and “reuse” are interpreted differently across repositories and funders, which undermines the very interoperability that licensing is meant to enable. A shared, federated vocabulary that defines these terms precisely — pointing back to Creative Commons for the licences and to the FAIR principles for the reusability requirement — is what lets a licence chosen for one repository be understood correctly in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

  • Electronic lab notebooks and structured record-keeping across the research lifecycle

    When we picture the scholarly record, we tend to think of its end products: the published paper, the deposited dataset, the citation. But each of those is the visible tip of a much larger body of work — the active, day-to-day conduct of research, where experiments are designed and run, samples processed, instruments operated and observations recorded. For generations this working phase was captured, if at all, in the paper laboratory notebook: a bound book on a bench, legible only to its author, locked in a drawer, and disconnected from everything else. An immense amount of crucial information about how research is actually done remained invisible to the wider record. The electronic lab notebook and the structured record-keeping practices around it are changing that. This article looks at how, drawing on the research-lifecycle domain of the CASRAI Dictionary.

    What an electronic lab notebook is

    An electronic lab notebook, or ELN, is software that replaces the paper notebook as the place where researchers record their day-to-day work: experiments, protocols, observations, results and the reasoning behind decisions. At its simplest, an ELN offers obvious practical advantages over paper — it is searchable, backed up, shareable, and resistant to the coffee stains and illegible handwriting that have plagued laboratory science forever. But its deeper significance is that it makes the working record digital and therefore connectable. A paper notebook is an island; an electronic one can be linked to the protocols it follows, the instruments and samples it references, the data files it produces and the people who did the work. The ELN is the point at which the active phase of research enters the connected world that the rest of the record already inhabits.

    Capturing the active phase as connected metadata

    This is the central idea: the ELN lets the active phase of research be captured as connected metadata rather than disappearing into a drawer. When work is recorded electronically and linked properly, a rich web of relationships can be built around it — this experiment used that protocol; it was performed by these people on that instrument; it consumed these samples and produced these data files; it belongs to this project and contributes to that publication. The working phase stops being a black box between the start of a project and its outputs, and becomes a documented, navigable part of the record. This matters for reproducibility, because others can see exactly how a result was produced; for collaboration, because the record is shared rather than siloed; and for integrity, because the chain from question to result is visible rather than reconstructed after the fact.

    FAIR principles for the working record

    The same FAIR principles — Findable, Accessible, Interoperable, Reusable — that govern published data apply, with equal force, to the records created during the active phase. An ELN that captures structured, well-described records makes the working record findable and reusable in a way a paper notebook never could be. The principle is that good data management should not begin at the moment of deposit, when a project ends, but should run through the entire lifecycle, starting at the bench. If records are created in a structured, connected form from the outset, preparing data for deposit becomes a matter of harvesting and tidying what already exists, rather than reconstructing it. Good record-keeping during the active phase is, in this sense, the foundation of good data management overall.

    Provenance: the PROV standard

    A particular strength of structured electronic record-keeping is its capacity to capture provenance — the record of how something came to be: what data was used, what processes acted on it, what agents (people, software, instruments) were involved, and in what order. Provenance is the basis of trust in a result, because it lets others trace exactly how that result was produced and verify each step. The PROV standard provides a formal, machine-readable model for expressing provenance — describing the entities, activities and agents in a process and the relationships between them — so that the chain of how a result was produced can be recorded consistently and understood across systems. An ELN that captures provenance in line with such a standard turns the working record into something far more powerful than a diary: a verifiable account of how knowledge was made.
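    The entity, activity and agent model can be sketched in plain Python, without any PROV library. The relation names `used`, `wasGeneratedBy` and `wasAssociatedWith` come from PROV; every identifier with the `ex:` prefix is invented for illustration, and the `trace` function is a hypothetical helper rather than part of the standard.

```python
# Entities, activities and agents, keyed by made-up "ex:" identifiers.
entities = {"ex:raw-readings", "ex:cleaned-table", "ex:figure-2"}
activities = {"ex:cleaning-run", "ex:plotting-run"}
agents = {"ex:researcher-a", "ex:plot-script"}

# Relations as (subject, relation, object) triples using PROV relation names.
relations = [
    ("ex:cleaning-run", "used", "ex:raw-readings"),
    ("ex:cleaned-table", "wasGeneratedBy", "ex:cleaning-run"),
    ("ex:cleaning-run", "wasAssociatedWith", "ex:researcher-a"),
    ("ex:plotting-run", "used", "ex:cleaned-table"),
    ("ex:figure-2", "wasGeneratedBy", "ex:plotting-run"),
    ("ex:plotting-run", "wasAssociatedWith", "ex:plot-script"),
]

# Sanity check: every relation refers only to declared nodes.
known = entities | activities | agents
assert all(s in known and o in known for s, _, o in relations)


def trace(entity):
    """Walk back from an entity through the activities that produced it."""
    steps = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for subj, rel, obj in relations:
            if subj == current and rel == "wasGeneratedBy":
                inputs = [o for s, r, o in relations if s == obj and r == "used"]
                who = [o for s, r, o in relations
                       if s == obj and r == "wasAssociatedWith"]
                steps.append({"output": current, "activity": obj,
                              "inputs": inputs, "agents": who})
                frontier.extend(inputs)
    return steps


for step in trace("ex:figure-2"):
    print(step)
```

    Walking back from the figure yields the plotting step and then the cleaning step, each with its inputs and responsible agent, which is the kind of step-by-step verification the paragraph describes.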

    Identifying the work itself: activity identifiers

    If the active phase is to be connected to the rest of the research landscape, the work itself needs to be identifiable. Persistent identifiers have transformed how we refer to outputs and people; the same logic is now being applied to research activities. RAiD (the Research Activity Identifier) is a persistent identifier for research projects and activities, providing a stable handle for the work itself — not just its eventual outputs. With an activity identifier, the records captured in an ELN, the data produced, the people involved and the resulting publications can all be tied to a single, persistent identity for the project. The whole arc of a piece of research — from the work as it happens to the products it yields — can then be traced as a connected whole rather than a set of disconnected fragments.
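    As a rough sketch of that idea, the records below each carry an activity identifier, and grouping on it recovers the whole arc of one project. The identifier strings are placeholders and do not follow the real RAiD format; the record kinds, field names and the `group_by_activity` helper are likewise hypothetical.

```python
from collections import defaultdict

ACTIVITY_ID = "activity:example-0001"  # placeholder, not a real RAiD

records = [
    {"kind": "eln_entry", "id": "eln:2024-03-01-a", "activity": ACTIVITY_ID},
    {"kind": "dataset", "id": "dataset:example-1", "activity": ACTIVITY_ID},
    {"kind": "publication", "id": "paper:example-1", "activity": ACTIVITY_ID},
    {"kind": "dataset", "id": "dataset:other", "activity": "activity:example-0002"},
]


def group_by_activity(items):
    """Group record ids by activity identifier, then by record kind."""
    grouped = defaultdict(lambda: defaultdict(list))
    for item in items:
        grouped[item["activity"]][item["kind"]].append(item["id"])
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {activity: dict(kinds) for activity, kinds in grouped.items()}


project = group_by_activity(records)[ACTIVITY_ID]
print(project)
```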

    A consistent vocabulary across the lifecycle

    For records created at the bench to connect with everything downstream — data repositories, CRIS platforms, publications — the elements they contain must mean the same thing everywhere: what a protocol, a sample, an instrument or an activity denotes. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the record captured in an electronic lab notebook is understood identically wherever it flows. And because the work recorded there — investigation, data curation, methodology — is genuine contribution, it can be described in the same framework used for every output, the CRediT taxonomy and its full set of contribution roles. The electronic lab notebook brings the most hands-on phase of research into the connected record; structured record-keeping, provenance and activity identifiers let that phase take its rightful place in the story of how knowledge is made.