Tag: digital preservation

  • Digital sustainability: the environmental cost of data storage and preservation

    The instinct in modern research is to keep everything. Storage is cheap, deletion feels risky, and the principles of openness and reproducibility seem to counsel retaining as much as possible for as long as possible. But this instinct conceals a real and growing cost. Storing data, running computations and preserving digital material for the long term all consume energy, and energy carries a carbon footprint. The cloud is not a weightless abstraction; it is data centres drawing power and demanding cooling, somewhere, continuously. As research becomes ever more data-intensive, the environmental cost of its digital life — storage, computation, preservation — can no longer be treated as invisible. Digital sustainability is the discipline of taking that cost seriously, and it is the subject of this article, which draws on the sustainable-research domain of the CASRAI Dictionary.

    The hidden cost of keeping everything

    The first thing digital sustainability asks us to see is that “keep it just in case” is not a cost-free default. Every dataset retained indefinitely occupies storage that must be powered, cooled, maintained, migrated to new media over time, and backed up — and the aggregate of countless such decisions across the research system is substantial. There is a real tension here with the open-data ideal. The drive to make data findable and reusable is valuable, but it can shade into digital hoarding: keeping vast quantities of low-value data on the vague principle that more is always better, without asking whether a dataset is worth its ongoing cost. The FAIR principles call for data to be findable and reusable — not for everything to be kept forever regardless of value. Distinguishing data worth preserving from data that need not be is itself an act of stewardship, not a betrayal of openness.

    Appraisal and data minimisation

    The practices that respond to this are appraisal and data minimisation. Appraisal — long established in the archival and records-management traditions — is the disciplined process of deciding what to keep, for how long, and what may responsibly be discarded, based on enduring value rather than reflex. Data minimisation, familiar also from data protection, is the principle of collecting and retaining only what is genuinely needed. Applied to research, these practices mean making conscious decisions: which raw data must be preserved to support published results and which intermediate files can be regenerated if ever needed; which datasets have lasting reuse value and which were transient. This is not an argument for carelessly deleting valuable data — the cost of losing irreplaceable data far exceeds the cost of storing it. It is an argument for deciding, deliberately and well, rather than defaulting to indiscriminate retention. Good appraisal keeps what matters and lets go of what does not, serving both sustainability and the long-term usability of the record.
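
    The appraisal questions above can be written down as explicit rules, which makes a retention decision repeatable and open to challenge. The sketch below is only an illustration: the `Dataset` fields, the `appraise` function and the three outcomes are hypothetical names invented for this example, not part of any standard or of the CASRAI Dictionary, and a real policy would weigh legal, ethical and disciplinary factors as well.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # All fields are hypothetical, chosen to mirror the questions in the text.
    name: str
    supports_published_result: bool  # needed to back a published finding
    regenerable: bool                # can be recreated from retained inputs and code
    reuse_value: bool                # judged to have lasting value for others


def appraise(ds: Dataset) -> str:
    """Return 'keep', 'discard' or 'review' under these illustrative rules."""
    if ds.supports_published_result and not ds.regenerable:
        return "keep"      # irreplaceable evidence for a published result
    if ds.reuse_value:
        return "keep"      # lasting value to other researchers
    if ds.regenerable:
        return "discard"   # can be rebuilt if it is ever needed again
    return "review"        # irreplaceable but of unclear value: a person decides


raw = Dataset("raw_survey_responses", True, False, True)
cache = Dataset("intermediate_cache", True, True, False)
pilot = Dataset("pilot_recordings", False, False, False)

for ds in (raw, cache, pilot):
    print(ds.name, "->", appraise(ds))
```

    Note the design choice in the final branch: anything that cannot be regenerated is never discarded automatically, which reflects the point that losing irreplaceable data costs more than storing it.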

    Green software and computation

    Storage is only part of the picture; computation has its own footprint. The green software movement, advanced by organisations such as the Green Software Foundation, aims to reduce the environmental impact of software itself. A central concept is Software Carbon Intensity (SCI), a specification for measuring the carbon emissions associated with running software, so that the impact can be quantified, compared and reduced rather than guessed at. For research, the principles translate into practical questions: is a computation less efficient than it could be; is it run repeatedly when results could be cached; is the workload run where and when the energy is cleaner? Efficient, well-considered computation is not only cheaper and faster but less carbon-intensive, and measuring impact, as SCI encourages, is the precondition for managing it.
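
    To make the idea of measurement concrete, the SCI specification expresses emissions as a rate: operational emissions (energy used multiplied by the carbon intensity of the electricity) plus embodied hardware emissions, divided by a functional unit such as one analysis run. The sketch below applies that formula. The function and parameter names are our own, and every figure is a made-up illustration rather than a measurement.

```python
def software_carbon_intensity(energy_kwh, grid_intensity_g_per_kwh,
                              embodied_g, functional_units):
    """Return SCI in gCO2e per functional unit: ((E * I) + M) / R."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * grid_intensity_g_per_kwh  # E * I
    return (operational_g + embodied_g) / functional_units  # (... + M) / R


# Hypothetical figures for a batch of 100 analysis runs (not measured values).
before = software_carbon_intensity(2.0, 400.0, 200.0, 100)
# The same batch, assuming cached results halve the energy used.
after = software_carbon_intensity(1.0, 400.0, 200.0, 100)

print(f"SCI before caching: {before:.1f} gCO2e per run")
print(f"SCI after caching:  {after:.1f} gCO2e per run")
```

    With these assumed inputs the score falls from 10.0 to 6.0 gCO2e per run. Because embodied emissions are unchanged, halving the energy does not halve the score, which is one reason to measure rather than guess.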

    Preservation that lasts: OAIS

    Sustainability is not only about using less; it is also about preserving well, so that what is kept genuinely endures and the energy spent keeping it is not wasted. The standard framework for long-term digital preservation is OAIS, the Open Archival Information System reference model, which sets out what a trustworthy digital archive must do to preserve information over the long term and keep it accessible and understandable to future users. OAIS matters to digital sustainability in two ways. First, preservation is itself an ongoing activity with an environmental cost, and doing it according to a sound model means that cost buys real durability rather than slow decay. Second, preserving fewer things well (properly described, in sustainable formats, in a trustworthy archive) is far better, environmentally and intellectually, than preserving many things badly, where data accumulates and yet quietly becomes unusable through neglect. Good preservation and disciplined appraisal are two sides of the same sustainable practice.

    Sustainability and FAIR, properly understood

    None of this is in conflict with FAIR or with open research, properly understood. FAIR is about good stewardship — making the data that is worth keeping findable, accessible, interoperable and reusable — not about hoarding. A sustainable approach is, in fact, a more honest expression of FAIR: it concentrates effort on the data that genuinely merits it, rather than spreading thin attention and real resources across everything indiscriminately. Sustainability and good data stewardship point in the same direction: keep what matters, describe it well, preserve it properly, and let go of what does not earn its keep.

    A consistent vocabulary for digital sustainability

    For sustainable practice to be applied consistently — across repositories, institutions and funders — the concepts involved, such as retention periods, appraisal decisions, preservation levels and format requirements, must be described in ways that mean the same thing everywhere. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that decisions about what to keep, how to preserve it and for how long are understood the same way wherever they are recorded. And because appraising, curating and preserving data well is genuine, skilled work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. The most sustainable digital research is not the research that stores the least, but the research that decides most carefully what is worth keeping — and then keeps it well.

  • Open science across the research lifecycle: from preregistration to preservation

    Open science is often encountered as a set of separate practices: a journal’s open-access policy, a funder’s data-sharing requirement, a colleague’s preregistered study. Treated piecemeal, each can feel like an isolated obligation. But open science is most powerful, and most coherent, when its practices are understood as connected stages in the arc of a single project — when openness runs through the whole research lifecycle rather than appearing only at the end. Seen this way, preregistration, open data, open access and preservation are not unrelated requirements but successive expressions of one principle: that research is more trustworthy, more useful and more cumulative when it is conducted in the open. This article traces openness across the lifecycle through the research lifecycle domain of the CASRAI Dictionary.

    A global framework: the UNESCO Recommendation

    That open science is a connected whole rather than a collection of separate practices is reflected in the most significant international statement on the subject: the UNESCO Recommendation on Open Science, adopted by UNESCO member states in 2021 as a shared global framework. It treats open science not as a single act of sharing but as an integrated set of practices and values (open access to publications, open research data, open-source software, open infrastructures, open engagement with society) underpinned by transparency, equity and inclusion. Its scope is the point: it frames openness as a culture spanning the entire research process, not a box ticked at publication, and provides a common reference for understanding open science as a coherent lifecycle.

    The beginning: preregistration

    Openness can begin before any data are collected. Preregistration is the practice of specifying a study’s hypotheses, methods and analysis plan in advance, and recording that plan in a way that cannot be quietly changed later. Its purpose is to strengthen the integrity of research by making clear what was planned before the results were known, which guards against practices such as reshaping hypotheses to fit the data or selectively reporting only what worked. A particularly developed form is the registered report, in which a study’s plan is peer-reviewed and accepted in principle before the results exist, so that publication depends on the quality of the question and method rather than on whether the findings turn out to be striking. Preregistration makes the research process transparent from the outset and sets the foundation for everything that follows.

    The middle: open and FAIR data

    As a project generates data, openness shifts to how that data is managed and shared. The widely adopted FAIR principles hold that data should be Findable, Accessible, Interoperable and Reusable — properties that let data be discovered, understood and built upon by others rather than locked away or lost. Making data FAIR, and as open as is responsible, transforms it from a private by-product of one study into a lasting resource for the community. This stage connects backwards and forwards: data shared openly allows the results derived from it to be checked, and it allows the data itself to feed new research it was never collected for. Openness in the middle of the lifecycle is what gives a project value beyond its own conclusions.

    The output: open access

    When findings are written up, openness turns to open access — making the resulting publications freely available to read rather than locked behind paywalls. It can be achieved through different routes, including publishing in open-access venues and depositing accepted manuscripts in repositories, but the principle is constant: research that anyone can read can be verified, used and built upon by the widest possible audience. Open access is the most visible face of open science, but within the lifecycle it is one stage among several. A paper that is open but rests on hidden data and an undisclosed plan is less open than it appears; open access is most meaningful when it sits atop preregistration and open data.

    The long term: preservation

    The lifecycle does not end at publication, because outputs that are open today are worthless tomorrow if they vanish. Digital preservation is the work of ensuring that data, publications, software and other outputs remain accessible, intact and usable over the long term, against the threats of format obsolescence, link rot, storage failure and institutional change. There is little point making research open if it cannot be found or opened a decade later. Trusted repositories, persistent identifiers and active preservation practices are what keep the open record open over time, closing the loop so that the openness built earlier actually endures.
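
    One routine way to check that preserved files remain intact is a fixity check: record a checksum for each file at deposit, then recompute and compare it later. The sketch below is a minimal version using SHA-256 from the Python standard library. The manifest layout (a mapping from relative path to expected checksum) and the function names are assumptions made for this example, not a prescribed format.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose checksum has changed."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

    A repository might run `failed_fixity` on a schedule and treat any non-empty result as a prompt to restore the affected files from a backup copy. A checksum detects corruption and loss only; it does nothing about format obsolescence, which calls for separate preservation planning.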

    The lifecycle as a connected whole

    The deeper point is that these stages reinforce one another. Preregistration makes the eventual open data and open publication more meaningful, because the plan they can be checked against is on record. Open data makes the open publication verifiable. Preservation makes all of it durable. Openness at one stage is weakened when another stage is missing: open access over undisclosed data, or open data with no preservation, falls short of the whole. This is why open science is best understood as a lifecycle rather than a checklist: its value is cumulative and connected, exactly the vision the UNESCO Recommendation articulates. Our learning resources explore each practice in more depth.

    A consistent vocabulary across the lifecycle

    For openness to connect across stages and systems, the information describing each stage must mean the same thing everywhere — the status of a preregistration, the access conditions of data, the licence on a publication, the preservation state of an output. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the open-science attributes of a project are understood identically across the systems that record them. And because contribution runs through every stage, the work done at each can be described in the same shared framework — the CRediT taxonomy and its full set of contribution roles. Open science is not a single act but a way of working across the whole life of a project; its power lies in the connection of its parts.
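
    In software terms, a shared vocabulary behaves like a controlled list of permitted values: a record that uses a term outside the list is rejected rather than silently reinterpreted. The sketch below illustrates this with Python enumerations. The term lists and the `OutputRecord` fields are hypothetical placeholders invented for the example; they are not taken from the CASRAI Dictionary.

```python
from dataclasses import dataclass
from enum import Enum


class AccessCondition(Enum):
    # Hypothetical terms, for illustration only.
    OPEN = "open"
    EMBARGOED = "embargoed"
    RESTRICTED = "restricted"


class PreservationState(Enum):
    # Hypothetical terms, for illustration only.
    NOT_DEPOSITED = "not-deposited"
    DEPOSITED = "deposited"
    ACTIVELY_PRESERVED = "actively-preserved"


@dataclass(frozen=True)
class OutputRecord:
    identifier: str
    access: AccessCondition
    preservation: PreservationState


def parse_record(raw: dict[str, str]) -> OutputRecord:
    """Build a record, raising ValueError for any term outside the vocabulary."""
    return OutputRecord(
        identifier=raw["identifier"],
        access=AccessCondition(raw["access"]),
        preservation=PreservationState(raw["preservation"]),
    )
```

    Two systems that both accept only these terms can exchange records without guessing whether "Open Access", "OA" and "open" mean the same thing; a free-text label such as "Open Access" fails at `parse_record` instead of being stored ambiguously.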

  • Repository certification: CoreTrustSeal and the markers of a trustworthy repository

    Suppose a researcher does everything right: they prepare their data carefully, document it well, and deposit it in a repository so that others can find and reuse it. They have discharged their responsibility — but only if the repository itself can be trusted to keep the data safe, accessible and intelligible for the long term. A repository that quietly disappears, loses files, lets metadata rot or cannot maintain access over time turns a careful deposit into a wasted effort. The question “is this repository trustworthy?” is therefore foundational to the whole edifice of open and FAIR data, and it deserves a more rigorous answer than a reassuring name and a working website. Repository certification provides that rigorous, auditable answer. This article examines what makes a repository trustworthy and how that is assessed, drawing on the data infrastructure domain of the CASRAI Dictionary.

    Why trust must be demonstrated, not assumed

    Trustworthiness in a repository is not a vague quality; it is a set of concrete capabilities and commitments that can be examined. Can the repository sustain itself financially and organisationally over the long term, or might it vanish when a grant ends? Does it have proper procedures for preserving files as formats and technologies change? Does it manage metadata, identifiers and access in ways that keep data findable and usable? Does it have the technical infrastructure to protect against loss and corruption? Because these are real, checkable properties, trustworthiness can be assessed against defined criteria rather than taken on faith. Certification exists precisely to make that assessment systematic: to let a repository demonstrate, and an outside party verify, that it meets recognised standards of good stewardship.

    CoreTrustSeal

    The most widely used certification in the research-data world is CoreTrustSeal. It offers a core-level certification for trustworthy data repositories, based on a set of requirements covering the organisational, technical and procedural dimensions of running a repository responsibly. A repository seeking the seal documents how it meets each requirement — covering matters such as its mission and continuity arrangements, its handling of data integrity and authenticity, its preservation planning, its arrangements for access and reuse, and its technical infrastructure — and submits this for peer review against the standard. The result is a certification that gives depositors, funders and reusers a credible signal that the repository operates to recognised standards. CoreTrustSeal’s strength is that it is community-based, internationally recognised and pitched at a level that is demanding yet achievable, making it a practical baseline for trustworthy repositories across disciplines.

    The TRUST Principles

    Complementing the formal certification are the TRUST Principles for digital repositories, which articulate, at the level of principle, what a trustworthy repository should embody. The acronym captures five qualities:

    • Transparency — being open about the repository’s terms, conditions and the extent of its services, so users know what to expect.
    • Responsibility — taking responsibility for the data held and for serving the user community, including stewardship and adherence to standards.
    • User focus — meeting the needs and standards of the community the repository serves.
    • Sustainability — ensuring the continuity of services and the preservation of data over the long term.
    • Technology — providing the infrastructure and capabilities needed to secure and preserve the data and serve its users.

    The TRUST Principles are a deliberate counterpart to the better-known FAIR principles: where FAIR describes properties the data should have, TRUST describes properties the repository should have. Data can only be reliably FAIR if it lives somewhere that is itself trustworthy, which is why the two sets of principles are best understood together.

    The standards landscape: nestor, ISO 16363 and OAIS

    CoreTrustSeal sits within a broader ecosystem of preservation standards, and understanding that ecosystem clarifies what certification rests upon. At the foundation is the OAIS reference model (the Open Archival Information System), a conceptual framework that defines the functions and responsibilities of a long-term digital archive: how information is ingested, stored, managed, preserved and made accessible over time. OAIS provides the shared mental model that much preservation practice and certification draws upon. Building on this foundation are more demanding certifications for repositories that need to demonstrate a higher level of assurance: the nestor Seal, an extended certification based on the German standard DIN 31644 for trustworthy digital archives, and ISO 16363, an international standard for the audit and certification of trustworthy digital repositories. Together these form a tiered landscape: CoreTrustSeal as an accessible core level, and nestor and ISO 16363 as more rigorous, resource-intensive options for archives requiring formal, audited assurance. A repository can choose the level appropriate to its role and resources.

    What certification means for researchers

    For a working researcher, this apparatus translates into a simple, practical piece of guidance: deposit data in a certified, trustworthy repository wherever possible. A certification such as CoreTrustSeal is a signal a depositor can rely on without auditing the repository themselves — evidence that the place receiving their data is run to recognised standards and is likely to keep that data safe and usable for the long term. It also helps satisfy the expectations of funders and journals, which increasingly ask that data be deposited in trustworthy repositories rather than just anywhere convenient. The wider expectations around data deposit and reuse are part of what we cover in our learning resources.

    A consistent vocabulary for trustworthy infrastructure

    For repository certification and trust signals to be meaningful across disciplines, funders and institutions, the concepts involved must be described consistently: what certification a repository holds, what preservation commitments it makes, and how those map to the requirements depositors and funders care about. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that information about repositories and their trustworthiness is understood the same way wherever it is recorded. And because depositing, curating and stewarding data are genuine contributions to research, they can be described in the same framework used for every other contribution: the CRediT taxonomy and its full set of contribution roles, data curation foremost among them. FAIR data needs a trustworthy home; certification is how we know a home deserves the name.