Tag: data appraisal

  • Digital sustainability: the environmental cost of data storage and preservation

    The instinct in modern research is to keep everything. Storage is cheap, deletion feels risky, and the principles of openness and reproducibility seem to counsel retaining as much as possible for as long as possible. But this instinct conceals a real and growing cost. Storing data, running computations and preserving digital material for the long term all consume energy, and energy carries a carbon footprint. The cloud is not a weightless abstraction; it is data centres drawing power and demanding cooling, somewhere, continuously. As research becomes ever more data-intensive, the environmental cost of its digital life — storage, computation, preservation — can no longer be treated as invisible. Digital sustainability is the discipline of taking that cost seriously, and it is the subject of this article, which draws on the sustainable-research domain of the CASRAI Dictionary.

    The hidden cost of keeping everything

    The first thing digital sustainability asks us to see is that “keep it just in case” is not a cost-free default. Every dataset retained indefinitely occupies storage that must be powered, cooled, maintained, migrated to new media over time, and backed up — and the aggregate of countless such decisions across the research system is substantial. There is a real tension here with the open-data ideal. The drive to make data findable and reusable is valuable, but it can shade into digital hoarding: keeping vast quantities of low-value data on the vague principle that more is always better, without asking whether a dataset is worth its ongoing cost. The FAIR principles call for data to be findable and reusable — not for everything to be kept forever regardless of value. Distinguishing data worth preserving from data that need not be is itself an act of stewardship, not a betrayal of openness.

    Appraisal and data minimisation

    The practices that respond to this are appraisal and data minimisation. Appraisal — long established in the archival and records-management traditions — is the disciplined process of deciding what to keep, for how long, and what may responsibly be discarded, based on enduring value rather than reflex. Data minimisation, familiar also from data protection, is the principle of collecting and retaining only what is genuinely needed. Applied to research, these practices mean making conscious decisions: which raw data must be preserved to support published results and which intermediate files can be regenerated if ever needed; which datasets have lasting reuse value and which were transient. This is not an argument for carelessly deleting valuable data — the cost of losing irreplaceable data far exceeds the cost of storing it. It is an argument for deciding, deliberately and well, rather than defaulting to indiscriminate retention. Good appraisal keeps what matters and lets go of what does not, serving both sustainability and the long-term usability of the record.
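
    The questions above can be written down as an explicit decision rule, so that appraisal is recorded rather than improvised. What follows is a minimal sketch in Python, not an established appraisal policy: the `Dataset` fields, the `appraise` function and the three outcomes are hypothetical names chosen for illustration, and a real policy would also weigh legal, ethical and disciplinary requirements.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    REVIEW = "review"      # refer to a person; never decided automatically
    DISCARD = "discard"


@dataclass
class Dataset:
    # All fields are illustrative assumptions, not a standard schema.
    name: str
    supports_publication: bool  # underpins published results
    regenerable: bool           # can be recreated from retained inputs and code
    reuse_value: bool           # judged to have lasting value to others


def appraise(ds: Dataset) -> Decision:
    """Apply a deliberately conservative, hypothetical appraisal rule."""
    # Data that supports published results is always preserved.
    if ds.supports_publication:
        return Decision.KEEP
    # Irreplaceable data is never discarded by rule: keep it if it has
    # reuse value, otherwise ask a person to decide.
    if not ds.regenerable:
        return Decision.KEEP if ds.reuse_value else Decision.REVIEW
    # Regenerable data with reuse value needs a judgement about whether
    # storing it is cheaper than recreating it; without reuse value it
    # is a candidate for disposal.
    return Decision.REVIEW if ds.reuse_value else Decision.DISCARD


if __name__ == "__main__":
    scratch = Dataset("intermediate-cache", False, True, False)
    print(scratch.name, appraise(scratch).value)
```

    The rule errs towards retention: only data that is both regenerable and without evident reuse value is discarded without a human decision, which reflects the point that losing irreplaceable data costs far more than storing it.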

    Green software and computation

    Storage is only part of the picture; computation has its own footprint. The green software movement, advanced by organisations such as the Green Software Foundation, aims to reduce the environmental impact of software itself. A central concept is Software Carbon Intensity (SCI), a specification for measuring the carbon emissions associated with running software, so that the impact can be quantified, compared and reduced rather than guessed at. For research, the principles translate into practical questions: is a computation less efficient than it could be; is it run repeatedly when results could be cached; is the workload run where and when the energy is cleaner? Efficient, well-considered computation is not only cheaper and faster but also less carbon-intensive, and measuring impact, as SCI encourages, is the precondition for managing it.
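
    As the SCI specification defines it, the score is a rate rather than a total: operational emissions (energy consumed multiplied by the carbon intensity of that energy) plus embodied hardware emissions, divided by a functional unit such as an analysis run or a user. In the specification's notation, SCI = ((E × I) + M) per R. The sketch below computes that rate in Python; the function name, parameter names and example figures are illustrative assumptions, not measurements.

```python
def software_carbon_intensity(
    energy_kwh: float,                # E: energy consumed by the software
    grid_intensity_g_per_kwh: float,  # I: carbon intensity of that energy
    embodied_g: float,                # M: embodied emissions attributed to the run
    functional_units: float,          # R: e.g. number of analysis runs
) -> float:
    """Return grams of CO2-equivalent per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * grid_intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units


if __name__ == "__main__":
    # Hypothetical figures chosen only to make the arithmetic easy to follow:
    # (2.0 kWh * 300 g/kWh + 100 g) / 10 runs = 70 g per run.
    print(software_carbon_intensity(2.0, 300.0, 100.0, 10))
```

    Because the score is expressed per functional unit, running the same workload when or where grid intensity is lower reduces the score even if the energy consumed is unchanged.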

    Preservation that lasts: OAIS

    Sustainability is not only about using less; it is also about preserving well, so that what is kept genuinely endures and the energy spent keeping it is not wasted. The reference model for long-term digital preservation is OAIS — the Open Archival Information System reference model — which provides a framework for what a trustworthy digital archive must do to preserve information over the long term and keep it accessible and understandable to future users. OAIS matters to digital sustainability in two ways. First, preservation is itself an ongoing activity with an environmental cost, and doing it according to a sound model means that cost buys real durability rather than slow decay. Second, preserving fewer things well — properly described, in sustainable formats, in a trustworthy archive — is far better, environmentally and intellectually, than preserving many things badly, where data accumulates and yet quietly becomes unusable through neglect. Good preservation and disciplined appraisal are two sides of the same sustainable practice.

    Sustainability and FAIR, properly understood

    None of this is in conflict with FAIR or with open research, properly understood. FAIR is about good stewardship — making the data that is worth keeping findable, accessible, interoperable and reusable — not about hoarding. A sustainable approach is, in fact, a more honest expression of FAIR: it concentrates effort on the data that genuinely merits it, rather than spreading thin attention and real resources across everything indiscriminately. Sustainability and good data stewardship point in the same direction: keep what matters, describe it well, preserve it properly, and let go of what does not earn its keep.

    A consistent vocabulary for digital sustainability

    For sustainable practice to be applied consistently — across repositories, institutions and funders — the concepts involved, such as retention periods, appraisal decisions, preservation levels and format requirements, must be described in ways that mean the same thing everywhere. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that decisions about what to keep, how to preserve it and for how long are understood the same way wherever they are recorded. And because appraising, curating and preserving data well is genuine, skilled work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. The most sustainable digital research is not the research that stores the least, but the research that decides most carefully what is worth keeping — and then keeps it well.

  • Data lifecycle management: the DCC Curation Lifecycle Model

    Research data is often treated as if it has only two moments that matter: when it is collected and when it is published. Everything in between is left to chance. Yet data that is well collected but poorly managed can become unusable within a few years: file formats fall out of support, the meaning of variables is forgotten, copies multiply and diverge, and the person who understood it moves on. Treating data as a thing to be looked after across its whole existence, rather than captured once and forgotten, is the essence of data lifecycle management. The most influential map of that lifecycle is the Digital Curation Centre’s Curation Lifecycle Model, which provides a structured way to think about the journey data takes — a journey at the heart of the research-lifecycle domain of the CASRAI Dictionary.

    Why curation is continuous

    The central insight of the lifecycle view is that curation is an active, continuous process, not a one-off task performed at the end. It is tempting to imagine that data can be generated freely and tidied up later. In practice, the decisions that determine whether data will survive and remain usable are made throughout: how it is structured and documented as it is created, how it is stored while in use, what is kept and what is discarded, and how it is prepared for the long term. Leaving all of this to the end means leaving it too late — documentation that was obvious at the time is forgotten, and choices that should have been deliberate are made by default. The Digital Curation Centre, a UK centre of expertise, developed its model precisely to make these activities visible and deliberate across the whole life of the data.

    The shape of the model

    The Curation Lifecycle Model is usually drawn as a series of concentric rings around the data at the centre. At its core sit the digital objects and databases being curated. Surrounding them are full lifecycle actions — activities that apply throughout, not at a single stage. These include description and representation information (the metadata and documentation that make data understandable), preservation planning, community watch and participation (keeping up with standards and tools), and the overarching work of curating and preserving. Around these run the sequential actions that the data passes through over time. The genius of the model is in holding both ideas at once: some curation work happens at particular moments in sequence, while other work — above all documentation and preservation planning — must be sustained continuously throughout.

    The sequence of actions

    The sequential part of the model traces data through its life:

    • Conceptualise. Plan how data will be created and managed before any of it exists: the planning that a data management plan captures.
    • Create or receive. Generate the data, or take it in, with the metadata and documentation it needs from the outset.
    • Appraise and select. Decide which data should be kept for the long term, judged against guidance and policy. Not everything need be preserved forever; deciding deliberately is itself curation.
    • Ingest. Transfer the selected data into a repository or archive that will look after it.
    • Preservation action. Take the steps that keep data usable over time — format migration, integrity checks and the rest.
    • Store. Keep the data securely and reliably.
    • Access, use and reuse. Make the data available to those entitled to it, for the purposes that justify keeping it.
    • Transform. Create new data from the original, which then re-enters the lifecycle in its own right.

    The model also includes occasional actions — reappraisal, migration and, where appropriate, disposal of data that should not be retained — acknowledging that curation involves honest decisions about what not to keep as well as what to preserve.
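
    The integrity checks mentioned under preservation action are commonly implemented as fixity checks: a checksum is recorded when the data is ingested and recomputed later to detect silent corruption or loss. The following Python sketch uses SHA-256 from the standard library. The manifest layout (relative path mapped to a hex digest) and the function names are assumptions made for illustration, not part of the DCC model.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose checksum differs.

    `manifest` maps relative file paths to the SHA-256 hex digest
    recorded at ingest (a hypothetical layout for this sketch).
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

    An empty list means every file listed in the manifest is present and unchanged; any path returned is a prompt for repair from a backup copy or for reappraisal.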

    Appraisal: the decision at the centre of curation

    Of all these stages, appraisal and selection deserves particular emphasis, because it is where lifecycle thinking departs most sharply from the instinct to keep everything. Storing data indefinitely is neither free nor harmless: it consumes resources, and a vast undifferentiated mass of poorly described data is hard to use. Appraisal is the disciplined judgement about what has lasting value (what should be preserved because it could be reused or verified, or because it is too costly to reproduce) and what can responsibly be let go. Making that judgement well, against clear policy, is one of the most professional acts in data management, and the lifecycle model puts it where it belongs: a deliberate decision point, not an accident of neglect.

    Preservation in service of reuse

    It is worth being clear about why all this effort is undertaken. The point of preservation is not to lock data away but to keep it usable, because the ultimate purpose of curation is reuse. Data that has been appraised, documented, preserved and made accessible can be verified by others, combined with new data, and built upon in ways its creators never anticipated. This is the payoff that justifies the whole lifecycle: well-curated data is an asset that keeps giving, while neglected data is a sunk cost that decays. The model makes the connection explicit by placing reuse alongside preservation, a reminder that curation serves a purpose beyond mere safekeeping.

    A consistent vocabulary across the lifecycle

    For data to move smoothly through these stages, across the tools, repositories and systems involved, the information describing it must mean the same thing at every step. Metadata created at capture must be understood by the repository that ingests it; reuse depends on description that travels intact. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so the information accompanying data is understood identically wherever it flows. And because curating data is a genuine, recognisable contribution, the work can be described using the same framework as any other: the CRediT taxonomy, whose Data curation role names exactly this activity. The lifecycle model shows that good data does not happen by accident; sustained curation, supported by shared description, turns data collected once into data usable for years.