Tag: data minimisation

  • Digital sustainability: the environmental cost of data storage and preservation

    The instinct in modern research is to keep everything. Storage is cheap, deletion feels risky, and the principles of openness and reproducibility seem to counsel retaining as much as possible for as long as possible. But this instinct conceals a real and growing cost. Storing data, running computations and preserving digital material for the long term all consume energy, and energy carries a carbon footprint. The cloud is not a weightless abstraction; it is data centres drawing power and demanding cooling, somewhere, continuously. As research becomes ever more data-intensive, the environmental cost of its digital life — storage, computation, preservation — can no longer be treated as invisible. Digital sustainability is the discipline of taking that cost seriously, and it is the subject of this article, which draws on the sustainable-research domain of the CASRAI Dictionary.

    The hidden cost of keeping everything

    The first thing digital sustainability asks us to see is that “keep it just in case” is not a cost-free default. Every dataset retained indefinitely occupies storage that must be powered, cooled, maintained, migrated to new media over time, and backed up — and the aggregate of countless such decisions across the research system is substantial. There is a real tension here with the open-data ideal. The drive to make data findable and reusable is valuable, but it can shade into digital hoarding: keeping vast quantities of low-value data on the vague principle that more is always better, without asking whether a dataset is worth its ongoing cost. The FAIR principles call for data to be findable and reusable — not for everything to be kept forever regardless of value. Distinguishing data worth preserving from data that need not be is itself an act of stewardship, not a betrayal of openness.

    Appraisal and data minimisation

    The practices that respond to this are appraisal and data minimisation. Appraisal — long established in the archival and records-management traditions — is the disciplined process of deciding what to keep, for how long, and what may responsibly be discarded, based on enduring value rather than reflex. Data minimisation, familiar also from data protection, is the principle of collecting and retaining only what is genuinely needed. Applied to research, these practices mean making conscious decisions: which raw data must be preserved to support published results and which intermediate files can be regenerated if ever needed; which datasets have lasting reuse value and which were transient. This is not an argument for carelessly deleting valuable data — the cost of losing irreplaceable data far exceeds the cost of storing it. It is an argument for deciding, deliberately and well, rather than defaulting to indiscriminate retention. Good appraisal keeps what matters and lets go of what does not, serving both sustainability and the long-term usability of the record.
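
    The appraisal questions above can be written down as an explicit rule, so that decisions are recorded rather than made by reflex. What follows is a minimal sketch, not an established appraisal standard: the `Dataset` fields, the `Decision` categories and the rule itself are all illustrative assumptions, and anything the rule cannot settle is routed to a human reviewer rather than deleted.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PRESERVE = "preserve"
    REVIEW = "review"
    DISCARD = "discard"


@dataclass
class Dataset:
    # All fields are hypothetical appraisal criteria chosen for illustration.
    name: str
    supports_publication: bool  # underpins a published result
    regenerable: bool           # can be recreated from preserved inputs and code
    reuse_value: bool           # judged to have lasting reuse value


def appraise(ds: Dataset) -> Decision:
    """Return a retention decision for one dataset."""
    # Irreplaceable data that backs a publication or has reuse value is kept.
    if not ds.regenerable and (ds.supports_publication or ds.reuse_value):
        return Decision.PRESERVE
    # Regenerable data with no lasting reuse value is a candidate for removal.
    if ds.regenerable and not ds.reuse_value:
        return Decision.DISCARD
    # Everything else is ambiguous and goes to a person, never to automatic deletion.
    return Decision.REVIEW


raw = Dataset("raw-survey-responses", True, False, True)
cache = Dataset("intermediate-cache", False, True, False)
decisions = {d.name: appraise(d) for d in (raw, cache)}
```

    The default branch is deliberately `REVIEW`: the sketch errs towards asking a person, in keeping with the point that losing irreplaceable data costs far more than storing it.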

    Green software and computation

    Storage is only part of the picture; computation has its own footprint. The green software movement, advanced by organisations such as the Green Software Foundation, aims to reduce the environmental impact of software itself. A central concept is Software Carbon Intensity (SCI), a specification for calculating the carbon emissions associated with running software as a rate per functional unit of work (per user, per job, per API call), so that the impact can be quantified, compared and reduced rather than guessed at. For research, the principles translate into practical questions: is a computation less efficient than it needs to be; is it run repeatedly when results could be cached; is the workload run where and when the energy is cleaner? Efficient, well-considered computation is not only cheaper and faster but less carbon-intensive, and measuring impact, as SCI encourages, is the precondition for managing it.
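
    As a rough illustration, the SCI specification expresses its score as SCI = ((E × I) + M) per R: energy consumed by the software (E), multiplied by the carbon intensity of the electricity used (I), plus embodied hardware emissions (M), divided by a functional unit (R). The sketch below computes that formula directly. The function name, parameter names and all input values are assumptions made up for the example, not measurements.

```python
def software_carbon_intensity(
    energy_kwh: float,
    grid_intensity_g_per_kwh: float,
    embodied_g: float,
    functional_units: float,
) -> float:
    """Return grams of CO2-equivalent per functional unit: ((E * I) + M) / R."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * grid_intensity_g_per_kwh  # E * I
    return (operational_g + embodied_g) / functional_units  # (E * I + M) per R


# Illustrative inputs only: 2 kWh consumed, 400 gCO2e/kWh grid intensity,
# 200 g embodied emissions attributed to the run, 100 analysis jobs.
score = software_carbon_intensity(2.0, 400.0, 200.0, 100.0)
print(score)  # 10.0 gCO2e per job
```

    Because the result is a rate, the same workload can be compared before and after a change such as caching results or moving the run to a time or region with cleaner energy.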

    Preservation that lasts: OAIS

    Sustainability is not only about using less; it is also about preserving well, so that what is kept genuinely endures and the energy spent keeping it is not wasted. The standard framework for long-term digital preservation is OAIS, the Open Archival Information System reference model, which sets out what a trustworthy digital archive must do to preserve information over the long term and keep it accessible and understandable to future users. OAIS matters to digital sustainability in two ways. First, preservation is itself an ongoing activity with an environmental cost, and doing it according to a sound model means that cost buys real durability rather than slow decay. Second, preserving fewer things well (properly described, in sustainable formats, in a trustworthy archive) is far better, environmentally and intellectually, than preserving many things badly, where data accumulates and yet quietly becomes unusable through neglect. Good preservation and disciplined appraisal are two sides of the same sustainable practice.

    Sustainability and FAIR, properly understood

    None of this is in conflict with FAIR or with open research, properly understood. FAIR is about good stewardship — making the data that is worth keeping findable, accessible, interoperable and reusable — not about hoarding. A sustainable approach is, in fact, a more honest expression of FAIR: it concentrates effort on the data that genuinely merits it, rather than spreading thin attention and real resources across everything indiscriminately. Sustainability and good data stewardship point in the same direction: keep what matters, describe it well, preserve it properly, and let go of what does not earn its keep.

    A consistent vocabulary for digital sustainability

    For sustainable practice to be applied consistently — across repositories, institutions and funders — the concepts involved, such as retention periods, appraisal decisions, preservation levels and format requirements, must be described in ways that mean the same thing everywhere. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that decisions about what to keep, how to preserve it and for how long are understood the same way wherever they are recorded. And because appraising, curating and preserving data well is genuine, skilled work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. The most sustainable digital research is not the research that stores the least, but the research that decides most carefully what is worth keeping — and then keeps it well.

  • Federated analysis: bringing computation to the data

    The default model of data analysis is straightforward: gather the data you need into one place, then run your analysis on it. For a great deal of research this works perfectly well. But for some of the most valuable data in existence — patient health records, genomic data, sensitive social and administrative registries — gathering it into one place is precisely the problem. Such data is often legally, ethically and practically impossible to move freely: it cannot be copied across borders or handed to external researchers without breaching privacy law and the trust of the people it describes. The conventional model assumes the data can come to the analysis. When it cannot, research seems stuck. Federated analysis offers a way out by inverting the model entirely, and it represents an important development in the data infrastructure domain of the CASRAI Dictionary.

    The core idea: send the code, not the data

    The central insight of federated analysis is deceptively simple: instead of bringing the data to the computation, bring the computation to the data. The data stays where it is — in the hospital, the registry, the institution that holds it and is responsible for it — and the analysis is sent to run against it in place. What travels back is not the raw data but the results of the analysis: aggregate statistics, model parameters, summaries. Multiple sites can each run the same analysis on their own local data, and the results are combined to produce an answer that draws on all of them — without any site ever exposing or releasing its underlying records. The researcher gets the benefit of analysing data from many sources; the data never leaves the places entitled to hold it. This reversal is what makes collaboration possible across data that could never be pooled.
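
    The pattern described above can be sketched for the simplest possible case, a pooled mean. Each site runs the same function on its own records and returns only a sum and a count; a coordinator combines those aggregates. The function names and the sample values are hypothetical, and real deployments add authentication, governance checks and disclosure controls that are omitted here.

```python
from typing import Dict, Iterable, List


def local_summary(values: Iterable[float]) -> Dict[str, float]:
    """Run at a site: reduce local records to aggregates that may leave the site."""
    values = list(values)
    return {"sum": sum(values), "n": len(values)}


def combine_mean(summaries: List[Dict[str, float]]) -> float:
    """Run at the coordinator: pool per-site aggregates into one overall mean."""
    total = sum(s["sum"] for s in summaries)
    n = sum(s["n"] for s in summaries)
    if n == 0:
        raise ValueError("no records at any site")
    return total / n


# Made-up measurements held at two sites; the raw lists never leave their site.
site_a = [120.0, 130.0, 125.0]
site_b = [110.0, 140.0]

summaries = [local_summary(site_a), local_summary(site_b)]
pooled_mean = combine_mean(summaries)
print(pooled_mean)  # 125.0
```

    The pooled result matches what a central analysis of all five records would give, yet the coordinator only ever sees two sums and two counts.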

    DataSHIELD

    A well-established framework embodying this approach is DataSHIELD. DataSHIELD enables the remote, non-disclosive analysis of sensitive data: researchers can run statistical analyses across data held at multiple sites without the individual-level data ever being seen or transferred. It is designed so that only aggregate, non-disclosive results are returned — the system is built to prevent queries that could expose information about individuals. DataSHIELD has been used particularly in health and biomedical research, where the data is among the most sensitive and the barriers to pooling are highest. It is a concrete demonstration that meaningful joint analysis across institutions is achievable without anyone surrendering control of their data.
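
    One common way to keep returned results non-disclosive is to refuse any summary computed over too few individuals. The sketch below illustrates that general idea only; it is not DataSHIELD's code or API, and the threshold value and function name are assumptions chosen for the example.

```python
from typing import Dict, Iterable, Optional

# Hypothetical minimum group size; in practice the data custodian sets this.
MIN_GROUP_SIZE = 5


def guarded_mean(
    values: Iterable[float], min_size: int = MIN_GROUP_SIZE
) -> Optional[Dict[str, float]]:
    """Return an aggregate only if enough individuals contribute to it."""
    values = list(values)
    if len(values) < min_size:
        # A mean over very few people could reveal an individual's value,
        # so the request is refused and nothing is returned.
        return None
    return {"mean": sum(values) / len(values), "n": len(values)}


print(guarded_mean([1.0, 2.0]))                 # None: group too small
print(guarded_mean([1.0, 2.0, 3.0, 4.0, 5.0]))  # {'mean': 3.0, 'n': 5}
```

    A size threshold on its own does not stop every disclosive query (for instance, differencing two overlapping requests), which is why a production system needs a fuller set of safeguards.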

    The Personal Health Train

    Another influential conception is the Personal Health Train (PHT), which offers a memorable metaphor for the same principle. In this image, the data stays in “stations” — the institutions that hold it — and analyses travel between them like “trains” that visit each station, run their computation on the local data, and move on, carrying results rather than data. The Personal Health Train frames federated analysis as an infrastructure pattern: a way of organising data and analyses so that the data remains under the governance of its custodians while still being available, in a controlled way, for legitimate research. It emphasises that the data custodians retain authority — deciding which analyses may visit and run — which is essential for maintaining trust and meeting legal obligations. The metaphor has helped communicate the concept to the clinical and governance communities whose buy-in federated approaches require.
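
    The train-and-station image can be sketched in a few lines. In this toy version a train carries a running aggregate from station to station, and each station decides for itself whether the train may run. The class, the approval list and the sample records are hypothetical illustrations of the pattern, not part of any Personal Health Train implementation.

```python
from typing import Callable, Dict, Iterable, List


class Station:
    """A data custodian: holds records locally and decides which trains may run."""

    def __init__(self, name: str, records: List[float], approved: Iterable[str]):
        self.name = name
        self._records = records          # never handed to the train's owner
        self._approved = set(approved)   # train IDs this custodian has approved

    def admit(self, train_id: str, analysis: Callable, carried: Dict) -> Dict:
        if train_id not in self._approved:
            return carried               # refused: the train moves on unchanged
        return analysis(self._records, carried)


def count_and_sum(records: List[float], carried: Dict) -> Dict:
    """The train's payload: update a running count and sum with local data."""
    return {"n": carried["n"] + len(records), "sum": carried["sum"] + sum(records)}


stations = [
    Station("A", [1, 2, 3], approved={"train-1"}),
    Station("B", [10], approved=set()),       # this custodian has not approved it
    Station("C", [4, 5], approved={"train-1"}),
]

carried = {"n": 0, "sum": 0}
for station in stations:
    carried = station.admit("train-1", count_and_sum, carried)

print(carried)  # {'n': 5, 'sum': 15}
```

    Station B's record is absent from the result because its custodian never approved the train, which is the governance point the metaphor is meant to convey.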

    Federated learning

    A closely related idea, prominent in machine learning, is federated learning: training a model across multiple decentralised data sources without centralising the data. Each site trains on its own local data and shares only model updates, which are combined to build a model that has effectively learned from all the data without any of it being gathered together. Federated learning applies the bring-computation-to-the-data principle to the training of models specifically, and it has attracted intense interest precisely because so much of the data that would make models better is data that cannot be pooled. It is the same philosophy — keep the data local, move only what is non-disclosive — applied to a particularly data-hungry kind of computation.
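
    One simple way to combine the shared updates is a weighted average of each site's model parameters, with weights proportional to the number of local training examples, in the style of federated averaging. The sketch below shows only that aggregation step; local training is omitted, and the function name, parameter vectors and sample counts are invented for illustration.

```python
from typing import List, Tuple


def federated_average(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Combine per-site parameters, weighting each site by its sample count.

    Each update is (parameters, n_samples). Only these values are shared;
    the training data itself stays at the site.
    """
    if not updates:
        raise ValueError("no updates to combine")
    total = sum(n for _, n in updates)
    if total == 0:
        raise ValueError("sites reported zero samples")
    dim = len(updates[0][0])
    if any(len(params) != dim for params, _ in updates):
        raise ValueError("all sites must share the same model shape")
    return [
        sum(params[i] * n for params, n in updates) / total
        for i in range(dim)
    ]


# Made-up parameters from two sites after one round of local training.
site_updates = [
    ([1.0, 2.0], 100),  # site with 100 local examples
    ([3.0, 4.0], 300),  # site with 300 local examples
]
global_params = federated_average(site_updates)
print(global_params)  # [2.5, 3.5]
```

    The larger site pulls the combined parameters towards its own values, which is the intended effect of weighting by sample count.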

    Data minimisation by design

    What ties these approaches together is the principle of data minimisation: the idea that you should use and move the minimum data necessary for a given purpose. Federated analysis is, in a sense, data minimisation built into the architecture. Rather than copying entire datasets around and trusting everyone downstream to handle them responsibly, it ensures that the sensitive data simply never moves, and that only the minimal, non-disclosive results are shared. This has clear advantages:

    • Privacy. Individuals’ records stay protected because they are never exposed or transferred.
    • Governance. Data custodians retain control and can meet their legal and ethical obligations to the people whose data they hold.
    • Scale. Research can draw on data from many institutions and jurisdictions that could never agree to pool their data centrally.

    Working with data that cannot be open

    Federated analysis sits within the broader challenge of doing valuable research on data that cannot be fully open. It is a powerful answer to the question of how sensitive data can be reused for the public good without being exposed: the data can be analysed and learned from while remaining as protected as it must be. This complements, rather than replaces, controlled-access arrangements and secure environments; it is another tool for reconciling the duty to protect with the desire to discover. Sound research administration increasingly has to account for these arrangements when planning sensitive-data projects.

    A consistent vocabulary for federated work

    For federated analysis to work across institutions, the descriptions of what is being analysed and shared must be consistent. Data dictionaries must align so that a variable means the same thing at every station; access conditions, governance terms and the nature of returned results must be described in compatible ways, or a federated analysis cannot reliably combine results across sites. That consistency is what the CASRAI Dictionary supports: a shared vocabulary so that the metadata describing federated data and analyses is understood identically wherever it travels. And because building, running and curating federated analyses is a genuine contribution, the work can be described in the same framework used for every other kind of contribution: the CRediT taxonomy and its set of contribution roles. Federated analysis shows that the choice between using data and protecting it is sometimes a false one: with the right architecture, you can do both.