Tag: sensitive data

  • Cybersecurity for sensitive research: protecting data and infrastructure

    When people speak of research security, they often mean the screening of partnerships and personnel for risks of foreign interference, undue influence or illegitimate transfer of knowledge. That is one important strand. But beneath it lies a distinct and equally vital discipline that deserves to be considered in its own right: the cybersecurity of research — protecting research data, systems and infrastructure from compromise, theft, tampering and disruption. A laboratory can pass every partnership review and still lose its most valuable data to an intrusion, a ransomware attack or a misconfigured server. Foreign-interference screening asks who you work with; research cybersecurity asks how well you protect what you hold. This article treats the second question as a discipline in its own right, drawing on the research-security domain of the CASRAI Dictionary.

    Why research is a target

    Research environments are attractive targets and, historically, soft ones. They hold things of real value: unpublished findings, novel methods, valuable datasets, intellectual property with commercial or strategic worth, and sensitive data about people. At the same time, the culture of research — open, collaborative, internationally connected, organised around sharing rather than locking down — can sit uneasily with rigorous security practice, and academic systems are often diverse, decentralised and unevenly maintained. The consequences of compromise are serious: data can be stolen, results can be quietly altered (undermining their integrity in ways that may not be detected), systems can be held to ransom, and the trust of participants whose data was promised protection can be betrayed. Recognising research as a genuine target is the first step towards protecting it, and it reframes cybersecurity not as an IT inconvenience but as a condition of doing trustworthy research at all.

    Classifying data by sensitivity

    Effective protection begins with knowing what you have and how sensitive it is. Data classification is the practice of sorting data into categories according to how damaging its exposure or loss would be — from open data that can be freely shared, through internal data, to controlled or sensitive data requiring strict protection. Classification matters because not all data warrants the same controls, and trying to protect everything to the highest standard is neither practical nor wise. By identifying which data is genuinely sensitive — personal data, controlled information, commercially or strategically valuable material — an organisation can apply proportionate safeguards: the strongest controls where the stakes are highest, lighter touch where data is open. Classification is the foundation on which every other control rests, because you cannot protect appropriately what you have not first understood.
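
    The idea of proportionate safeguards can be sketched in a few lines of code. The example below is illustrative only: the three tiers follow the categories named above, but the tier names, the `classify` rules and the control names in `CONTROLS` are assumptions made for the sketch, not terms defined by any standard or by the CASRAI Dictionary.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Illustrative tiers, ordered from least to most sensitive."""
    OPEN = 0
    INTERNAL = 1
    CONTROLLED = 2


# Hypothetical baseline controls per tier; a real policy is set by the organisation.
CONTROLS = {
    Sensitivity.OPEN: {"integrity_checks"},
    Sensitivity.INTERNAL: {"integrity_checks", "authenticated_access"},
    Sensitivity.CONTROLLED: {
        "integrity_checks",
        "authenticated_access",
        "encryption_at_rest",
        "access_logging",
    },
}


def classify(contains_personal_data: bool, externally_controlled: bool, published: bool) -> Sensitivity:
    """Assign the highest tier that any attribute of the dataset triggers."""
    if contains_personal_data or externally_controlled:
        return Sensitivity.CONTROLLED
    if published:
        return Sensitivity.OPEN
    return Sensitivity.INTERNAL


def required_controls(level: Sensitivity) -> set[str]:
    """Look up the proportionate safeguards for a classification level."""
    return set(CONTROLS[level])


if __name__ == "__main__":
    level = classify(contains_personal_data=True, externally_controlled=False, published=False)
    print(level.name, sorted(required_controls(level)))
```

    The point of the design is that controls are looked up from the classification rather than decided dataset by dataset, so the strongest safeguards land only where the tier demands them.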

    The NIST frameworks

    Among the most influential tools for organising cybersecurity are those from the United States National Institute of Standards and Technology (NIST). The NIST Cybersecurity Framework provides a widely adopted structure for managing cybersecurity risk, organised around core functions — broadly, identifying assets and risks, protecting them, detecting incidents, responding to them and recovering. Its value is that it gives an organisation a coherent way to think about its whole security posture rather than a scattered set of technical fixes. For research handling certain categories of controlled information, NIST SP 800-171 is especially relevant: it sets out requirements for protecting controlled unclassified information (CUI) in non-federal systems, and compliance with it is often a condition of holding certain sensitive or government-related research data. Where a project handles such data, meeting these requirements is not optional good practice but a contractual and sometimes legal obligation.

    ISO/IEC 27001 and information security management

    Internationally, the dominant standard is ISO/IEC 27001, which specifies requirements for an information security management system — a systematic, organisation-wide approach to managing information security risks through policies, controls and continual improvement. Rather than prescribing a fixed checklist, ISO/IEC 27001 requires an organisation to assess its risks and implement appropriate controls, and to manage security as an ongoing process subject to review and improvement. Certification against it provides external assurance that an organisation manages information security to a recognised standard, which can matter when research partners, funders or data providers need confidence that data they share will be properly protected. Whereas the NIST framework offers a structure for thinking about risk and SP 800-171 a set of requirements for a specific data category, ISO/IEC 27001 provides a management system for security as a whole — and the three are frequently used together.

    Where cybersecurity meets trusted research

    Research cybersecurity does not sit apart from the broader research-security agenda; it underpins it. The Trusted Research approach, which helps researchers collaborate internationally while managing risk, depends on sound information security as one of its foundations — there is little point screening a partnership for risk if the data at stake is left poorly protected. Protecting sensitive data also intersects with the governance of controlled-access data: secure infrastructure, classification and access control are what make it possible to hold and reuse sensitive data responsibly rather than either exposing it or refusing to use it at all. Cybersecurity is thus the practical backbone that lets research be both open where it can be and protected where it must be.

    A consistent vocabulary for protection

    For data sensitivity and protection requirements to be respected as data moves between institutions, collaborators and systems, the terms involved — classification levels, access conditions, sensitivity categories, control requirements — must mean the same thing everywhere. A dataset marked “controlled” in one system must be understood the same way in the next, or its protection breaks down at the boundary. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that the metadata describing how data must be protected travels intact. And because securing and stewarding research data is genuine, skilled work, it can be described in the same shared framework as any other contribution — the CRediT taxonomy and the wider apparatus of research administration. Screening who you work with is necessary; protecting what you hold is just as necessary — and cybersecurity is the discipline that makes the second possible.

  • Sensitive and controlled-access data: FAIR for data that cannot be fully open

    The push for open research data has been one of the defining movements in scholarly practice, and rightly so: openly available data is easier to verify, reuse and build upon. But an unqualified call to make all data open runs into an immovable obstacle. A great deal of research data is sensitive — patient records, genetic information, data about vulnerable people, commercially confidential material, data whose release could cause harm — and such data cannot simply be posted on the open web without breaching the law, betraying participants’ trust, or endangering people. The challenge is not to choose between openness and protection but to honour both: to make sensitive data as accessible as it responsibly can be while keeping it as protected as it must be. This article looks at how that balance is struck, drawing on the compliance and regulatory domain of the CASRAI Dictionary.

    As open as possible, as closed as necessary

    The principle that has come to govern this territory is captured in a single phrase: data should be “as open as possible, as closed as necessary”. The phrase does real work. It establishes openness as the default and the goal — the burden falls on reasons to restrict, not on reasons to share. But it also acknowledges, plainly, that necessity sometimes requires closure, and that protecting people and honouring legal and ethical obligations is not a failure of openness but a condition of doing research responsibly. The aim, then, is not a binary of open versus closed but a spectrum of access arrangements, each calibrated to what a particular dataset requires. Sensitive data does not fall off the map of good data practice; it occupies a different, carefully governed part of it.

    FAIR does not mean open

    A common misconception is that the FAIR principles — Findable, Accessible, Interoperable, Reusable — are a synonym for “open”. They are not, and the distinction matters most for sensitive data. FAIR is about good stewardship and discoverability, not unconditional availability. Sensitive data can and should be made findable: its existence, described by rich metadata, can be advertised openly even when the data itself is restricted, so that researchers know it exists and could request it. It can be made accessible in the FAIR sense — meaning that the procedure for obtaining access is clearly defined and the conditions are transparent — even when access is granted only to approved requesters under controlled conditions. And it can be made interoperable and reusable through standardised description and clear licensing. The key move is to separate the metadata, which can be fully open, from the data, whose access is controlled. Open metadata over protected data is the architecture that lets sensitive data participate in the FAIR ecosystem without being exposed.
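
    As a rough illustration of open metadata over protected data, the sketch below keeps one catalogue record per dataset and exposes two views of it. The field names, the `"open"` and `"controlled"` values and the placeholder identifier are all hypothetical; real repositories use established metadata schemas and a proper authorisation process rather than a boolean flag.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DatasetRecord:
    """A catalogue entry: descriptive fields are public, the data location is not."""
    identifier: str        # persistent identifier (placeholder values only)
    title: str
    description: str
    access_level: str      # "open" or "controlled" (illustrative values)
    access_procedure: str  # how a researcher requests access
    storage_location: str  # internal only; never published


PRIVATE_FIELDS = {"storage_location"}


def public_metadata(record: DatasetRecord) -> dict[str, str]:
    """Return the openly publishable view of a record."""
    return {key: value for key, value in asdict(record).items() if key not in PRIVATE_FIELDS}


def resolve_data(record: DatasetRecord, requester_approved: bool) -> str:
    """Release the data location only for open data or approved requesters."""
    if record.access_level == "open" or requester_approved:
        return record.storage_location
    raise PermissionError(f"Access is controlled: {record.access_procedure}")
```

    Anyone can call `public_metadata` and learn that the dataset exists and how to apply for it; only `resolve_data` touches the protected part, and it refuses unless the access conditions are met.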

    Controlled access and data-access committees

    The mechanism that delivers this is controlled access. Rather than downloading the data freely, a researcher applies for it, stating who they are, what they intend to do, and agreeing to conditions on use. The application is assessed — often by a data-access committee, a body charged with deciding whether a proposed use is legitimate, ethical, and consistent with the consent under which the data were collected. Approved access typically comes with safeguards: data-use agreements that bind the recipient, restrictions on re-identification and onward sharing, and increasingly the requirement to analyse the data within a secure environment rather than taking a copy away. These arrangements let valuable data be reused while keeping the people behind it protected and the original consent respected. The committee and the agreement are not bureaucratic obstacles for their own sake; they are the means by which trust is maintained between research and the people whose data make it possible.

    Synthetic data as a bridge

    One increasingly important technique deserves attention: synthetic data. Synthetic data is artificially generated to resemble a real dataset’s structure and statistical properties without containing any real individual’s information. Because it contains no real records, it can often be shared far more openly than the sensitive data it mirrors. Its value is practical: researchers can develop and test their analysis code against synthetic data, others can understand a dataset’s shape before applying for the real thing, and methods can be demonstrated without exposing anyone. Synthetic data is not a perfect substitute — conclusions must ultimately be drawn from real data, and a poorly generated synthetic set can mislead — but as a bridge between the need to share and the duty to protect, it is a genuinely useful addition to the toolkit.
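
    A deliberately naive sketch shows the basic mechanics. It summarises each column of a small table and then samples every column independently, which is an assumption made for brevity: the approach reproduces each column's rough distribution but loses relationships between columns, and it carries no formal privacy guarantee (rare categories, for example, still appear in the output). The function names are hypothetical, and production tools use considerably more careful methods.

```python
import random
import statistics
from collections import Counter


def fit_marginals(rows: list[dict]) -> dict:
    """Summarise each column: mean/stdev for numbers, value frequencies otherwise.

    Needs at least two rows, because the sample standard deviation is undefined for one.
    """
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            counts = Counter(values)
            model[column] = ("categorical", list(counts), list(counts.values()))
    return model


def sample_synthetic(model: dict, n: int, seed: int = 0) -> list[dict]:
    """Draw n artificial rows, sampling every column independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                _, mean, stdev = spec
                row[column] = rng.gauss(mean, stdev)
            else:
                _, categories, weights = spec
                row[column] = rng.choices(categories, weights=weights, k=1)[0]
        rows.append(row)
    return rows
```

    Output of this kind is enough to develop and test analysis code against the right column names and types, which is the use described above; it is not a basis for drawing conclusions.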

    The role of secure infrastructure

    Making controlled access work at scale depends on the infrastructure that supports it: trusted repositories that hold sensitive data securely, secure analysis environments where data can be worked on without being copied out, and the identifier and metadata systems that let restricted data be described openly and cited when used. This is the territory of the data infrastructure domain, and it is what turns the principle of controlled access from an aspiration into a practical reality. Without secure places to hold the data and clear ways to describe it, the careful balance of access and protection cannot be maintained.

    A consistent vocabulary for access and protection

    For all of this to function across institutions, funders and repositories, the terms involved must mean the same thing everywhere. Access conditions, consent categories, licence terms and protection requirements have to be described consistently, or a dataset marked as controlled-access in one system will be misunderstood in another, with real consequences when the data are sensitive. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata describing how sensitive data may be accessed and reused is understood identically wherever it appears. And because curating and stewarding controlled-access data is a genuine, recognisable contribution, that work can be described using the same framework as any other: the CRediT taxonomy and its full set of contribution roles. Sensitive data is not a problem to be hidden but a resource to be governed; done well, governance is what lets research honour both openness and the people it serves.

  • Trusted Research Environments and the Five Safes: working safely with sensitive data

    Some of the most valuable research data in existence — linked health records, administrative data about whole populations, tax and benefit records, detailed information about individuals’ lives — is also some of the most sensitive. It can answer questions nothing else can, yet it cannot responsibly be copied, emailed or downloaded to an analyst’s machine, because doing so would scatter highly personal information across uncontrolled devices and betray the trust of the people it describes. The dominant answer to this dilemma is to invert the usual model: instead of bringing the data to the researcher, bring the researcher to the data. This is what a Trusted Research Environment does, and the Five Safes framework is the structure that lets everyone reason about whether such an arrangement is genuinely safe. Both sit within the research security domain of the CASRAI Dictionary.

    What a Trusted Research Environment is

    A Trusted Research Environment (TRE), also called a Secure Data Environment or secure data enclave, is a controlled computing setting in which approved researchers can analyse sensitive data without being able to remove it. The data stays inside; the analyst logs in remotely and works on it through the environment’s own tools. Code and queries run against the data within the secure walls, and only checked, aggregated results — never the raw records — are permitted to leave. The shift is profound. In the old model, access meant possession; once you held a copy, the custodian had lost control of it. In the TRE model, access is separated from possession: researchers can do everything they need to do with the data while never holding it. That separation is what makes it possible to grant meaningful access to genuinely sensitive material without accepting the risk that it leaks.

    The Five Safes framework

    A secure environment is necessary but not sufficient. Safe use of sensitive data depends on far more than the technology, and the Five Safes framework, developed originally at the UK’s Office for National Statistics and now used internationally, captures the full set of dimensions that have to be managed together:

    • Safe people. Are the researchers trustworthy, trained and accountable? Access is granted to vetted, often accredited individuals who understand and accept their obligations.
    • Safe projects. Is the proposed use appropriate, lawful and in the public interest? Each project is assessed before access is granted, not waved through.
    • Safe settings. Does the environment itself prevent unauthorised access or removal of data? This is the TRE’s technical and physical security.
    • Safe data. Has the data been treated to reduce risk — minimised, de-identified or otherwise protected to the degree the project requires?
    • Safe outputs. Are the results that leave the environment checked to ensure they cannot reveal anything about an individual? This is statistical disclosure control on the way out.

    The power of the framework is that it makes risk a property of the whole system rather than of any single control. A relaxation on one dimension can be deliberately balanced by tightening another, but because each dimension is assessed explicitly, a weakness on one cannot simply be hidden by strength elsewhere. It gives data custodians, researchers and the public a shared language for asking, and answering, “is this safe?”
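
    The "safe outputs" dimension is the easiest to illustrate in code. One common form of statistical disclosure control is to suppress small cells in a frequency table before it leaves the environment. The sketch below assumes a threshold of 10, which is a hypothetical value (each environment sets its own rules), and it handles primary suppression only: it does not deal with cases where a suppressed count could be recovered from published totals.

```python
from collections import Counter
from typing import Optional

MIN_CELL_COUNT = 10  # hypothetical threshold; real environments set their own rules


def frequency_table(records: list[dict], column: str) -> dict[str, int]:
    """Count how many records fall into each category of a column."""
    return dict(Counter(record[column] for record in records))


def suppress_small_cells(
    table: dict[str, int], threshold: int = MIN_CELL_COUNT
) -> dict[str, Optional[int]]:
    """Replace counts below the threshold with None so they are not released."""
    return {
        category: (count if count >= threshold else None)
        for category, count in table.items()
    }
```

    A check like this is typically one step among several; in practice, output checking also involves human review of what is being released.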

    Real environments in practice

    These ideas are not theoretical. Several established environments demonstrate the model at scale. The ONS Secure Research Service provides accredited researchers with secure access to de-identified data for projects serving the public good. The SAIL Databank in Wales links and provides anonymised population data within a trusted environment for health and population research. OpenSAFELY, created during the COVID-19 pandemic, took the principle further: rather than moving records into an environment at all, it lets researchers run analysis code against electronic health records inside the secure systems where those records already live, with all the code published openly for scrutiny. Bodies such as Health Data Research UK (HDR UK) have worked to align practice across such environments so that they meet common expectations rather than each inventing its own rules. Together these show that the model works: society can extract enormous research value from sensitive data while keeping faith with the people behind it.

    Transparency as a safeguard

    One feature of the more advanced environments deserves emphasis, because it marks a real advance in trustworthiness: transparency of analysis. When the code that runs against sensitive data is itself published, anyone can see exactly what was done. This serves two ends at once. It makes the research reproducible and auditable, which is good scientific practice. And it provides public accountability for the use of data the public has entrusted to researchers — people can see what is being done with information about them. Transparency does not weaken security; it strengthens the social licence on which the whole enterprise depends. The most defensible position is not secrecy about what is done with sensitive data, but openness about the method combined with strict control of the data itself.

    How TREs relate to open data

    It would be a mistake to read TREs as a retreat from open research; they are better understood as the mechanism that lets sensitive data participate in good data practice at all. The metadata describing what a TRE holds can be openly published, so researchers know the data exists and can apply to use it; the analysis code can be open; the results, once disclosure-checked, are shared. What stays controlled is only the irreducible core — the personal records themselves. This is the familiar principle of being as open as possible and as closed as necessary, made operational. The wider questions of working with controlled material are explored in our writing on research administration.

    A consistent vocabulary for safe access

    For TREs to interoperate and for the Five Safes to be applied consistently, the terms involved (access conditions, accreditation status, output-checking requirements, data sensitivity categories) must mean the same thing across institutions and environments. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the governance information surrounding sensitive data is understood identically wherever it travels. And because analysing data within a TRE is a genuine, recognisable research contribution, the work can be described using the same framework as any other: the CRediT taxonomy and its full set of contribution roles. Trusted Research Environments and the Five Safes together show that protecting people and enabling discovery are not opposing goals but two halves of doing sensitive research well.

  • Federated analysis: bringing computation to the data

    The default model of data analysis is straightforward: gather the data you need into one place, then run your analysis on it. For a great deal of research this works perfectly well. But for some of the most valuable data in existence — patient health records, genomic data, sensitive social and administrative registries — gathering it into one place is precisely the problem. Such data is often legally, ethically and practically impossible to move freely: it cannot be copied across borders or handed to external researchers without breaching privacy law and the trust of the people it describes. The conventional model assumes the data can come to the analysis. When it cannot, research seems stuck. Federated analysis offers a way out by inverting the model entirely, and it represents an important development in the data infrastructure domain of the CASRAI Dictionary.

    The core idea: send the code, not the data

    The central insight of federated analysis is deceptively simple: instead of bringing the data to the computation, bring the computation to the data. The data stays where it is — in the hospital, the registry, the institution that holds it and is responsible for it — and the analysis is sent to run against it in place. What travels back is not the raw data but the results of the analysis: aggregate statistics, model parameters, summaries. Multiple sites can each run the same analysis on their own local data, and the results are combined to produce an answer that draws on all of them — without any site ever exposing or releasing its underlying records. The researcher gets the benefit of analysing data from many sources; the data never leaves the places entitled to hold it. This reversal is what makes collaboration possible across data that could never be pooled.
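
    A minimal sketch makes the pattern concrete for the simplest case, a pooled mean and variance. Each site computes three aggregates from its own records, and the coordinator combines them into exactly the statistics it would have obtained from the pooled data. The function names are hypothetical, and the sketch assumes each site holds enough records that the aggregates themselves are non-disclosive; a real deployment would enforce that.

```python
def local_summary(values: list[float]) -> dict[str, float]:
    """Run at each site: return only aggregates, never the records."""
    return {
        "n": len(values),
        "total": sum(values),
        "total_sq": sum(v * v for v in values),
    }


def combine(summaries: list[dict[str, float]]) -> dict[str, float]:
    """Run by the coordinator: pool the per-site aggregates."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    total_sq = sum(s["total_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return {"n": n, "mean": mean, "variance": variance}


if __name__ == "__main__":
    site_a = [1.0, 2.0, 3.0]       # stays at site A
    site_b = [4.0, 5.0, 6.0, 7.0]  # stays at site B
    print(combine([local_summary(site_a), local_summary(site_b)]))
```

    Only the count, sum and sum of squares cross the site boundary. The sum-of-squares formula is chosen here for clarity; it can lose numerical precision on large values, so a careful implementation would use a more stable way of combining variances.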

    DataSHIELD

    A well-established framework embodying this approach is DataSHIELD. DataSHIELD enables the remote, non-disclosive analysis of sensitive data: researchers can run statistical analyses across data held at multiple sites without the individual-level data ever being seen or transferred. It is designed so that only aggregate, non-disclosive results are returned — the system is built to prevent queries that could expose information about individuals. DataSHIELD has been used particularly in health and biomedical research, where the data is among the most sensitive and the barriers to pooling are highest. It is a concrete demonstration that meaningful joint analysis across institutions is achievable without anyone surrendering control of their data.

    The Personal Health Train

    Another influential conception is the Personal Health Train (PHT), which offers a memorable metaphor for the same principle. In this image, the data stays in “stations” — the institutions that hold it — and analyses travel between them like “trains” that visit each station, run their computation on the local data, and move on, carrying results rather than data. The Personal Health Train frames federated analysis as an infrastructure pattern: a way of organising data and analyses so that the data remains under the governance of its custodians while still being available, in a controlled way, for legitimate research. It emphasises that the data custodians retain authority — deciding which analyses may visit and run — which is essential for maintaining trust and meeting legal obligations. The metaphor has helped communicate the concept to the clinical and governance communities whose buy-in federated approaches require.
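
    The station-and-train image translates naturally into a small sketch. The class and function names below are hypothetical and do not correspond to any actual Personal Health Train implementation. The sketch shows only the governance point made above: each station keeps its own allow-list and refuses trains it has not approved. It does not inspect what a train returns, which a real system would also need to do.

```python
from typing import Callable

# A train is an analysis that maps a station's local records to a result.
Train = Callable[[list[dict]], dict]


class Station:
    """A data custodian that holds records and decides which trains may run."""

    def __init__(self, name: str, records: list[dict], approved_trains: set[str]):
        self.name = name
        self._records = records           # held locally, never handed out directly
        self._approved = approved_trains  # the custodian's allow-list

    def visit(self, train_id: str, train: Train) -> dict:
        """Run an approved train on the local data and hand back only its result."""
        if train_id not in self._approved:
            raise PermissionError(f"{self.name} has not approved train {train_id!r}")
        return train(self._records)


def count_over_60(records: list[dict]) -> dict:
    """Example train: an aggregate count that carries no individual records."""
    return {"n": len(records), "over_60": sum(1 for r in records if r["age"] > 60)}


def run_route(train_id: str, train: Train, stations: list[Station]) -> list[dict]:
    """Send the train to each station in turn, skipping those that refuse it."""
    results = []
    for station in stations:
        try:
            results.append(station.visit(train_id, train))
        except PermissionError:
            continue
    return results
```

    Authority sits with the station rather than the route: a train that one custodian accepts can still be refused by the next.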

    Federated learning

    A closely related idea, prominent in machine learning, is federated learning: training a model across multiple decentralised data sources without centralising the data. Each site trains on its own local data and shares only model updates, which are combined to build a model that has effectively learned from all the data without any of it being gathered together. Federated learning applies the bring-computation-to-the-data principle to the training of models specifically, and it has attracted intense interest precisely because so much of the data that would make models better is data that cannot be pooled. It is the same philosophy — keep the data local, move only what is non-disclosive — applied to a particularly data-hungry kind of computation.
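
    The update-and-combine loop can be sketched with a one-parameter model. The example below fits y = w * x by having each site run a few gradient steps on its own data and then averaging the resulting weights in proportion to site size, in the style of federated averaging. The function names, learning rate and step counts are illustrative assumptions, and the sketch omits protections that are often added in practice, such as secure aggregation of the updates.

```python
def local_update(
    weight: float, data: list[tuple[float, float]], lr: float = 0.01, steps: int = 10
) -> float:
    """Run at each site: refine the shared weight on local (x, y) pairs for y = w * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * gradient
    return weight


def federated_round(weight: float, sites: list[list[tuple[float, float]]]) -> float:
    """One round: every site trains locally; only the updated weights are averaged."""
    updates = [(local_update(weight, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(sites: list[list[tuple[float, float]]], rounds: int = 50) -> float:
    """Repeat federated rounds, starting from a weight of zero."""
    weight = 0.0
    for _ in range(rounds):
        weight = federated_round(weight, sites)
    return weight
```

    The coordinator in `federated_round` sees one number and one record count per site, never the (x, y) pairs themselves.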

    Data minimisation by design

    What ties these approaches together is the principle of data minimisation: the idea that you should use and move the minimum data necessary for a given purpose. Federated analysis is, in a sense, data minimisation built into the architecture. Rather than copying entire datasets around and trusting everyone downstream to handle them responsibly, it ensures that the sensitive data simply never moves, and that only the minimal, non-disclosive results are shared. This has clear advantages:

    • Privacy. Individuals’ records stay protected because they are never exposed or transferred.
    • Governance. Data custodians retain control and can meet their legal and ethical obligations to the people whose data they hold.
    • Scale. Research can draw on data from many institutions and jurisdictions that could never agree to pool their data centrally.

    Working with data that cannot be open

    Federated analysis sits within the broader challenge of doing valuable research on data that cannot be fully open. It is a powerful answer to the question of how sensitive data can be reused for the public good without being exposed: the data can be analysed and learned from while remaining as protected as it must be. This complements, rather than replaces, controlled-access arrangements and secure environments; it is another tool for reconciling the duty to protect with the desire to discover. Sound research administration increasingly has to account for these arrangements when planning sensitive-data projects.

    A consistent vocabulary for federated work

    For federated analysis to work across institutions, the descriptions of what is being analysed and shared must be consistent. Data dictionaries must align so that a variable means the same thing at every station; access conditions, governance terms and the nature of returned results must be described in compatible ways, or a federated analysis cannot reliably combine results across sites. That consistency is what the CASRAI Dictionary supports: a shared vocabulary so that the metadata describing federated data and analyses is understood identically wherever it travels. And because building, running and curating federated analyses is a genuine contribution, the work can be described in the same framework used for any other: the CRediT taxonomy and its set of contribution roles. Federated analysis shows that the choice between using data and protecting it is sometimes a false one: with the right architecture, you can do both.