  • Sensitive and controlled-access data: FAIR for data that cannot be fully open

    The push for open research data has been one of the defining movements in scholarly practice, and rightly so: openly available data is easier to verify, reuse and build upon. But an unqualified call to make all data open runs into an immovable obstacle. A great deal of research data is sensitive — patient records, genetic information, data about vulnerable people, commercially confidential material, data whose release could cause harm — and such data cannot simply be posted on the open web without breaching the law, betraying participants’ trust, or endangering people. The challenge is not to choose between openness and protection but to honour both: to make sensitive data as accessible as it responsibly can be while keeping it as protected as it must be. This article looks at how that balance is struck, drawing on the compliance and regulatory domain of the CASRAI Dictionary.

    As open as possible, as closed as necessary

    The principle that has come to govern this territory is captured in a single phrase: data should be “as open as possible, as closed as necessary”. The phrase does real work. It establishes openness as the default and the goal — the burden falls on reasons to restrict, not on reasons to share. But it also acknowledges, plainly, that necessity sometimes requires closure, and that protecting people and honouring legal and ethical obligations is not a failure of openness but a condition of doing research responsibly. The aim, then, is not a binary of open versus closed but a spectrum of access arrangements, each calibrated to what a particular dataset requires. Sensitive data does not fall off the map of good data practice; it occupies a different, carefully governed part of it.

    FAIR does not mean open

    A common misconception is that the FAIR principles — Findable, Accessible, Interoperable, Reusable — are a synonym for “open”. They are not, and the distinction matters most for sensitive data. FAIR is about good stewardship and discoverability, not unconditional availability. Sensitive data can and should be made findable: its existence, described by rich metadata, can be advertised openly even when the data itself is restricted, so that researchers know it exists and could request it. It can be made accessible in the FAIR sense — meaning that the procedure for obtaining access is clearly defined and the conditions are transparent — even when access is granted only to approved requesters under controlled conditions. And it can be made interoperable and reusable through standardised description and clear licensing. The key move is to separate the metadata, which can be fully open, from the data, whose access is controlled. Open metadata over protected data is the architecture that lets sensitive data participate in the FAIR ecosystem without being exposed.
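
    The separation of open metadata from protected data can be sketched as a small record type. This is an illustrative sketch only: the field names, the identifier and the access-level values are assumptions made for the example, not terms from the CASRAI Dictionary or any repository's schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DatasetRecord:
    """Open description of a dataset whose files may stay restricted.

    All field names here are hypothetical, chosen for illustration.
    """
    identifier: str        # hypothetical persistent identifier
    title: str
    description: str
    access_level: str      # illustrative values: "open" or "controlled"
    access_procedure: str  # how a researcher applies for access
    licence: str


def public_view(record: DatasetRecord) -> dict:
    """Return the metadata that can be advertised openly.

    The record holds no data values, only the description and access route.
    """
    return asdict(record)


def can_download_directly(record: DatasetRecord) -> bool:
    """Only open datasets bypass the application procedure."""
    return record.access_level == "open"


record = DatasetRecord(
    identifier="example:dataset-001",
    title="Example patient cohort",
    description="Clinical measurements for a hypothetical cohort.",
    access_level="controlled",
    access_procedure="Apply to the data-access committee.",
    licence="Custom data-use agreement",
)

print(public_view(record)["title"])
print(can_download_directly(record))
```

    The record is findable and its access procedure is transparent, yet nothing in it exposes the underlying data.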

    Controlled access and data-access committees

    The mechanism that delivers this is controlled access. Rather than downloading the data freely, a researcher applies for it, stating who they are and what they intend to do, and agreeing to conditions on use. The application is assessed, often by a data-access committee, a body charged with deciding whether a proposed use is legitimate, ethical, and consistent with the consent under which the data were collected. Approved access typically comes with safeguards: data-use agreements that bind the recipient, restrictions on re-identification and onward sharing, and increasingly the requirement to analyse the data within a secure environment rather than taking a copy away. These arrangements let valuable data be reused while keeping the people behind it protected and the original consent respected. The committee and the agreement are not bureaucratic obstacles for their own sake; they are the means by which trust is maintained between research and the people whose data make it possible.
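
    The checkable preconditions of such an application can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class, its fields and the wording of the reasons are hypothetical, and a real data-access committee exercises judgement that no checklist captures.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class AccessRequest:
    """A hypothetical access application; field names are illustrative."""
    applicant: str
    purpose: str
    accepts_data_use_agreement: bool
    agrees_no_reidentification: bool
    will_use_secure_environment: bool


def assess(request: AccessRequest,
           consented_purposes: Set[str],
           secure_env_required: bool = True) -> Tuple[bool, List[str]]:
    """Return (approved, reasons_for_refusal) for the mechanical checks only."""
    reasons = []
    if request.purpose not in consented_purposes:
        reasons.append("purpose not covered by participant consent")
    if not request.accepts_data_use_agreement:
        reasons.append("data-use agreement not accepted")
    if not request.agrees_no_reidentification:
        reasons.append("no undertaking against re-identification")
    if secure_env_required and not request.will_use_secure_environment:
        reasons.append("secure analysis environment required")
    return (not reasons, reasons)


request = AccessRequest(
    applicant="Dr A. Researcher",
    purpose="cardiovascular research",
    accepts_data_use_agreement=True,
    agrees_no_reidentification=True,
    will_use_secure_environment=True,
)
approved, reasons = assess(request, consented_purposes={"cardiovascular research"})
print(approved, reasons)
```

    Returning the reasons alongside the decision keeps the conditions transparent to the applicant, which is part of what makes the data accessible in the FAIR sense.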

    Synthetic data as a bridge

    One increasingly important technique deserves attention: synthetic data. Synthetic data is artificially generated to resemble a real dataset’s structure and statistical properties without containing any real individual’s information. Because it contains no real records, it can often be shared far more openly than the sensitive data it mirrors. Its value is practical: researchers can develop and test their analysis code against synthetic data, others can understand a dataset’s shape before applying for the real thing, and methods can be demonstrated without exposing anyone. Synthetic data is not a perfect substitute — conclusions must ultimately be drawn from real data, and a poorly generated synthetic set can mislead — but as a bridge between the need to share and the duty to protect, it is a genuinely useful addition to the toolkit.
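
    To make the idea concrete, here is a deliberately naive sketch of one way to generate synthetic rows: fit each column separately, then sample from the fitted summaries. The function names and the example rows are assumptions for illustration. Sampling columns independently discards correlations between them and offers no formal privacy guarantee, so this shows the shape of the technique rather than a production method.

```python
import random
import statistics
from collections import Counter


def fit_column(values):
    """Summarise one column: mean and spread if numeric, frequencies otherwise."""
    if all(isinstance(v, (int, float)) for v in values):
        return ("numeric", statistics.mean(values), statistics.stdev(values))
    counts = Counter(values)
    return ("categorical", list(counts.keys()), list(counts.values()))


def synthesise(rows, n, seed=0):
    """Draw n synthetic rows, sampling each column independently.

    Numeric columns are drawn from a normal distribution with the fitted
    mean and standard deviation; categorical columns are drawn in
    proportion to the observed frequencies.
    """
    rng = random.Random(seed)
    models = {c: fit_column([r[c] for r in rows]) for c in rows[0].keys()}
    out = []
    for _ in range(n):
        row = {}
        for column, model in models.items():
            if model[0] == "numeric":
                row[column] = rng.gauss(model[1], model[2])
            else:
                row[column] = rng.choices(model[1], weights=model[2], k=1)[0]
        out.append(row)
    return out


# Made-up rows standing in for a sensitive dataset.
real = [
    {"age": 34, "region": "north"},
    {"age": 51, "region": "south"},
    {"age": 47, "region": "north"},
    {"age": 29, "region": "east"},
]
synthetic = synthesise(real, n=5, seed=1)
print(len(synthetic), sorted(synthetic[0].keys()))
```

    Output of this kind is enough to test analysis code against the right column names and types, which is the practical use described above.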

    The role of secure infrastructure

    Making controlled access work at scale depends on the infrastructure that supports it: trusted repositories that hold sensitive data securely, secure analysis environments where data can be worked on without being copied out, and the identifier and metadata systems that let restricted data be described openly and cited when used. This is the territory of the data infrastructure domain, and it is what turns the principle of controlled access from an aspiration into a practical reality. Without secure places to hold the data and clear ways to describe it, the careful balance of access and protection cannot be maintained.

    A consistent vocabulary for access and protection

    For all of this to function across institutions, funders and repositories, the terms involved must mean the same thing everywhere. Access conditions, consent categories, licence terms and protection requirements have to be described consistently, or a dataset marked as controlled-access in one system will be misunderstood in another, with real consequences when the data are sensitive. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata describing how sensitive data may be accessed and reused is understood identically wherever it appears. And because curating and stewarding controlled-access data is a genuine, recognisable contribution, that work can be described using the same framework as any other: the CRediT taxonomy and its full set of contribution roles. Sensitive data is not a problem to be hidden but a resource to be governed; done well, governance is what lets research honour both openness and the people it serves.