Tag: anonymisation

  • Anonymising research data: k-anonymity, differential privacy and the re-identification risk

    Much of the most valuable research data is also the most sensitive: health records, survey responses, administrative data about individuals. Sharing it advances science, but sharing it carelessly can expose the very people it describes. The discipline that sits between these two goods, anonymisation, is more technical and more fragile than the word suggests. Done well, it allows safe reuse; done casually, it offers a false reassurance that data is protected when in fact individuals can be picked back out.

    Anonymisation is not pseudonymisation

    The first distinction is legal and practical. Pseudonymisation replaces direct identifiers, such as names, with a key or token, but the link back to the individual still exists, held separately. Under data-protection law, including the UK GDPR, pseudonymised data remains personal data, because re-identification is possible by anyone with access to the key. It is a valuable security measure, but it does not remove a record from the scope of data-protection obligations.
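
    To make the point concrete, the following is a minimal sketch of pseudonymisation with a separately held key table. The function name `pseudonymise`, the field names and the sample records are all hypothetical, chosen only for illustration; a real deployment would also need secure storage and access control for the key table.

```python
import secrets

def pseudonymise(records, id_field):
    """Replace a direct identifier with a random token.

    Returns the released rows and the key table. The key table is the
    'additional information' that must be stored separately and securely.
    """
    key_table = {}  # identifier -> token (hypothetical in-memory store)
    released = []
    for record in records:
        identifier = record[id_field]
        if identifier not in key_table:
            key_table[identifier] = secrets.token_hex(8)
        # Copy every field except the direct identifier, then add the token.
        row = {k: v for k, v in record.items() if k != id_field}
        row["token"] = key_table[identifier]
        released.append(row)
    return released, key_table

records = [
    {"name": "Person A", "age": 34},
    {"name": "Person B", "age": 51},
    {"name": "Person A", "age": 34},
]
released, key_table = pseudonymise(records, "name")

# Anyone holding the key table can reverse the mapping.
token_to_name = {token: name for name, token in key_table.items()}
reidentified = token_to_name[released[0]["token"]]
```

    The last two lines are the reason the data stays personal data: the released rows carry no names, yet whoever holds the key table can restore them.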

    True anonymisation aims to render data no longer personal at all, such that an individual cannot be identified by any party reasonably likely to try, taking account of other information that may be available. If genuinely achieved, anonymised data falls outside the scope of data-protection law. The catch is in the words “reasonably likely”: anonymisation is not a binary state achieved by deleting a name, but a judgement about residual risk in a specific context, which is why it is hard to get right and easy to overstate.

    The privacy models

    Researchers draw on a small family of formal models to reason about that residual risk.

    • k-anonymity. A dataset is k-anonymous if every record is indistinguishable from at least k − 1 others with respect to the quasi-identifiers, the attributes such as age, postcode or occupation that, in combination, could single someone out. Achieving it usually means generalising values, for example reporting an age band instead of an exact age, or suppressing rare values. k-anonymity guards against picking out a single individual, but it has a known weakness, the homogeneity problem: if all the records in a group share the same sensitive value, an attacker learns that value without needing to identify the specific person.
    • l-diversity. This extends k-anonymity to address that weakness by requiring that each group of indistinguishable records contains a diversity of sensitive values, so that membership of a group does not reveal a sensitive attribute. It is a refinement aimed squarely at the homogeneity problem that k-anonymity alone does not solve.
    • Differential privacy. A fundamentally different and more rigorous approach, differential privacy adds carefully calibrated statistical noise to results or data so that the presence or absence of any single individual changes the probability of any released result by at most a small, bounded factor. That bound is set by a privacy parameter, conventionally written ε: the smaller it is, the more noise is added and the stronger the protection. Its formal guarantee is about the mechanism, not just the output: it bounds how much can be learned about any one person regardless of what auxiliary information an attacker holds. This makes it powerful for releasing aggregate statistics, though the added noise trades some accuracy for that protection.
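
    The first two models can be checked mechanically on a small table. The sketch below is illustrative only: the function names, the sample rows and the ten-year age bands are assumptions, and `l_diversity` implements the simplest variant (counting distinct sensitive values per group), not the stricter entropy-based forms.

```python
from collections import defaultdict

def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same values on every quasi-identifier."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: the table is k-anonymous for this k."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(g) for g in groups.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any group."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in g}) for g in groups.values())

def age_band(age, width=10):
    """Generalise an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

rows = [
    {"age": 31, "postcode": "AB1", "diagnosis": "asthma"},
    {"age": 36, "postcode": "AB1", "diagnosis": "asthma"},
    {"age": 42, "postcode": "AB1", "diagnosis": "diabetes"},
    {"age": 47, "postcode": "AB1", "diagnosis": "asthma"},
]
qi = ["age", "postcode"]

k_before = k_anonymity(rows, qi)                      # every row is unique
generalised = [{**r, "age": age_band(r["age"])} for r in rows]
k_after = k_anonymity(generalised, qi)                # groups of two
l_after = l_diversity(generalised, qi, "diagnosis")   # one group is homogeneous
```

    In this toy table, generalising age raises k from 1 to 2, but the 30-39 group contains only one diagnosis, so l stays at 1: the table is 2-anonymous and still reveals the sensitive value for anyone known to be in that group.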

    These models are complementary rather than competing. k-anonymity and l-diversity reason about the structure of a released microdata table; differential privacy reasons about the process that generates released figures. Choosing among them depends on what is being shared and to whom.
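
    For the process-oriented model, a common textbook instance is the Laplace mechanism applied to a count. The sketch below assumes a privacy parameter `epsilon` (smaller means more noise and stronger protection) and hypothetical helper names. It uses only the standard library and is meant to show the idea, not to serve as a production implementation, since naive floating-point noise sampling has known weaknesses.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records that satisfy the predicate."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # seeded only to make the illustration repeatable
ages = [23, 35, 41, 58, 62, 67, 70, 74]
true_count = sum(1 for a in ages if a >= 60)
noisy_count = dp_count(ages, lambda a: a >= 60, epsilon=1.0, rng=rng)
```

    Each released figure differs from the true count by random noise whose typical size is 1/epsilon, which is the accuracy cost the text describes.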

    UKAN and the ICO code

    Formal models need to be translated into practice, and in the United Kingdom two sources do that work. The UK Anonymisation Network (UKAN) provides practical guidance, training and a structured way of thinking about anonymisation as a context-dependent risk-management activity rather than a one-off technical fix. Its framework stresses that the same data can be safe to share in one environment and unsafe in another, so decisions must consider the data, the recipients and the controls around access together.

    The Information Commissioner’s Office (ICO), the UK data-protection regulator, has likewise produced guidance on anonymisation and pseudonymisation that explains the legal status of each and what organisations must consider. The common thread is the same: anonymisation is a spectrum of risk, judged against who might reasonably try to re-identify and what else they could bring to bear, not a switch that is simply flipped.

    The re-identification risk

    The reason all this caution is warranted is that re-identification has repeatedly proved easier than data holders expected. Datasets stripped of obvious identifiers have been re-identified by linking them to other available information, because the combination of a few seemingly innocuous attributes, a date, a location, a rare characteristic, can be unique to one person. This is the linkage attack, and it is why quasi-identifiers, not just direct identifiers, must be managed. The lesson is that data does not become safe simply because the names are gone; safety depends on how unique the remaining combinations are and on what an adversary could plausibly match them against.
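
    A linkage attack of this kind can be sketched in a few lines. Everything here is invented for illustration: the function `link`, the field names, and both tiny datasets, one standing in for a released table without names and one for an auxiliary source that carries names.

```python
def link(released, auxiliary, quasi_identifiers):
    """Match released rows to named people via shared quasi-identifiers.

    Only unambiguous matches (exactly one candidate) count as re-identified.
    """
    index = {}
    for person in auxiliary:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person)
    matches = []
    for row in released:
        key = tuple(row[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], row))
    return matches

released = [
    {"birth_year": 1961, "postcode": "ZZ1", "condition": "rare disorder"},
    {"birth_year": 1985, "postcode": "ZZ2", "condition": "asthma"},
]
register = [
    {"name": "Person A", "birth_year": 1961, "postcode": "ZZ1"},
    {"name": "Person B", "birth_year": 1985, "postcode": "ZZ2"},
    {"name": "Person C", "birth_year": 1985, "postcode": "ZZ2"},
]
matches = link(released, register, ["birth_year", "postcode"])
```

    The first released row is re-identified because its combination of birth year and postcode is unique in the auxiliary source; the second is not, because two people share that combination. That is the intuition k-anonymity formalises.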

    For researchers, the practical implications are clear. Treat anonymisation as a risk assessment specific to the data and the sharing context, not a checkbox. Prefer formally grounded methods, choosing k-anonymity and l-diversity for microdata releases and differential privacy where strong, attacker-agnostic guarantees on aggregate outputs are needed. Combine technical measures with controls on who can access the data and under what terms, in the spirit of safe-environment approaches. And remember that pseudonymised data remains personal data with all the obligations that entails. Handled this way, sensitive data can be shared responsibly, supporting the reuse goals of FAIR data without trading away the privacy of the individuals whose lives the data describes. Consistent definitions, of the kind a CASRAI data dictionary promotes, help ensure that everyone in the chain means the same thing by anonymised, pseudonymised and identifiable.

  • GDPR and research data: lawful bases, consent and pseudonymisation

    An enormous amount of research depends on data about people: their health, their behaviour, their genetics, their opinions, their lives. Wherever such data identify or could identify individuals, they fall within data protection law. In the European Union that law is the General Data Protection Regulation (GDPR); in the United Kingdom it is the UK GDPR, the retained version of the regulation, together with the Data Protection Act 2018. For researchers the GDPR is sometimes experienced as a thicket of obligations. But its core ideas are coherent, and it contains specific provisions designed to enable responsible research rather than obstruct it. Understanding lawful bases, the special rules for sensitive data, the research exemptions, and the distinction between anonymisation and pseudonymisation is part of doing data-driven research properly. This article offers an orientation, drawing on the compliance and regulatory domain of the CASRAI Dictionary. It is general guidance, not legal advice.

    You need a lawful basis

    The first principle is that processing personal data is not permitted by default; it requires a lawful basis. Article 6 of the GDPR sets out the possible bases, several of which can be relevant to research. Many researchers assume the answer is always consent, but for research by public institutions a basis such as the performance of a task carried out in the public interest is often more appropriate. The choice matters because different bases carry different consequences for the rights individuals can exercise. The key point is that a researcher must be able to identify and justify the lawful basis on which they process personal data — good intentions and scientific value do not by themselves make processing lawful.

    Special category data and Article 9

    Much research data is not merely personal but sensitive — data about health, genetics, ethnicity, sexual life, religious or political beliefs, and so on. The GDPR calls these special categories and gives them extra protection under Article 9, which prohibits their processing unless a specific additional condition is met. Among those conditions are explicit consent and, importantly for research, processing necessary for scientific research purposes subject to appropriate safeguards. This means that to process sensitive data lawfully, a researcher must satisfy both a lawful basis under Article 6 and a condition under Article 9. The heightened protection reflects the heightened risk: misuse of health or genetic data can cause serious harm, and the law accordingly demands a stronger justification and stronger safeguards before such data may be used.
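
    The two-part requirement lends itself to a simple documentation check. The sketch below is an assumption-laden illustration: the `ProcessingRecord` class, its field names and the example values are hypothetical, not terms taken from the regulation or from the CASRAI Dictionary, and passing the check says nothing about whether the chosen basis is legally sound.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProcessingRecord:
    dataset: str
    lawful_basis: Optional[str]        # Article 6 basis, e.g. "public task"
    special_category: bool             # True for health, genetic data, etc.
    article9_condition: Optional[str] = None

def missing_justifications(record: ProcessingRecord) -> List[str]:
    """List which of the two required justifications is undocumented."""
    gaps = []
    if not record.lawful_basis:
        gaps.append("Article 6 lawful basis")
    if record.special_category and not record.article9_condition:
        gaps.append("Article 9 condition")
    return gaps

survey = ProcessingRecord("attitudes-survey", "public task", special_category=False)
cohort = ProcessingRecord("health-cohort", "public task", special_category=True)
documented = ProcessingRecord(
    "health-cohort", "public task", True, "scientific research with safeguards"
)

gaps_survey = missing_justifications(survey)
gaps_cohort = missing_justifications(cohort)
gaps_documented = missing_justifications(documented)
```

    Only the health cohort without a recorded Article 9 condition is flagged, mirroring the rule that special category data needs both justifications while ordinary personal data needs only the first.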

    The research provisions

    The GDPR explicitly recognises the value of research and contains provisions, centred on Article 89, intended to facilitate it while protecting individuals. These provisions allow certain flexibilities, subject to conditions. For example, data collected for one purpose may in some circumstances be further processed for scientific research without that being treated as incompatible with the original purpose, and certain individual rights may be adjusted where exercising them would seriously impair research objectives. Crucially, these provisions are not a free pass. They are conditioned on appropriate safeguards for the rights and freedoms of individuals, safeguards that the regulation specifically associates with techniques such as data minimisation and, prominently, pseudonymisation. The research exemptions, in other words, come bundled with the expectation that researchers will take concrete measures to protect the people in their data.

    Anonymisation versus pseudonymisation

    One distinction does more practical work in research than almost any other, and it is frequently misunderstood: the difference between anonymisation and pseudonymisation.

    • Anonymisation means rendering data such that individuals are no longer identifiable, by anyone, taking account of all means reasonably likely to be used. Genuinely anonymous data falls outside the scope of the GDPR altogether, because it is no longer personal data. Achieving true anonymisation is harder than it sounds, because seemingly innocuous combinations of fields can re-identify people.
    • Pseudonymisation means processing data so that it can no longer be attributed to an individual without additional information — for example, replacing names with a code, while keeping the key that links code to identity separate and secure. Pseudonymised data remains personal data and remains within the GDPR’s scope, because re-identification is still possible with the key.

    The error to avoid is treating pseudonymised data as if it were anonymous and therefore outside the law. Pseudonymisation is a valuable safeguard — indeed the GDPR commends it — but it reduces risk rather than removing the data from regulation. Knowing which one you have done determines what obligations still apply.

    Accountability and impact assessments

    The GDPR is built on accountability: it is not enough to comply, one must be able to demonstrate compliance. For research using personal data this brings practical obligations: documenting the lawful basis and Article 9 condition, being transparent with participants, applying data minimisation, and securing the data. Where processing is likely to result in a high risk to individuals, as large-scale processing of sensitive data often will, a data protection impact assessment (DPIA) is required, identifying the risks and planning mitigations before processing begins. The DPIA is not merely a form to file; it is the moment at which a team thinks systematically about how its use of personal data could affect people and how to reduce that effect.

    A consistent vocabulary for compliance

    Data protection touches institutions, funders, ethics committees and repositories alike, and for the relevant information to be handled consistently across them, the terms involved (lawful basis, consent type, special category, pseudonymised, anonymised, retention) must mean the same thing everywhere. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the compliance metadata describing how personal data may be used is understood identically wherever it appears, supporting the broader machinery of research administration. And because stewarding personal data responsibly is a genuine contribution, that work can be described within the same framework as any other: the CRediT taxonomy and its full set of contribution roles. The GDPR is not the enemy of research; properly understood, it is the framework within which research that depends on people’s data can be done in a way that keeps faith with them.