
Anonymising research data: k-anonymity, differential privacy and the re-identification risk

Sharing data about people without exposing the people themselves is one of the hardest problems in research data management. This article distinguishes anonymisation from pseudonymisation, explains the privacy models researchers actually use (k-anonymity, l-diversity and differential privacy), and introduces the practical guidance from the UK Anonymisation Network (UKAN) and the ICO’s anonymisation code. It also confronts the uncomfortable reality that re-identification is often easier than it looks.

By CASRAI Editorial Board
Published 21 Jun 2026 · Last updated 21 Jun 2026 · 5 minute read

Much of the most valuable research data is also the most sensitive: health records, survey responses, administrative data about individuals. Sharing it advances science, but sharing it carelessly can expose the very people it describes. The discipline that sits between these two goods, anonymisation, is more technical and more fragile than the word suggests. Done well, it allows safe reuse; done casually, it offers a false reassurance that data is protected when in fact individuals can be picked back out.

Anonymisation is not pseudonymisation

The first distinction is legal and practical. Pseudonymisation replaces direct identifiers, such as names, with a key or token, but the link back to the individual still exists, held separately. Under data-protection law, including the UK GDPR, pseudonymised data remains personal data, because re-identification is possible by anyone with access to the key. It is a valuable security measure, but it does not remove a record from the scope of data-protection obligations.
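
To make the distinction concrete, the following is a minimal sketch of pseudonymisation, assuming a keyed hash (HMAC-SHA256) is used to produce the tokens. The function name `pseudonymise`, the toy records and the 16-character token length are illustrative assumptions, not a prescribed method.

```python
import hashlib
import hmac
import secrets


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a stable token for a direct identifier using keyed HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; the length is an arbitrary choice


# Hypothetical secret key. In practice it would be held separately from the shared data.
key = secrets.token_bytes(32)

# Toy records invented for illustration only.
records = [
    {"name": "Alice Example", "age": 34},
    {"name": "Bob Example", "age": 51},
]

# The shared table carries a token in place of the name.
shared = [{"token": pseudonymise(r["name"], key), "age": r["age"]} for r in records]

# Anyone holding the key can recompute the token and re-link a record,
# which is why the shared table is still personal data.
relinked = pseudonymise("Alice Example", key) == shared[0]["token"]
```

The names are gone from `shared`, but the link back to each person survives for whoever holds `key`, which is the property that keeps pseudonymised data within scope of data-protection law.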

True anonymisation aims to render data no longer personal at all, such that an individual cannot be identified by any party reasonably likely to try, taking account of other information that may be available. If genuinely achieved, anonymised data falls outside the core of data-protection law. The catch is in the words “reasonably likely”: anonymisation is not a binary state achieved by deleting a name, but a judgement about residual risk in a specific context, which is why it is hard to get right and easy to overstate.

The privacy models

Researchers draw on a small family of formal models to reason about that residual risk.

  • k-anonymity. A dataset is k-anonymous if every record is indistinguishable from at least k − 1 others with respect to the quasi-identifiers: attributes such as age, postcode or occupation that, in combination, could single someone out. Achieving it usually means generalising values, for example reporting an age band instead of an exact age, or suppressing rare values. k-anonymity guards against picking out a single individual, but it has a known weakness: if all the records in a group share the same sensitive value, an attacker learns that value without needing to identify the specific person.
  • l-diversity. This extends k-anonymity to address that weakness by requiring that each group of indistinguishable records contains a diversity of sensitive values, so that membership of a group does not reveal a sensitive attribute. It is a refinement aimed squarely at the homogeneity problem that k-anonymity alone does not solve.
  • Differential privacy. A fundamentally different and more rigorous approach, differential privacy adds carefully calibrated statistical noise to results or data so that the presence or absence of any single individual makes almost no difference to what is released. Its formal guarantee is about the mechanism, not just the output: it bounds how much can be learned about any one person regardless of what auxiliary information an attacker holds. This makes it powerful for releasing aggregate statistics, though the added noise trades some accuracy for that protection.
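
The first two models can be checked mechanically on a small table. The sketch below is illustrative only: the helper names (`equivalence_classes`, `k_anonymity`, `l_diversity`, `age_band`), the toy records and the choice of quasi-identifiers are all assumptions, and `l_diversity` implements the simplest "distinct values" reading of the model.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same values on every quasi-identifier."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: the table is k-anonymous for this k."""
    return min(len(g) for g in equivalence_classes(rows, quasi_identifiers).values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any group (distinct l-diversity)."""
    groups = equivalence_classes(rows, quasi_identifiers).values()
    return min(len({r[sensitive] for r in g}) for g in groups)


def age_band(age, width=10):
    """Generalise an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Toy records invented for illustration only.
raw = [
    {"age": 31, "postcode": "AB1 2CD", "condition": "asthma"},
    {"age": 36, "postcode": "AB1 3EF", "condition": "asthma"},
    {"age": 42, "postcode": "AB1 4GH", "condition": "diabetes"},
    {"age": 47, "postcode": "AB1 5IJ", "condition": "asthma"},
]

# Generalise: exact age -> ten-year band, full postcode -> outward part only.
generalised = [
    {
        "age": age_band(r["age"]),
        "postcode": r["postcode"].split()[0],
        "condition": r["condition"],
    }
    for r in raw
]

qi = ["age", "postcode"]
k_raw = k_anonymity(raw, qi)                        # every raw record is unique
k_gen = k_anonymity(generalised, qi)                # generalisation forms groups of two
l_gen = l_diversity(generalised, qi, "condition")   # one group is all 'asthma'
```

In this toy table generalisation lifts k from 1 to 2, yet the 30-39 group contains only one condition, so anyone known to be in that group is revealed to have asthma. That is the homogeneity problem l-diversity is designed to catch.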

These models are complementary rather than competing. k-anonymity and l-diversity reason about the structure of a released microdata table; differential privacy reasons about the process that generates released figures. Choosing among them depends on what is being shared and to whom.
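
For the process-oriented side, a common textbook construction is the Laplace mechanism applied to a counting query. The sketch below assumes that construction; the function names, the privacy parameter `epsilon`, the toy ages and the fixed seed are illustrative assumptions. It is not production-grade: naive floating-point noise sampling and a fixed seed are both unsuitable for real releases.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Release a noisy count of the values that satisfy the predicate.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1 / epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy ages invented for illustration only.
ages = [23, 35, 41, 52, 67, 71, 19, 44]

# A fixed seed keeps the sketch reproducible; a real release must not use one.
rng = random.Random(0)

noisy_over_40 = dp_count(ages, lambda a: a > 40, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means a larger noise scale and therefore stronger protection at the cost of accuracy, which is the trade-off described above.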

UKAN and the ICO code

Formal models need to be translated into practice, and in the United Kingdom two sources do that work. The UK Anonymisation Network (UKAN) provides practical guidance, training and a structured way of thinking about anonymisation as a context-dependent risk-management activity rather than a one-off technical fix. Its framework stresses that the same data can be safe to share in one environment and unsafe in another, so decisions must consider the data, the recipients and the controls around access together.

The Information Commissioner’s Office (ICO), the UK data-protection regulator, has likewise produced guidance on anonymisation and pseudonymisation that explains the legal status of each and what organisations must consider. The through-line of both is the same: anonymisation is a spectrum of risk, judged against who might reasonably try to re-identify and what else they could bring to bear, not a switch that is simply flipped.

The re-identification risk

The reason all this caution is warranted is that re-identification has repeatedly proved easier than data holders expected. Datasets stripped of obvious identifiers have been re-identified by linking them to other available information, because the combination of a few seemingly innocuous attributes, such as a date, a location or a rare characteristic, can be unique to one person. This is the linkage attack, and it is why quasi-identifiers, not just direct identifiers, must be managed. The lesson is that data does not become safe simply because the names are gone; safety depends on how unique the remaining combinations are and on what an adversary could plausibly match them against.
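
A linkage attack can be sketched in a few lines. The example below is a toy illustration: the released table, the auxiliary register, the field names and the helpers `unique_fraction` and `link` are all hypothetical, and a real attacker would typically use fuzzier matching than exact equality.

```python
from collections import Counter


def unique_fraction(rows, quasi_identifiers):
    """Share of rows whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def link(released, auxiliary, quasi_identifiers, sensitive):
    """Pair each named auxiliary record with a released row when the match is unique."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    matches = []
    for person in auxiliary:
        candidates = [r for r in released if key(r) == key(person)]
        if len(candidates) == 1:  # an unambiguous match re-identifies the row
            matches.append((person["name"], candidates[0][sensitive]))
    return matches


# "De-identified" release: names removed, quasi-identifiers kept (toy data).
released = [
    {"birth_year": 1980, "area": "North", "sex": "F", "diagnosis": "A"},
    {"birth_year": 1980, "area": "North", "sex": "M", "diagnosis": "B"},
    {"birth_year": 1975, "area": "South", "sex": "F", "diagnosis": "C"},
    {"birth_year": 1975, "area": "South", "sex": "F", "diagnosis": "D"},
]

# Hypothetical auxiliary source that carries names, such as a public register.
auxiliary = [
    {"name": "Person 1", "birth_year": 1980, "area": "North", "sex": "F"},
    {"name": "Person 2", "birth_year": 1975, "area": "South", "sex": "F"},
]

qi = ["birth_year", "area", "sex"]
at_risk = unique_fraction(released, qi)
reidentified = link(released, auxiliary, qi, "diagnosis")
```

Here half of the released rows are unique on three ordinary attributes, and "Person 1" is tied to a diagnosis without any name appearing in the release. "Person 2" escapes only because another record shares the same combination, which is the protection k-anonymity tries to guarantee for everyone.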

For researchers, the practical implications are clear. Treat anonymisation as a risk assessment specific to the data and the sharing context, not a checkbox. Prefer formally grounded methods, choosing k-anonymity and l-diversity for microdata releases and differential privacy where strong, attacker-agnostic guarantees on aggregate outputs are needed. Combine technical measures with controls on who can access the data and under what terms, in the spirit of safe-environment approaches. And remember that pseudonymised data remains personal data, with all the obligations that entails. Handled this way, sensitive data can be shared responsibly, supporting the reuse goals of FAIR data without trading away the privacy of the individuals whose lives the data describes. Consistent definitions, of the kind a CASRAI data dictionary promotes, help ensure that everyone in the chain means the same thing by “anonymised”, “pseudonymised” and “identifiable”.
