Editorial · CASRAI · Research data infrastructure

Anonymising research data: k-anonymity, differential privacy and the re-identification risk

Sharing data about people without exposing the people themselves is one of the hardest problems in research data management. This article distinguishes anonymisation from pseudonymisation, explains the privacy models researchers actually use (k-anonymity, l-diversity and differential privacy), and introduces the practical guidance from the UK Anonymisation Network (UKAN) and the ICO’s anonymisation code. It also confronts the uncomfortable reality that re-identification is often easier than it looks.

By CASRAI Editorial Board

Published 21 Jun 2026 · Last updated 21 Jun 2026 · 5 minute read

Much of the most valuable research data is also the most sensitive: health records, survey responses, administrative data about individuals. Sharing it advances science, but sharing it carelessly can expose the very people it describes. The discipline that sits between these two goods, anonymisation, is more technical and more fragile than the word suggests. Done well, it allows safe reuse; done casually, it offers a false reassurance that data is protected when in fact individuals can be picked back out.

Anonymisation is not pseudonymisation

The first distinction is legal and practical. Pseudonymisation replaces direct identifiers, such as names, with a key or token, but the link back to the individual still exists, held separately. Under data-protection law, including the UK GDPR, pseudonymised data remains personal data, because re-identification is possible by anyone with access to the key. It is a valuable security measure, but it does not remove a record from the scope of data-protection obligations.
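To make the point concrete, here is a minimal sketch of pseudonymisation using a keyed hash. The choice of HMAC-SHA256, the key value, the field names and the helper functions are all illustrative assumptions, not a prescribed method. What it shows is that whoever holds the key can recompute a token and re-link a record to a named person.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"example-key-held-by-the-data-controller"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# Invented example records.
records = [
    {"name": "Alice Example", "diagnosis": "asthma"},
    {"name": "Bob Example", "diagnosis": "diabetes"},
]

# The shared dataset carries tokens instead of names.
shared = [
    {"token": pseudonymise(r["name"]), "diagnosis": r["diagnosis"]}
    for r in records
]


def relink(name: str, dataset: list[dict]) -> list[dict]:
    """Anyone with the key can recompute the token for a known name and find the rows."""
    token = pseudonymise(name)
    return [row for row in dataset if row["token"] == token]


print(relink("Alice Example", shared))
```

The names are gone from the shared table, yet the link is only hidden, not destroyed, which is why such data is still treated as personal data.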

True anonymisation aims to render data no longer personal at all, such that an individual cannot be identified by any party reasonably likely to try, taking account of other information that may be available. If genuinely achieved, anonymised data falls outside the scope of data-protection law. The catch is in the words "reasonably likely": anonymisation is not a binary state achieved by deleting a name, but a judgement about residual risk in a specific context, which is why it is hard to get right and easy to overstate.

The privacy models

Researchers draw on a small family of formal models to reason about that residual risk.

k-anonymity. A dataset is k-anonymous if every record is indistinguishable from at least k minus one others with respect to the quasi-identifiers, the attributes such as age, postcode or occupation that, in combination, could single someone out. Achieving it usually means generalising values, for example reporting an age band instead of an exact age, or suppressing rare values. k-anonymity guards against picking out a single individual, but it has a known weakness: if all the records in a group share the same sensitive value, an attacker learns that value without needing to identify the specific person.
l-diversity. This extends k-anonymity to address that weakness by requiring that each group of indistinguishable records contains a diversity of sensitive values, so that membership of a group does not reveal a sensitive attribute. It is a refinement aimed squarely at the homogeneity problem that k-anonymity alone does not solve.
Differential privacy. A fundamentally different and more rigorous approach, differential privacy adds carefully calibrated statistical noise to results or data so that the presence or absence of any single individual makes almost no difference to what is released. Its formal guarantee is about the mechanism, not just the output: it bounds how much can be learned about any one person regardless of what auxiliary information an attacker holds. The strength of that bound is set by a privacy parameter, conventionally written ε (epsilon), where smaller values mean more noise and stronger protection. This makes it powerful for releasing aggregate statistics, though the added noise trades some accuracy for that protection.
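The first two models can be checked mechanically on a small table. The sketch below is illustrative only: the column names, the ten-year age bands, the invented records and the use of "distinct" l-diversity (counting different sensitive values per group) are assumptions made for the example. It generalises exact ages into bands, then reports the smallest group size (k) and the smallest number of distinct sensitive values in any group (l).

```python
from collections import defaultdict


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers) -> int:
    """Size of the smallest group: the table is k-anonymous for this k."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(g) for g in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive) -> int:
    """Fewest distinct sensitive values found in any group (distinct l-diversity)."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in g}) for g in groups.values())


# Invented records; 'AB1' is a placeholder postcode district.
raw = [
    {"age": 31, "postcode": "AB1", "condition": "asthma"},
    {"age": 34, "postcode": "AB1", "condition": "asthma"},
    {"age": 38, "postcode": "AB1", "condition": "asthma"},
    {"age": 42, "postcode": "AB1", "condition": "diabetes"},
    {"age": 45, "postcode": "AB1", "condition": "asthma"},
    {"age": 47, "postcode": "AB1", "condition": "eczema"},
]
QI = ["age", "postcode"]

# Replace each exact age with its band, leaving the other columns unchanged.
generalised = [{**r, "age": age_band(r["age"])} for r in raw]

print(k_anonymity(raw, QI))                       # every exact age is unique
print(k_anonymity(generalised, QI))               # bands merge records into groups
print(l_diversity(generalised, QI, "condition"))  # one group is homogeneous
```

In this toy table, generalisation lifts k from 1 to 3, yet l stays at 1 because everyone in the 30-39 band has the same condition. That is the homogeneity weakness l-diversity is designed to catch.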

These models are complementary rather than competing. k-anonymity and l-diversity reason about the structure of a released microdata table; differential privacy reasons about the process that generates released figures. Choosing among them depends on what is being shared and to whom.
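Reasoning about the process rather than the table can be illustrated with the Laplace mechanism, a standard way of making a count query differentially private. This is a sketch under stated assumptions: the privacy parameter `epsilon`, the function names and the sample ages are introduced here for illustration. The noise is drawn with the standard library purely to keep the example self-contained; a real release would normally rely on a vetted differential privacy library rather than hand-rolled floating-point sampling.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 52, 67, 29, 38, 44]  # invented sample

# How many people are aged 40 or over? The true answer is 4; the release is noisy.
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(noisy)
```

A smaller `epsilon` widens the noise and strengthens the guarantee, while a larger one gives more accurate but less protected figures, which is the accuracy trade-off described above.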

UKAN and the ICO code

Formal models need to be translated into practice, and in the United Kingdom two sources do that work. The UK Anonymisation Network (UKAN) provides practical guidance, training and a structured way of thinking about anonymisation as a context-dependent risk-management activity rather than a one-off technical fix. Its framework stresses that the same data can be safe to share in one environment and unsafe in another, so decisions must consider the data, the recipients and the controls around access together.

The Information Commissioner’s Office (ICO), the UK data-protection regulator, has likewise produced guidance on anonymisation and pseudonymisation that explains the legal status of each and what organisations must consider. The throughline of both is the same: anonymisation is a spectrum of risk, judged against who might reasonably try to re-identify and what else they could bring to bear, not a switch that is simply flipped to off.

The re-identification risk

The reason all this caution is warranted is that re-identification has repeatedly proved easier than data holders expected. Datasets stripped of obvious identifiers have been re-identified by linking them to other available information, because the combination of a few seemingly innocuous attributes, such as a date, a location or a rare characteristic, can be unique to one person. This is the linkage attack, and it is why quasi-identifiers, not just direct identifiers, must be managed. The lesson is that data does not become safe simply because the names are gone; safety depends on how unique the remaining combinations are and on what an adversary could plausibly match them against.
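A linkage attack needs nothing more sophisticated than a join. The sketch below uses two invented tables, a "de-identified" release and a hypothetical named register, with column names assumed for the example. It matches them on shared quasi-identifiers and treats a person as re-identified only when the combination is unique on both sides.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("birth_year", "postcode", "sex")

# Invented "de-identified" release: names removed, quasi-identifiers kept.
released = [
    {"birth_year": 1980, "postcode": "AB1", "sex": "F", "condition": "asthma"},
    {"birth_year": 1980, "postcode": "AB1", "sex": "M", "condition": "diabetes"},
    {"birth_year": 1975, "postcode": "CD2", "sex": "F", "condition": "eczema"},
    {"birth_year": 1975, "postcode": "CD2", "sex": "F", "condition": "asthma"},
]

# Hypothetical auxiliary source that carries names alongside the same attributes.
public_register = [
    {"name": "Person A", "birth_year": 1980, "postcode": "AB1", "sex": "F"},
    {"name": "Person B", "birth_year": 1980, "postcode": "AB1", "sex": "M"},
    {"name": "Person C", "birth_year": 1975, "postcode": "CD2", "sex": "F"},
]


def qi_key(row):
    """The tuple of quasi-identifier values used to match the two tables."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(released_rows, register_rows):
    """Map each name to a sensitive value where the match is unique on both sides."""
    released_counts = Counter(qi_key(r) for r in released_rows)
    register_counts = Counter(qi_key(p) for p in register_rows)
    unique_released = {
        qi_key(r): r for r in released_rows if released_counts[qi_key(r)] == 1
    }
    return {
        p["name"]: unique_released[qi_key(p)]["condition"]
        for p in register_rows
        if register_counts[qi_key(p)] == 1 and qi_key(p) in unique_released
    }


reidentified = link(released, public_register)
print(reidentified)
```

Persons A and B are picked out because their attribute combinations are unique in the release, while Person C is shielded only because two released records share the same combination. That is the protection that managing quasi-identifiers is meant to provide.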

For researchers, the practical implications are clear. Treat anonymisation as a risk assessment specific to the data and the sharing context, not a checkbox. Prefer formally grounded methods, choosing k-anonymity and l-diversity for microdata releases and differential privacy where strong, attacker-agnostic guarantees on aggregate outputs are needed. Combine technical measures with controls on who can access the data and under what terms, in the spirit of safe-environment approaches. And remember that pseudonymised data remains personal data, with all the obligations that entails. Handled this way, sensitive data can be shared responsibly, supporting the reuse goals of FAIR data without trading away the privacy of the individuals whose lives the data describes. Consistent definitions, of the kind a CASRAI data dictionary promotes, help ensure that everyone in the chain means the same thing by "anonymised", "pseudonymised" and "identifiable".
