Explainer · Plain-language

What is data anonymisation?

Data anonymisation is the process of removing or altering information so that individuals can no longer be identified from a dataset, with the result that the data is no longer personal data. Done effectively, it takes the data outside the scope of data-protection law such as the UK GDPR. It is distinct from pseudonymisation, which replaces identifiers with pseudonyms that can be reversed using separately held additional information (a key), and so leaves the data personal. Genuine anonymisation is difficult to achieve because of the risk of re-identification.

CASRAI plain-language explainers — clear answers to recurring research-administration questions

The short answer

Anonymisation transforms personal data so that individuals can no longer be identified, directly or indirectly, after which the data falls outside the UK GDPR. It involves removing or altering both direct identifiers (such as names and identification numbers) and indirect identifiers (such as combinations of postcode, date of birth, and occupation that could single someone out). Techniques include aggregation, generalisation, suppression, and approaches such as k-anonymity and l-diversity. It differs from pseudonymisation, which is reversible with additional information and remains personal data. The UK Information Commissioner's Office (ICO) issues guidance, and the UK Anonymisation Network (UKAN) offers practical frameworks; re-identification risk must always be assessed.


Direct and indirect identifiers

Identifying information falls into two broad categories. Direct identifiers single out an individual on their own — names, addresses, identification numbers, email addresses. Indirect (or quasi-) identifiers do not identify a person alone but can do so in combination, such as the joint pattern of postcode, date of birth, and sex. Effective anonymisation must address both. Removing names is rarely enough, because the remaining indirect identifiers can be linked with other available data to re-identify individuals. Assessing which combinations of variables create identification risk is central to the task.
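
A minimal sketch of that assessment, assuming a toy dataset from which names have already been removed. The records, the postcodes, and the `unique_on` helper are hypothetical and only illustrate the idea: no single column singles anyone out, but the combination does.

```python
from collections import Counter

# Hypothetical records: direct identifiers removed, quasi-identifiers retained.
records = [
    {"postcode": "AB1 2CD", "dob": "1980-03-14", "sex": "F"},
    {"postcode": "AB1 2CD", "dob": "1975-11-02", "sex": "M"},
    {"postcode": "EF3 4GH", "dob": "1980-03-14", "sex": "M"},
    {"postcode": "EF3 4GH", "dob": "1975-11-02", "sex": "F"},
    {"postcode": "AB1 2CD", "dob": "1980-03-14", "sex": "F"},
]

def unique_on(rows, columns):
    """Count rows whose combination of values in `columns` occurs exactly once."""
    combos = Counter(tuple(row[c] for c in columns) for row in rows)
    return sum(1 for row in rows if combos[tuple(row[c] for c in columns)] == 1)

# Each column on its own leaves nobody unique...
for column in ("postcode", "dob", "sex"):
    print(column, unique_on(records, [column]))

# ...but the joint pattern singles out three of the five records.
print("combined", unique_on(records, ["postcode", "dob", "sex"]))
```

A count of unique records is only a rough first indicator. It says nothing about what other data an intruder could link to, which still has to be judged in context.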

Pseudonymisation versus anonymisation under UK GDPR

Under the UK GDPR, pseudonymisation means processing personal data so that it can no longer be attributed to a specific person without additional information, such as a key, that is kept separately (Article 4(5)). Crucially, pseudonymised data is still personal data and remains within the scope of the law, because anyone holding the additional information can re-identify individuals. Anonymisation, by contrast, aims to ensure that individuals are no longer identifiable by any means reasonably likely to be used, so the data is no longer personal data and the UK GDPR no longer applies (a position reflected in Recital 26). Pseudonymisation is therefore best understood as a security and risk-reduction measure, not a route to exemption.
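
To illustrate why pseudonymised data stays personal data, here is a minimal sketch of one possible approach: replacing an identifier with a random token and holding the token-to-identifier mapping separately. The function names, field names, and sample rows are hypothetical, and real schemes may use other methods such as keyed hashing.

```python
import secrets

def pseudonymise(rows, id_field):
    """Replace `id_field` with a random token.

    Returns the pseudonymised rows and the key table (token -> original).
    The key table is the "additional information" that must be kept separately.
    """
    key_table = {}  # token -> original identifier
    tokens = {}     # original identifier -> token, so repeats map consistently
    out = []
    for row in rows:
        original = row[id_field]
        if original not in tokens:
            token = secrets.token_hex(8)
            tokens[original] = token
            key_table[token] = original
        out.append({**row, id_field: tokens[original]})
    return out, key_table

def reidentify(rows, id_field, key_table):
    """Reverse the pseudonymisation; possible for anyone who holds the key table."""
    return [{**row, id_field: key_table[row[id_field]]} for row in rows]

# Hypothetical sample data.
survey = [
    {"participant": "alice@example.org", "score": 7},
    {"participant": "bob@example.org", "score": 4},
    {"participant": "alice@example.org", "score": 9},
]

pseudonymised, key_table = pseudonymise(survey, "participant")
restored = reidentify(pseudonymised, "participant", key_table)
print(restored == survey)  # the key holder recovers the original data
```

Because the mapping can be reversed in this way, the pseudonymised rows remain attributable to a person for whoever holds the key table. Note also that the other fields are untouched, so any quasi-identifiers among them could still support re-identification without the key.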

Techniques: k-anonymity and l-diversity

A range of techniques is used to reduce identification risk. Aggregation reports only group-level figures; generalisation replaces precise values with broader ranges (an exact age becomes an age band); suppression removes particularly risky values. k-anonymity is a formal model in which each record is indistinguishable from at least k−1 others on its quasi-identifiers, so no individual stands out. l-diversity extends this by requiring each such group of indistinguishable records to contain at least l different values of the sensitive attribute, guarding against the case where everyone in a group shares the same sensitive value and that value is therefore disclosed without anyone being singled out. These models help structure decisions but do not eliminate residual risk.
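
The following sketch shows generalisation, k-anonymity, and l-diversity on a toy dataset. The records, column names, and helper functions are hypothetical, and the l-diversity check uses the simplest "distinct values" variant, which is an assumption of this example and not the only formulation.

```python
from collections import defaultdict

def age_band(age, width=10):
    """Generalise an exact age into a band, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Group rows that share the same values on all quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: the dataset is k-anonymous for this k."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(g) for g in groups.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any group (distinct l-diversity)."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in g}) for g in groups.values())

# Hypothetical records.
raw = [
    {"age": 23, "district": "North", "condition": "asthma"},
    {"age": 27, "district": "North", "condition": "diabetes"},
    {"age": 34, "district": "South", "condition": "asthma"},
    {"age": 38, "district": "South", "condition": "asthma"},
]
generalised = [{**r, "age": age_band(r["age"])} for r in raw]

quasi = ["age", "district"]
print(k_anonymity(raw, quasi))                       # 1: every record is unique
print(k_anonymity(generalised, quasi))               # 2: each record matches one other
print(l_diversity(generalised, quasi, "condition"))  # 1: one group is all 'asthma'
```

Here generalising age into ten-year bands raises k from 1 to 2, yet l stays at 1 because both people in the 30-39 South group share the same condition. Anyone known to be in that group has their condition revealed, which is the gap l-diversity is meant to expose.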

Guidance, UKAN, and re-identification risk

In the UK, the Information Commissioner's Office (ICO) provides guidance on anonymisation and pseudonymisation, historically through its anonymisation code of practice and subsequent guidance. The UK Anonymisation Network (UKAN) offers practical frameworks, notably a functional, context-aware approach to assessing and managing re-identification risk. A recurring theme is that anonymisation is not a one-off technical fix but a judgement about risk in context: how the data will be released, what other data exists, and who might try to re-identify individuals. Because new data and methods can raise re-identification risk over time, anonymisation should be assessed against the realistic threat environment rather than treated as permanent and absolute.

Key facts

At a glance

Definition: altering data so individuals can no longer be identified
Legal effect: anonymised data is no longer personal data (outside UK GDPR)
Identifiers: must address both direct and indirect (quasi-) identifiers
Versus pseudonymisation: pseudonymisation is reversible with the key, so pseudonymised data is still personal data
Techniques: aggregation, generalisation, suppression, k-anonymity, l-diversity
UK guidance: ICO anonymisation guidance; frameworks from UKAN

Common misconceptions

What people often get wrong

Often heard: Pseudonymisation and anonymisation are the same thing.

Actually: No. Pseudonymisation replaces identifiers with pseudonyms and can be reversed using a separately held key, so the data remains personal data under the UK GDPR. Anonymisation aims to make individuals no longer identifiable, taking the data outside the law.

Often heard: Removing names anonymises a dataset.

Actually: No — indirect identifiers such as combinations of postcode, date of birth, and occupation can re-identify individuals. Effective anonymisation must address these too.

Often heard: Once anonymised, data carries no re-identification risk ever again.

Actually: No — re-identification risk depends on context and can rise as new data and methods emerge, so it must be assessed against the realistic threat environment rather than assumed permanent.

Going deeper