Synthetic data privacy is achievable for FAIR-compliant sharing only when data generation is paired with a secure data environment and a formal statistical guarantee such as differential privacy. Synthetic records alone do not satisfy the General Data Protection Regulation’s anonymisation test, because generative models can retain traces of the real data they were trained on.
Synthetic data is artificial information produced by a model trained on a real dataset, engineered to reproduce that dataset’s statistical structure without containing any single individual’s actual record.
Institutions holding clinical trial records, patient registries or HR data face a genuine conflict: FAIR principles push toward accessible, reusable outputs, while the GDPR pushes toward the narrowest possible disclosure of personal data. Synthetic data is often marketed as the technology that resolves this tension. Recent regulatory and research literature says it narrows the gap but does not close it alone.
- What is synthetic data, and how does it map to the FAIR principles?
- Does synthetic data satisfy GDPR’s anonymisation standard?
- How do secure data environments complement synthetic data?
- What does differential privacy add to synthetic data pipelines?
- Frequently asked questions
- What should institutions do next?
What is synthetic data, and how does it map to the FAIR principles?
Synthetic data can advance all four FAIR data principles set out by Wilkinson et al. in the 2016 FAIR Guiding Principles paper, but unevenly. It strengthens Findability and Accessibility fastest, since a synthetic proxy can be indexed and downloaded with far fewer legal barriers than the source. Interoperability and Reusability depend more on how faithfully the generation model preserves structure.
| FAIR principle | What synthetic data contributes |
|---|---|
| Findable | A citable, publicly indexable surrogate dataset with rich metadata, while the source stays access-controlled |
| Accessible | Open or low-barrier download, removing the need for a data access committee for exploratory work |
| Interoperable | Same schema and controlled vocabularies as the source, so pipelines and tools can be built and tested in advance |
| Reusable | Supports method development, teaching and model training without repeated re-applications for the real data |
The catch is quality drift. A synthetic dataset that has been aggressively de-identified to reduce re-identification risk typically loses the rare-event structure that made the original data valuable, which undermines Reusability even as it improves Accessibility.
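One way to make that trade-off visible is to compare a column's distribution in the real and synthetic data and list the rare categories that have disappeared. The sketch below is illustrative only: the function names, the 5% rarity threshold and the toy data are assumptions introduced here, and it covers a single categorical column rather than the joint structure a full utility evaluation would examine.

```python
from collections import Counter


def marginal_tvd(real_values, synthetic_values):
    """Total variation distance between two categorical marginals.

    0.0 means identical category frequencies, 1.0 means no overlap.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both inputs must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


def lost_rare_categories(real_values, synthetic_values, rare_threshold=0.05):
    """Categories that are rare in the real data and absent from the synthetic data."""
    real_counts = Counter(real_values)
    n_real = len(real_values)
    present_in_synthetic = set(synthetic_values)
    return sorted(
        c
        for c, k in real_counts.items()
        if k / n_real <= rare_threshold and c not in present_in_synthetic
    )


# Hypothetical toy column: two rare categories in the source, one lost in generation.
real = ["common"] * 95 + ["rare_a"] * 3 + ["rare_b"] * 2
synthetic = ["common"] * 98 + ["rare_a"] * 2

print(round(marginal_tvd(real, synthetic), 3))
print(lost_rare_categories(real, synthetic))
```

A small overall distance can coexist with a lost rare category, which is why the two checks are reported separately.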
Does synthetic data satisfy GDPR’s anonymisation standard?
Not automatically. Under GDPR Recital 26, data is anonymous, and therefore outside the regulation’s scope, only if the data subject is no longer identifiable once account is taken of “all the means reasonably likely to be used” to identify them. Generative models can memorise unusual or rare records from their training data, and those traces can resurface in synthetic outputs.
The European Data Protection Supervisor’s TechSonar assessment states plainly that synthetic data is not per se anonymous and can still reflect biases or leak information from the source. The UK Information Commissioner’s Office reaches a parallel conclusion in its anonymisation, pseudonymisation and privacy-enhancing technologies guidance: generating a synthetic dataset from personal data is itself processing, requiring a lawful basis and an assessment of residual identifiability — it does not become anonymous by construction. Most synthetic datasets sit closer to pseudonymised data than to true anonymisation, keeping them inside GDPR’s scope rather than exempting them from it.
- The generation step itself is processing of personal data and needs a lawful basis (typically legitimate interests or public task, plus a research-specific condition where special category data such as health records is involved).
- Rare or unique combinations of attributes in the source data are the most common source of residual re-identification risk in the output.
- A documented disclosure risk assessment — not vendor assurance — is what a regulator or ethics committee will expect to see before publication.
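One ingredient of such an assessment can be sketched in a few lines: find attribute combinations shared by very few source records, then flag synthetic rows that reproduce them. This is a screening heuristic under stated assumptions, not a complete disclosure-risk method. The column names, the threshold `k=5` and the toy records are hypothetical, and a match does not prove a leak any more than the absence of a match proves safety.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in combos.items() if n < k}


def synthetic_matches_to_rare_real(real, synthetic, quasi_identifiers, k=5):
    """Synthetic rows whose quasi-identifier combination is rare (< k) in the real data."""
    rare = small_cells(real, quasi_identifiers, k)
    return [
        s for s in synthetic
        if tuple(s[q] for q in quasi_identifiers) in rare
    ]


# Hypothetical quasi-identifiers and toy records.
QI = ["age_band", "postcode_district", "diagnosis"]
real = (
    [{"age_band": "40-49", "postcode_district": "LS1", "diagnosis": "asthma"}] * 6
    + [{"age_band": "90+", "postcode_district": "LS1", "diagnosis": "rare_disease"}]
)
synthetic = [
    {"age_band": "40-49", "postcode_district": "LS1", "diagnosis": "asthma"},
    {"age_band": "90+", "postcode_district": "LS1", "diagnosis": "rare_disease"},
]

flagged = synthetic_matches_to_rare_real(real, synthetic, QI)
print(flagged)
```

Flagged rows are candidates for manual review or regeneration before release, and the counts themselves belong in the documented assessment.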
How do secure data environments complement synthetic data?
A secure data environment (SDE), also called a trusted research environment (TRE), keeps sensitive data in place and lets approved researchers run analysis against it remotely, with only vetted outputs allowed to leave. This is the model the Goldacre Review — commissioned by the Department of Health and Social Care and published in April 2022 as “Better, Broader, Safer: using health data for research and analysis” — recommended as the default access route for NHS data instead of distributing dataset copies. NHS England’s subsequent Secure Data Environment policy formalised this, requiring health and social care data for research to be accessed through approved SDEs rather than by dissemination.
Synthetic data and SDEs are complementary, not competing, tiers of the same access model. A well-designed pipeline uses openly released synthetic data for code development and hypothesis-generation, then reserves the real data — accessed inside the SDE — for the analysis that actually informs a publication or policy decision. Two UK examples show this pattern already in production:
- Simulacrum, built by the National Cancer Registration and Analysis Service and now maintained via Health Data Insight, is a synthetic cancer-registry dataset that lets researchers write and test analysis code before requesting access to the real registry data inside a TRE.
- OpenSAFELY issues researchers with dummy datasets that mirror the structure of NHS primary-care records, so code is fully written and reviewed before it ever runs against real patient data inside the secure environment.
This tiering addresses the FAIR-versus-GDPR conflict most directly for the “Accessible” and “Reusable” principles: the synthetic layer can be released openly once its disclosure risk has been assessed, while the sensitive layer never leaves controlled infrastructure.
| Mechanism | Where the sensitive data sits | GDPR status | Best FAIR fit |
|---|---|---|---|
| Open synthetic release | Never leaves the generation pipeline | Requires disclosure-risk assessment; rarely fully anonymous | Findable, Accessible |
| Secure/trusted data environment | Stays on controlled infrastructure at all times | Personal data processed under strict access controls and a lawful basis | Interoperable, Reusable |
| Differentially private release | Leaves as noised aggregates or a noised synthetic model | Stronger anonymisation argument, quantifiable via the privacy budget | Accessible, Reusable |
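The tiered workflow described in this section can be illustrated with a minimal sketch: the same analysis function is developed against synthetic records, and an output check suppresses small counts before anything leaves the environment. The function names, the threshold of 10 and the toy records are assumptions made for illustration. Real SDEs apply their own output-checking rules, usually with human review, and none of this reflects the actual Simulacrum or OpenSAFELY tooling.

```python
from collections import Counter

# Hypothetical small-number threshold; each environment sets its own rule.
SUPPRESSION_THRESHOLD = 10


def count_by_group(records, group_key):
    """The analysis itself: identical code runs on synthetic and on real data."""
    return dict(Counter(r[group_key] for r in records))


def vet_output(counts, threshold=SUPPRESSION_THRESHOLD):
    """Output check: replace counts below the threshold with None before release."""
    return {
        group: (n if n >= threshold else None)
        for group, n in counts.items()
    }


# Tier 1: write and test the code against openly released synthetic records.
synthetic_records = [{"stage": "I"}] * 12 + [{"stage": "IV"}] * 3
draft = count_by_group(synthetic_records, "stage")

# Tier 2 (inside the SDE) would call count_by_group on the real records,
# and only the vetted result would be allowed to leave.
released = vet_output(draft)
print(released)
```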
What does differential privacy add to synthetic data pipelines?
Differential privacy adds a mathematical guarantee that the output of a computation would be almost equally likely whether or not any single individual’s record was in the input, with the permitted difference set by a privacy budget parameter (epsilon). A smaller epsilon gives a stronger guarantee but degrades statistical utility, so the choice of epsilon is a governance decision, not just a technical one. The US National Institute of Standards and Technology’s guidelines for evaluating differential privacy guarantees (SP 800-226) set out how organisations should document and justify that choice rather than treat it as a default setting.
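Stated formally, using the standard textbook definition rather than anything specific to the sources cited here: a randomised mechanism M is ε-differentially private if, for every pair of datasets D and D′ that differ in one record and for every set S of possible outputs, the following holds. The symbols M, D, D′ and S are notation introduced for this statement.

```latex
\Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr]
```

The factor e^ε shows why the parameter matters for governance. At ε = 0.1 it is about 1.105, so including or excluding one person can change the probability of any outcome by roughly 10% at most. At ε = 1 it is about 2.718, and at ε = 5 about 148, a far looser bound.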
Applied to synthetic data generation — for example through differentially private training of the generative model — this converts a vague “we anonymised it” claim into an auditable parameter that a data protection officer or ethics committee can evaluate. That auditability is what most synthetic-data-and-GDPR commentary skips, and it is the biggest lever institutions have for turning synthetic data into a defensible compliance position rather than a marketing claim.
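A deliberately simple illustration of a differentially private generator is a noised histogram: count each category, add Laplace noise scaled to 1/epsilon, then sample synthetic values from the noisy counts. This is a sketch of one basic technique, not the differentially private model training mentioned above, and the function names and toy data are assumptions. It also assumes that the category domain is fixed in advance from public knowledge rather than read from the data, and that neighbouring datasets differ by adding or removing one record, which gives the histogram a sensitivity of 1. Naive floating-point noise like this is known to be unsuitable for production; a vetted library should be used for real releases.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram_synthesiser(values, domain, epsilon, n_synthetic, rng):
    """Sample synthetic categorical values from a Laplace-noised histogram.

    `domain` must be chosen independently of the data (assumption).
    Values outside the domain are ignored.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    domain = list(domain)
    counts = Counter(values)
    scale = 1.0 / epsilon  # sensitivity 1 under add/remove-one-record neighbours
    # Clamping and sampling are post-processing and spend no extra budget.
    noisy = [max(counts.get(c, 0) + laplace_noise(scale, rng), 0.0) for c in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform if all counts clamp to zero
    return rng.choices(domain, weights=noisy, k=n_synthetic)


# Hypothetical toy column and a seeded generator for repeatable demonstration.
rng = random.Random(0)
source = ["A"] * 80 + ["B"] * 18 + ["C"] * 2
synthetic = dp_histogram_synthesiser(source, ["A", "B", "C", "D"], 1.0, 100, rng)
print(Counter(synthetic))
```

The epsilon passed to the function is the auditable parameter: it can be written into the release documentation and challenged by a reviewer, which an unqualified claim of anonymisation cannot.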
Frequently asked questions
Is synthetic data a privacy risk?
Yes. Synthetic data is not automatically private: generative models can memorise rare records from the source, and re-identification remains possible through linkage with other datasets. The Royal Society’s 2024 review of synthetic data found that privacy cannot be verified by comparison with real data alone, so every release needs a documented risk assessment.
What is synthetic personal data?
Synthetic personal data is artificial data generated by a model trained on real personal records, reproducing statistical patterns without a direct link to any individual. Under GDPR Recital 26, it counts as anonymous only if no individual can be identified by means reasonably likely to be used; otherwise it remains personal data, treated much like pseudonymised data, and is subject to full GDPR obligations.
How does synthetic data protect privacy?
It protects privacy by replacing real records with generated ones that preserve aggregate statistical properties while breaking the direct record-to-person link. Adding differential privacy noise during generation gives a mathematical bound on how much any individual’s data could have influenced the output, strengthening the guarantee beyond generation alone.
What are examples of synthetic data in research?
UK examples include Simulacrum, NHS England’s synthetic cancer-registry dataset built from National Cancer Registration and Analysis Service records, and OpenSAFELY’s dummy datasets, which let researchers write analysis code before running it inside a secure data environment against the real data.
What should institutions do next?
Research offices, data custodians and publishers should stop treating “synthetic” as a synonym for “anonymous” in data management plans. A defensible strategy states explicitly which tier (open synthetic, SDE-mediated, or differentially private release) applies to which output, references a documented disclosure-risk assessment for any synthetic release, and, where a formal guarantee is used, records the epsilon value and its justification. Research data governance frameworks increasingly expect this specificity rather than a blanket “anonymised” claim.
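One way to capture those three items in a machine-checkable form is a small record per output. The sketch below is a hypothetical structure, not a standard or a funder template: the class names, field names and validation rules are assumptions, including the conservative choice to require a disclosure-risk assessment for differentially private releases as well as open synthetic ones.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    OPEN_SYNTHETIC = "open synthetic"
    SDE_MEDIATED = "SDE-mediated"
    DP_RELEASE = "differentially private release"


@dataclass(frozen=True)
class ReleaseDecision:
    """Hypothetical per-output record for a data management plan."""

    output_name: str
    tier: Tier
    risk_assessment_ref: Optional[str] = None
    epsilon: Optional[float] = None
    epsilon_justification: Optional[str] = None

    def problems(self):
        """Return a list of missing items; an empty list means the record is complete."""
        issues = []
        needs_assessment = self.tier in (Tier.OPEN_SYNTHETIC, Tier.DP_RELEASE)
        if needs_assessment and not self.risk_assessment_ref:
            issues.append("missing disclosure-risk assessment reference")
        if self.tier is Tier.DP_RELEASE:
            if self.epsilon is None or self.epsilon <= 0:
                issues.append("missing or non-positive epsilon")
            if not self.epsilon_justification:
                issues.append("missing epsilon justification")
        return issues


# Hypothetical example: a DP release that records its assessment but not its epsilon.
incomplete = ReleaseDecision(
    output_name="registry_synthetic_v1",
    tier=Tier.DP_RELEASE,
    risk_assessment_ref="DRA-001",
)
print(incomplete.problems())
```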
Through 2026, expect funders and journals to converge on synthetic-plus-SDE tiering as the default for sensitive datasets, with open synthetic release reserved for lower-risk data and differential privacy applied wherever a genuinely open output is required. Institutions documenting their tiering decisions now will be better placed as reviewers start asking for that evidence as standard.