Tag: london secure data environment

  • Synthetic Data Privacy: Where FAIR Meets GDPR

    Synthetic data privacy is achievable for FAIR-compliant sharing only when data generation is paired with a secure data environment and a formal statistical guarantee such as differential privacy. Synthetic records alone do not satisfy the General Data Protection Regulation’s anonymisation test, because generative models can retain traces of the real data they were trained on.

    Synthetic data is artificial information produced by a model trained on a real dataset, engineered to reproduce that dataset’s statistical structure without containing any single individual’s actual record.

    Institutions holding clinical trial records, patient registries or HR data face a genuine conflict: FAIR principles push toward accessible, reusable outputs, while the GDPR pushes toward the narrowest possible disclosure of personal data. Synthetic data is often marketed as the technology that resolves this tension. Recent regulatory and research literature says it narrows the gap but does not close it alone.

    What is synthetic data, and how does it map to the FAIR principles?

    Synthetic data can advance all four FAIR data principles set out by Wilkinson et al. in the 2016 FAIR Guiding Principles paper, but unevenly. It strengthens Findability and Accessibility fastest, since a synthetic proxy can be indexed and downloaded with far fewer legal barriers than the source. Interoperability and Reusability depend more on how faithfully the generation model preserves structure.

    FAIR principle | What synthetic data contributes
    Findable | A citable, publicly indexable surrogate dataset with rich metadata, while the source stays access-controlled
    Accessible | Open or low-barrier download, removing the need for a data access committee for exploratory work
    Interoperable | Same schema and controlled vocabularies as the source, so pipelines and tools can be built and tested in advance
    Reusable | Supports method development, teaching and model training without repeated re-applications for the real data

    The catch is quality drift. A synthetic dataset that has been aggressively de-identified to reduce re-identification risk typically loses the rare-event structure that made the original data valuable, which undermines Reusability even as it improves Accessibility.

    Does synthetic data satisfy GDPR’s anonymisation standard?

    Not automatically. Under GDPR Recital 26, data is anonymous — and therefore outside the regulation’s scope — only if the data subject is no longer identifiable “by any means reasonably likely to be used”. Generative models can memorise unusual or rare records from their training data, and those traces can resurface in synthetic outputs.

    The European Data Protection Supervisor’s TechSonar assessment states plainly that synthetic data is not per se anonymous and can still reflect biases or leak information from the source. The UK Information Commissioner’s Office reaches a parallel conclusion in its anonymisation, pseudonymisation and privacy-enhancing technologies guidance: generating a synthetic dataset from personal data is itself processing, requiring a lawful basis and an assessment of residual identifiability — it does not become anonymous by construction. Most synthetic datasets sit closer to pseudonymised data than to true anonymisation, keeping them inside GDPR’s scope rather than exempting them from it.

    • The generation step itself is processing of personal data and needs a lawful basis (typically legitimate interests or a research-specific condition).
    • Rare or unique combinations of attributes in the source data are the most common source of residual re-identification risk in the output.
    • A documented disclosure risk assessment — not vendor assurance — is what a regulator or ethics committee will expect to see before publication.
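
    One simple input to such a risk assessment is a count of how many source records share each combination of quasi-identifiers. The sketch below is illustrative only: the function name, the column names, the threshold k and the toy records are all assumptions, and a real assessment would go well beyond a uniqueness count.

```python
from collections import Counter


def rare_combinations(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Toy, made-up records (hypothetical columns, not real data).
records = [
    {"age_band": "40-49", "region": "London", "diagnosis": "A"},
    {"age_band": "40-49", "region": "London", "diagnosis": "A"},
    {"age_band": "40-49", "region": "London", "diagnosis": "B"},
    {"age_band": "90+", "region": "South West", "diagnosis": "C"},
]

# Flag any (age_band, region) combination held by fewer than 2 records.
flagged = rare_combinations(records, ["age_band", "region"], k=2)
print(flagged)  # {('90+', 'South West'): 1}
```

    Combinations flagged this way are the records a generative model is most likely to memorise, so they are natural candidates for suppression, coarsening or closer review before any synthetic release.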

    How do secure data environments complement synthetic data?

    A secure data environment (SDE), also called a trusted research environment (TRE), keeps sensitive data in place and lets approved researchers run analysis against it remotely, with only vetted outputs allowed to leave. This is the model the Goldacre Review — commissioned by the Department of Health and Social Care and published in April 2022 as “Better, Broader, Safer: using health data for research and analysis” — recommended as the default access route for NHS data instead of distributing dataset copies. NHS England’s subsequent Secure Data Environment policy formalised this, requiring health and social care data for research to be accessed through approved SDEs rather than by dissemination.

    Synthetic data and SDEs are complementary, not competing, tiers of the same access model. A well-designed pipeline uses openly released synthetic data for code development and hypothesis-generation, then reserves the real data — accessed inside the SDE — for the analysis that actually informs a publication or policy decision. Two UK examples show this pattern already in production:

    • Simulacrum, built by the National Cancer Registration and Analysis Service and now maintained via Health Data Insight, is a synthetic cancer-registry dataset that lets researchers write and test analysis code before requesting access to the real registry data inside a TRE.
    • OpenSAFELY issues researchers with dummy datasets that mirror the structure of NHS primary-care records, so code is fully written and reviewed before it ever runs against real patient data inside the secure environment.

    This tiering directly resolves the FAIR-versus-GDPR conflict for the “Accessible” and “Reusable” principles: the synthetic layer is genuinely open, while the sensitive layer never leaves controlled infrastructure.
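
    The tiering only works if code written against the synthetic layer runs unchanged against the real data, which means both must share a schema. A minimal sketch of such a check follows; the schema, column names and rows are hypothetical and do not describe the actual structure of Simulacrum or OpenSAFELY dummy data.

```python
# Hypothetical schema: column names and types are illustrative only.
EXPECTED_SCHEMA = {"patient_id": str, "age_band": str, "tumour_stage": int}


def schema_problems(rows, schema):
    """List mismatches between rows (dicts) and an expected column-to-type schema."""
    problems = []
    for i, row in enumerate(rows):
        missing = sorted(schema.keys() - row.keys())
        extra = sorted(row.keys() - schema.keys())
        if missing:
            problems.append(f"row {i}: missing {missing}")
        if extra:
            problems.append(f"row {i}: unexpected {extra}")
        for column, expected_type in schema.items():
            if column in row and not isinstance(row[column], expected_type):
                problems.append(f"row {i}: {column} is not {expected_type.__name__}")
    return problems


# Made-up synthetic rows; the second has a wrongly typed tumour_stage.
synthetic_rows = [
    {"patient_id": "S001", "age_band": "50-59", "tumour_stage": 2},
    {"patient_id": "S002", "age_band": "60-69", "tumour_stage": "II"},
]

print(schema_problems(synthetic_rows, EXPECTED_SCHEMA))
# ['row 1: tumour_stage is not int']
```

    Running the same check on the synthetic release and, inside the SDE, on the real extract gives early warning that analysis code developed on the open tier may fail on the controlled one.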

    Mechanism | Where the sensitive data sits | GDPR status | Best FAIR fit
    Open synthetic release | Never leaves the generation pipeline | Requires disclosure-risk assessment; rarely fully anonymous | Findable, Accessible
    Secure/trusted data environment | Stays on controlled infrastructure at all times | Personal data processed under strict access controls and a lawful basis | Interoperable, Reusable
    Differentially private release | Leaves as noised aggregates or a noised synthetic model | Stronger anonymisation argument, quantifiable via the privacy budget | Accessible, Reusable

    What does differential privacy add to synthetic data pipelines?

    Differential privacy adds a mathematical guarantee that no single training record materially changed the output, expressed through a privacy budget parameter (epsilon). A smaller epsilon gives a stronger guarantee but degrades statistical utility, so the choice of epsilon is a governance decision, not just a technical one. The US National Institute of Standards and Technology’s guidelines for evaluating differential privacy guarantees (SP 800-226) set out how organisations should document and justify that choice rather than treat it as a default setting.
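
    The trade-off between epsilon and utility can be seen in the Laplace mechanism, one of the standard ways to release a differentially private count. The sketch below is for illustration only: the function names and the example count are assumptions, and naive floating-point noise like this is not suitable for production, where a vetted differential privacy library should be used.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity defaults to 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for epsilon in (0.1, 1.0, 10.0):
    releases = [round(dp_count(100, epsilon, rng), 1) for _ in range(5)]
    print(f"epsilon={epsilon}: {releases}")
```

    The expected absolute error equals the noise scale, so a count released at epsilon 0.1 is off by about 10 on average, while the same count at epsilon 10 is off by about 0.1. That is the utility cost a governance body accepts when it picks a smaller epsilon.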

    Applied to synthetic data generation — for example through differentially private training of the generative model — this converts a vague “we anonymised it” claim into an auditable parameter that a data protection officer or ethics committee can evaluate. That auditability is what most synthetic-data-and-GDPR commentary skips, and it is the biggest lever institutions have for turning synthetic data into a defensible compliance position rather than a marketing claim.

    Frequently asked questions

    Is synthetic data a privacy risk?

    Yes. Synthetic data is not automatically private: generative models can memorise rare records from the source, and re-identification remains possible through linkage with other datasets. The Royal Society’s 2024 review of synthetic data found that privacy cannot be verified by comparison with real data alone, so every release needs a documented risk assessment.

    What is synthetic personal data?

    Synthetic personal data is artificial data generated by a model trained on real personal records, reproducing statistical patterns without a direct link to any individual. Under GDPR Recital 26, it counts as anonymous only if no individual can be identified by any means reasonably likely to be used; otherwise it remains personal data, typically treated as pseudonymised, and stays subject to full GDPR obligations.

    How does synthetic data protect privacy?

    It protects privacy by replacing real records with generated ones that preserve aggregate statistical properties while breaking the direct record-to-person link. Adding differential privacy noise during generation gives a mathematical bound on how much any individual’s data could have influenced the output, strengthening the guarantee beyond generation alone.

    What are synthetic data examples in research?

    UK examples include Simulacrum, NHS England’s synthetic cancer-registry dataset built from National Cancer Registration and Analysis Service records, and OpenSAFELY’s dummy datasets, which let researchers write analysis code before running it inside a secure data environment against the real data.

    What should institutions do next?

    Research offices, data custodians and publishers should stop treating “synthetic” as a synonym for “anonymous” in data management plans. A defensible strategy states explicitly: which tier — open synthetic, SDE-mediated, or differentially private release — applies to which output; a documented disclosure-risk assessment for any synthetic release; and, where a formal guarantee is used, the epsilon value and its justification. Research data governance frameworks increasingly expect this specificity rather than a blanket “anonymised” claim.
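
    One way to make that explicitness checkable is to record each output as a small structured declaration and flag what is missing. The sketch below is hypothetical throughout: the class, field names, tier labels and reference strings are assumptions for illustration, not a format any funder or regulator prescribes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the three tiers described in the text.
TIERS = {"open_synthetic", "sde_mediated", "dp_release"}


@dataclass
class ReleaseDeclaration:
    output_name: str
    tier: str
    is_synthetic: bool = False
    risk_assessment_ref: Optional[str] = None
    epsilon: Optional[float] = None
    epsilon_justification: Optional[str] = None

    def gaps(self):
        """Return the documentation items still missing for this output."""
        issues = []
        if self.tier not in TIERS:
            issues.append(f"unknown tier: {self.tier}")
        if self.is_synthetic and not self.risk_assessment_ref:
            issues.append("synthetic release lacks a disclosure-risk assessment")
        if self.tier == "dp_release" and self.epsilon is None:
            issues.append("differentially private release lacks an epsilon value")
        if self.epsilon is not None and not self.epsilon_justification:
            issues.append("epsilon is stated without a justification")
        return issues


# Made-up example: epsilon is recorded but not justified.
declaration = ReleaseDeclaration(
    output_name="registry_synthetic_v1",
    tier="dp_release",
    is_synthetic=True,
    risk_assessment_ref="DRA-2025-001",
    epsilon=1.0,
)
print(declaration.gaps())  # ['epsilon is stated without a justification']
```

    A record like this can sit alongside the data management plan, so a reviewer sees the tier, the assessment reference and the epsilon rationale in one place.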

    Through 2026, expect funders and journals to converge on synthetic-plus-SDE tiering as the default for sensitive datasets, with open synthetic release reserved for lower-risk data and differential privacy applied wherever a genuinely open output is required. Institutions documenting their tiering decisions now will be better placed as reviewers start asking for that evidence as standard.

  • NHS Secure Data Environment: 5 Regions Compared

    England does not have one NHS Secure Data Environment — it has twelve. An NHS secure data environment is a controlled analysis platform that lets approved researchers work with de-identified NHS data without ever downloading it, and eleven regional SDEs plus one national NHS England SDE now make up the NHS Research SDE Network. This article compares how the South West, London, West Midlands, Yorkshire & Humber and North West SDEs actually differ in governance, oversight and researcher onboarding.

    A Secure Data Environment (SDE) is a data storage and analysis platform that upholds the “Five Safes” framework — Safe People, Safe Projects, Safe Settings, Safe Data and Safe Outputs — so that approved researchers can analyse sensitive health and care data without it leaving NHS control. Every regional SDE follows this same national framework, but each was stood up by a different set of Integrated Care Systems (ICSs) or Integrated Care Boards (ICBs), so the governance body, the number of local partners, and the practical route a researcher takes to get access all vary by region.

    What is an NHS Secure Data Environment?

    An NHS Secure Data Environment is a remote, audited analysis platform — sometimes called a Trusted Research Environment (TRE) — where approved researchers can query de-identified NHS health and social care data without exporting it. The programme follows the 2022 government review by Professor Ben Goldacre, which recommended SDEs as the default route for NHS research data access, and was adopted as national policy under the Data Saves Lives strategy, with the SDE programme funded from 2023.

    The NHS Research SDE Network is made up of eleven regional SDEs plus one national SDE run directly by NHS England — twelve nodes in total. Every node applies the same Five Safes controls, but each regional SDE was established, funded and governed by a different local partnership of NHS organisations, which is why the access experience differs from region to region even though the underlying rules do not.

    How do the five regional SDEs compare in governance?

    The table below compares the South West, London, West Midlands, Yorkshire & Humber and North West SDEs on who governs them, how many local health partners feed into them, and what makes each one operationally distinct.

    Region | Governing partnership | Local partners | Distinctive feature
    South West | South West SDE partnership, supported by Health Innovation West of England / South West Life Sciences | South West England ICSs | Public-facing “Your Data. Your Choice” leaflet and patient-communication programme
    London | Split leadership: North East London ICS runs the London Data Service; North West London ICS runs the London Analytics Platform, under a pan-London Independent Information Access Group (IIAG) | All London ICSs, plus five sub-regional Data Access Committees (North West, North Central, North East, South East, South West London) | Two-part architecture (data service + analytics platform) built on the existing Discover-NOW model running since 2018
    West Midlands | West Midlands SDE partnership, owned and run by the NHS | Six ICSs: Black Country; Birmingham & Solihull; Coventry & Warwickshire; Herefordshire & Worcestershire; Shropshire, Telford & Wrekin; Staffordshire & Stoke-on-Trent | Single regional “Apply to West Midlands SDE” application portal covering all six ICSs
    Yorkshire & Humber | Yorkshire & Humber SDE partnership | Yorkshire & Humber ICSs and data providers | “Single front door” data-discovery model plus a Citizens Jury for public and patient engagement
    North West | North West SDE partnership | Three Integrated Care Boards covering North West England | Values-led governance charter (Diverse, Open, Accountable, Inclusive); standardised researcher access process still being finalised

    The clearest structural outlier is London, which splits governance across two platforms and five sub-regional committees rather than a single regional board — a reflection of London’s five-ICS geography rather than a difference in national policy.

    What access route do researchers follow in each region?

    Every regional SDE sits inside the same national guardrails, but the practical starting point for a researcher differs. In broad terms, the steps are: register interest with the relevant regional SDE team; secure a data-sharing agreement (nationally via NHS England’s Data Access Request Service, or through the regional equivalent); have the project reviewed against the Five Safes by the regional oversight body; complete researcher accreditation and training; then work inside the controlled virtual environment until outputs are checked and cleared for removal.

    • South West: researchers apply directly through the South West SDE website, which also publishes a public leaflet explaining the region’s data-use commitments.
    • London: applications go through the IIAG or the relevant sub-regional Data Access Committee, reflecting London’s split data-service/analytics-platform model; the platform is expected to be operational from April 2026, with GP data collection for the London Data Service having begun in Spring 2025.
    • West Midlands: a single “apply now” route covers all six member ICSs, rather than requiring separate applications per local health system.
    • Yorkshire & Humber: the “single front door” model directs researchers to a published dataset catalogue on the Health Data Research Gateway before submitting a data-availability request.
    • North West: the SDE is functioning but has stated publicly that a standardised, fully documented access process is still being developed across its three ICB partners.

    For institutional research administration teams supporting multi-site studies, this means the pre-application groundwork — which regional committee to approach, and what documentation it expects — cannot be assumed to be identical just because every SDE runs the same Five Safes model.

    Common questions about NHS Secure Data Environments

    What is a secure data environment?

    A secure data environment is a data storage and analysis platform that lets approved researchers query sensitive health and care data remotely, without downloading or removing it. It is sometimes described as a Trusted Research Environment, Data Safe Haven or Databank, and is governed by the Five Safes framework.

    What is the difference between an SDE and a TRE?

    An SDE is functionally the same concept as a Trusted Research Environment (TRE): a controlled workspace for processing sensitive data. NHS communications favour the term “SDE” because it is better understood by patients and the public, while “TRE” remains common in academic and cross-sector data-access literature.

    What is an NHS SDE?

    An NHS SDE is one of the twelve platforms in the NHS Research Secure Data Environment Network — eleven regional SDEs plus one national SDE run by NHS England — that simplify and accelerate approved researchers’ secure access to NHS health and social care data.

    How many NHS Secure Data Environments are there?

    There are twelve nodes in the NHS Research SDE Network: eleven regional SDEs, including South West, London, West Midlands, Yorkshire & Humber and North West, plus one national SDE operated directly by NHS England.

    What this means for institutions and researchers

    The regional model gives each SDE room to reflect local geography and partnership history — six ICSs pooling into one West Midlands portal, five London sub-regions coordinating through an IIAG, a Citizens Jury shaping Yorkshire & Humber’s public-engagement approach. That flexibility is a deliberate design choice under the Data Saves Lives strategy, not an implementation gap.

    For institutions running multi-region studies, the practical implication is that researcher onboarding time and documentation requirements will vary by region even where the underlying data-protection standard does not. Research offices should treat “apply to the SDE” as a region-specific task, confirm which committee or portal governs the target region before drafting a data-access application, and expect the network to keep converging on more standardised processes as regions such as the North West finalise their published access routes and London’s Analytics Platform reaches full operation.