Tag: public health data

  • Incidence vs Prevalence: Key Epidemiological Measures

    Incidence and prevalence are two foundational measures in epidemiology that answer different questions about how a condition affects a population. Incidence measures how many new cases of a condition arise in a population over a period of time, capturing the rate at which cases occur. Prevalence measures how many cases exist in a population at a point in time or over a defined period, capturing the existing burden. Confusing the two leads to serious misinterpretation, so the distinction is a methodological essential rather than a mere matter of terminology.

    Both measures rest on the same underlying ideas of a case, a population at risk, and a time reference, but they assemble those ingredients differently. Getting the definitions right is the first step to choosing the correct measure for a given research or planning question.

    How incidence is calculated

    Incidence quantifies new cases relative to a population at risk over time, and it comes in two common forms. Cumulative incidence divides the number of new cases by the number of people at risk at the start of the period, giving a proportion that approximates the average risk of developing the condition over that period. Incidence rate, sometimes called incidence density, divides new cases by the total person-time at risk, which accounts for individuals being observed for different lengths of time and for people entering or leaving the population. Both forms require defining the population at risk precisely, excluding those who already have the condition, and stating the observation window clearly. The person-time approach is particularly useful in studies where people are followed for varying durations, because each individual contributes time at risk only for as long as they are observed and remain capable of developing the condition. Expressing the result, for example, as cases per 1,000 person-years makes the time dimension explicit and allows fair comparison between groups followed for different lengths of time.
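
    The two calculations above can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function names, the default multiplier of 1,000 person-years and every figure are assumptions made for the example.

```python
def cumulative_incidence(new_cases: int, at_risk_at_start: int) -> float:
    """Proportion of the initially at-risk population that became a new case."""
    if at_risk_at_start <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases / at_risk_at_start


def incidence_rate(new_cases: int, person_years: float, per: float = 1_000) -> float:
    """New cases per `per` person-years at risk (incidence density)."""
    if person_years <= 0:
        raise ValueError("person-time at risk must be positive")
    return new_cases / person_years * per


# Illustrative numbers only: 1,000 people at risk at the start, 30 new cases.
risk = cumulative_incidence(new_cases=30, at_risk_at_start=1_000)

# Each person contributes time only while observed and still at risk.
follow_up_years = [2.0, 0.5, 1.5, 3.0]  # hypothetical follow-up per person
rate = incidence_rate(new_cases=1, person_years=sum(follow_up_years))

print(f"cumulative incidence: {risk:.3f}")
print(f"incidence rate: {rate:.1f} per 1,000 person-years")
```

    Keeping the two functions separate mirrors the difference in denominators: a head count at the start of the period for cumulative incidence, and summed person-time for the incidence rate.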

    How prevalence is calculated

    Prevalence divides the number of existing cases by the total population, counting everyone who currently has the condition regardless of when it began. Point prevalence refers to a single point in time, answering how many cases exist right now, while period prevalence covers a defined interval and counts anyone who had the condition at any time during that interval. Because prevalence includes both long-standing and recently arisen cases, it reflects the accumulated stock of cases in the population rather than the flow of new ones.
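
    A minimal sketch of point and period prevalence follows, assuming each case is recorded with an onset date and an optional resolution date. The `Case` type, the boundary conventions (a case counts as resolved on its resolution date) and the use of a single fixed population as the denominator are assumptions for illustration; real studies may define these differently.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Case(NamedTuple):
    onset: date
    resolved: Optional[date]  # None means the case is still ongoing


def point_prevalence(cases: List[Case], population: int, on: date) -> float:
    """Share of the population with an active case on the date `on`."""
    active = sum(
        1 for c in cases
        if c.onset <= on and (c.resolved is None or c.resolved > on)
    )
    return active / population


def period_prevalence(cases: List[Case], population: int,
                      start: date, end: date) -> float:
    """Share of the population with an active case at any time in [start, end]."""
    active = sum(
        1 for c in cases
        if c.onset <= end and (c.resolved is None or c.resolved >= start)
    )
    return active / population


# Hypothetical data: one long-standing case, one resolved case, one recent case.
cases = [
    Case(date(2019, 3, 1), None),
    Case(date(2023, 2, 1), date(2023, 4, 1)),
    Case(date(2023, 11, 15), None),
]
point = point_prevalence(cases, population=1_000, on=date(2023, 7, 1))
period = period_prevalence(cases, population=1_000,
                           start=date(2023, 1, 1), end=date(2023, 12, 31))
print(point, period)
```

    In this example only the long-standing case is active on 1 July, while all three cases fall within the calendar year, so period prevalence is higher than point prevalence.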

    Incidence and prevalence compared

    Feature        | Incidence                         | Prevalence
    ---------------+-----------------------------------+-------------------------------
    What it counts | New cases arising                 | Existing cases present
    Time element   | Over a period (flow)              | At a point or period (stock)
    Denominator    | Population at risk or person-time | Total population
    Best for       | Studying causes and risk          | Describing burden and planning

    Data sources and case ascertainment

    Both measures depend on how reliably cases are identified, a process known as case ascertainment. Cases may be captured through disease registers, routine health records, notification systems for certain conditions, or purpose-designed studies, and each source has its own coverage and biases. Incidence is especially sensitive to the timing and completeness of detection, because it counts new cases within a defined window; if detection is delayed or incomplete, new cases may be missed or assigned to the wrong period. Prevalence is sensitive to whether long-standing cases remain in the source from which counts are drawn. For both measures, a clearly stated and consistently applied case definition is essential, because changes in definition or in how actively cases are sought can move the numbers independently of any real change. This is why epidemiological reporting standards emphasise documenting the data source, the case definition and the ascertainment method together with the measure itself.

    The relationship between them

    Incidence and prevalence are linked, and the link is intuitive once framed as flow and stock. In broad terms, prevalence reflects both how quickly new cases arise (incidence) and how long cases persist (duration). When a condition lasts a long time, even a modest incidence can produce a high prevalence, because cases accumulate faster than they leave the population through recovery or death. When cases resolve quickly, prevalence stays low even if incidence is high, because cases flow out almost as fast as they arrive. This conceptual relationship explains why the two measures can move in different directions: a change that shortens how long cases persist can lower prevalence even while incidence is unchanged or rising. For that reason the two measures must never be used interchangeably.
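
    The flow-and-stock idea can be made concrete with the standard textbook steady-state approximation, in which the prevalence odds equal the incidence rate multiplied by the mean duration. The text above does not state this formula; it is offered here as a sketch that assumes a stable population with constant incidence and duration. The function name and all figures are illustrative.

```python
def steady_state_prevalence(incidence_rate: float, mean_duration: float) -> float:
    """Approximate prevalence proportion under steady-state assumptions.

    incidence_rate: new cases per person-year at risk
    mean_duration:  average time a case persists, in years
    Uses prevalence odds = incidence_rate * mean_duration.
    """
    if incidence_rate < 0 or mean_duration < 0:
        raise ValueError("inputs must be non-negative")
    odds = incidence_rate * mean_duration
    return odds / (1 + odds)


# Modest incidence, long duration (hypothetical long-lasting condition).
long_lasting = steady_state_prevalence(incidence_rate=0.002, mean_duration=20)

# High incidence, short duration (hypothetical condition resolving in about a week).
short_lived = steady_state_prevalence(incidence_rate=0.2, mean_duration=0.02)

print(f"long-lasting: {long_lasting:.4f}")  # about 0.0385
print(f"short-lived:  {short_lived:.4f}")   # about 0.0040
```

    Although the second condition arises a hundred times more often, its prevalence is roughly a tenth of the first, because cases leave the population almost as quickly as they arrive.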

    Common pitfalls in interpretation

    Because the two measures are so often reported side by side, several errors recur. Treating prevalence as if it indicated risk is a frequent mistake: a high prevalence may reflect that cases persist for a long time rather than that the condition arises frequently, so prevalence alone says little about the chance of developing a condition. Comparing an incidence figure from one study with a prevalence figure from another, as though they were the same quantity, produces meaningless conclusions. A further pitfall is failing to define the population at risk consistently; if people who already have the condition are not excluded from the incidence denominator, the calculated incidence will be understated. Finally, both measures are sensitive to how a case is defined and detected: broadening the case definition or improving detection can raise measured incidence or prevalence without any real change in the underlying occurrence, which is why the case definition should always be reported alongside the figure.
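
    The denominator pitfall is easy to see with numbers. The following sketch uses invented figures and a hypothetical helper function to compare a denominator that wrongly includes existing cases with one restricted to the population at risk.

```python
def proportion(new_cases: int, denominator: int) -> float:
    """New cases divided by the chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return new_cases / denominator


# Illustrative figures only.
total_population = 10_000
existing_cases = 2_000   # already have the condition at the start of the period
new_cases = 160

# Wrong: people who already have the condition cannot become new cases.
naive = proportion(new_cases, total_population)

# Right: restrict the denominator to those at risk at the start.
at_risk = total_population - existing_cases
corrected = proportion(new_cases, at_risk)

print(f"naive: {naive:.3f}, corrected: {corrected:.3f}")  # naive: 0.016, corrected: 0.020
```

    The inflated denominator understates cumulative incidence here by a fifth, and the size of the distortion grows with the share of the population that already has the condition.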

    When to use which

    Use incidence when studying the development of a condition, investigating its causes, or evaluating risk, because it captures the flow of new cases and is the natural measure for cause-and-effect questions. Use prevalence when describing the existing burden, planning services and resources, or characterising how widespread a condition is at a moment in time, because it reflects the total caseload a system must manage. Reporting which measure was used, together with its denominator and time frame, is critical, and reporting guidelines such as STROBE prompt exactly this kind of clarity for observational studies.

    Both measures depend on accurate population denominators, which typically come from a census or population register; this dependence is what ties them to research data infrastructure. The same denominators underpin death rates. Consistent terminology drawn from the CASRAI dictionary helps keep these definitions stable across studies, and authors can consult the guidance for authors when reporting them.

    Frequently asked questions

    Can incidence be higher than prevalence?

    It can, particularly for conditions that resolve quickly. Because prevalence reflects cases that persist, a condition with short duration may show high incidence over a period but low prevalence at any one point in time, since new cases leave the population almost as fast as they arrive and do not accumulate.

    Why is the denominator different for each?

    Incidence uses the population at risk or person-time, because only those who can newly develop the condition are relevant to counting new cases. Prevalence uses the total population, because it counts all existing cases regardless of when they arose.

    Which measure should a study report?

    It depends on the question. Studies of causation and risk report incidence; studies of burden, planning and service provision report prevalence. The chosen measure, its denominator and its time frame should always be stated explicitly so readers can interpret it correctly.

  • Open Data in Public-Health Research: Standards

    Public health data describe populations and the events that affect them, and increasingly they are expected to be shared so that findings can be verified, combined and reused. Open data means data made available for others to access, use and redistribute under clear terms, while the FAIR principles set out that data should be Findable, Accessible, Interoperable and Reusable. Together they define a standards-based approach to managing research data responsibly, balancing the value of openness against the obligations that attach to data about people.

    These ideas have moved from aspiration to expectation: funders, journals and institutions increasingly ask researchers to describe how their data will be managed and shared. Understanding the standards involved is now part of doing public-health research well.

    FAIR data principles

    The FAIR principles are a widely adopted framework for good data stewardship, and each letter carries a specific meaning. Findable means data carry rich metadata and a persistent identifier so they can be located by both people and machines. Accessible means the data, or at least their metadata, can be retrieved through a standard, open protocol, with any access conditions clearly stated. Interoperable means data use shared vocabularies, standards and formats so they can be combined with other datasets without bespoke translation. Reusable means data are richly described, with clear provenance and an explicit usage licence, so that others can determine whether and how the data may be reused. A crucial point is that FAIR is about machine-actionable description; it does not require data to be fully open, which matters for sensitive public-health data that cannot simply be published. This is a common source of confusion: a dataset held under strict access controls can still be fully FAIR if it is well described, carries a persistent identifier, uses interoperable formats and states clear terms of reuse. The principles describe how data should be prepared and exposed to the world, not whether the data themselves must be freely downloadable by anyone.

    Data-sharing standards

    Responsible sharing rests on a few recurring components that turn the FAIR principles into practice. Datasets are described with structured metadata, assigned persistent identifiers, deposited in trustworthy repositories, and released under explicit licences. These practices map directly onto FAIR and are reinforced by the wider data infrastructure community and by the controlled vocabulary in the CASRAI dictionary.

    Element               | Purpose
    ----------------------+------------------------------------------------------------------
    Metadata              | Describe content, methods and provenance for discovery and reuse
    Persistent identifier | Provide a stable, citable reference to the dataset
    Licence               | State the legal terms under which the data may be reused
    Repository            | Preserve the data and provide appropriate, controlled access

    De-identification and anonymisation

    Because public-health data can relate to identifiable individuals, sharing must protect privacy before any wider release. At a high level, de-identification removes or transforms information that could identify a person, while anonymisation aims to reduce the risk of re-identification to an acceptable level given the data and the context. The specific techniques, thresholds and acceptable residual risk are governed by law and institutional policy rather than by a single universal rule, and they are applied before data are opened. The key methodological point, framed here as governance rather than clinical guidance, is that openness and privacy protection are managed together as a deliberate design choice, not traded off blindly.
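
    As a rough illustration of the distinction, the sketch below drops direct identifiers, coarsens age into ten-year bands, and then reports the smallest group size across the remaining quasi-identifiers, a check in the style of k-anonymity. The field names, the choice of bands and the choice of technique are all assumptions for the example; as noted above, the techniques and thresholds that actually apply are set by law and institutional policy, and this sketch is not a compliance method.

```python
from collections import Counter
from typing import Dict, List

DIRECT_IDENTIFIERS = {"name", "health_id"}  # hypothetical field names


def de_identify(record: Dict) -> Dict:
    """Remove direct identifiers and generalise age into a ten-year band."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    age = out.pop("age")
    low = age // 10 * 10
    out["age_band"] = f"{low}-{low + 9}"
    return out


def smallest_group(records: List[Dict], quasi_identifiers: List[str]) -> int:
    """Size of the rarest combination of quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


# Invented records for illustration.
records = [
    {"name": "A", "health_id": "001", "age": 34, "region": "North", "condition": "X"},
    {"name": "B", "health_id": "002", "age": 37, "region": "North", "condition": "Y"},
    {"name": "C", "health_id": "003", "age": 52, "region": "South", "condition": "X"},
]
released = [de_identify(r) for r in records]
k = smallest_group(released, ["age_band", "region"])
print(released[0])
print("smallest group size:", k)
```

    A smallest group size of 1, as in this example, means at least one record is still unique on age band and region even though names and identifiers have been removed. That is why removing direct identifiers alone does not settle the question of re-identification risk.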

    Interoperability and controlled vocabularies

    The Interoperable element of FAIR depends heavily on shared controlled vocabularies and standard formats, which let data from different sources be combined without ambiguity. When two datasets use the same defined term to mean the same thing, they can be linked and analysed together; when they use different words, or the same word for different concepts, integration becomes error-prone. Public-health data are particularly affected, because the same measure can be defined in subtly different ways across systems, as the distinction between crude and standardised rates illustrates. Adopting recognised vocabularies and classifications, and recording exactly which version was used, turns interoperability from an aspiration into a practical property of the data. This is also where definitional infrastructure earns its keep: a stable, shared set of terms reduces the translation effort every time datasets are combined, and it makes automated reuse far more reliable.
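
    One way to picture this is a mapping from each source's local field names to a single shared vocabulary, with the vocabulary version recorded alongside the data. The sketch below is hypothetical throughout: the source names, the local and shared terms and the version string are invented, and it does not represent any particular vocabulary.

```python
from typing import Dict

VOCABULARY_VERSION = "example-vocab-1.0"  # hypothetical version label

# Hypothetical mappings from local field names to shared terms.
LOCAL_TO_SHARED: Dict[str, Dict[str, str]] = {
    "registry_a": {"new_cases_rate": "incidence_rate", "crude": "crude_rate"},
    "registry_b": {"inc": "incidence_rate", "std_rate": "age_standardised_rate"},
}


def harmonise(source: str, record: Dict[str, float]) -> Dict[str, object]:
    """Rename local fields to shared terms and record the vocabulary version."""
    mapping = LOCAL_TO_SHARED[source]
    unmapped = [field for field in record if field not in mapping]
    if unmapped:
        # Refuse to guess: an unmapped term is an integration error.
        raise KeyError(f"no shared term for {unmapped} in {source}")
    out: Dict[str, object] = {mapping[field]: value for field, value in record.items()}
    out["vocabulary_version"] = VOCABULARY_VERSION
    return out


a = harmonise("registry_a", {"new_cases_rate": 12.5, "crude": 8.1})
b = harmonise("registry_b", {"inc": 11.9, "std_rate": 7.4})
print(a)
print(b)
```

    After harmonisation both sources expose `incidence_rate` under the same name, while the crude and standardised rates stay distinct rather than being merged into one ambiguous "rate" field.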

    Metadata and persistent identifiers

    Metadata are what make a dataset interpretable by someone who did not create it: they record what was measured, how, when, over which population and under what definitions, and they capture the provenance of the data. Without good metadata, even openly available data are difficult to reuse correctly. Persistent identifiers, such as a DOI assigned to a dataset, give it a stable address so that it can be cited and tracked over time even if its location changes. Together, rich metadata and persistent identifiers make a dataset Findable and Reusable, and they let derived measures such as incidence and prevalence be traced back to the exact source data that produced them.
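
    A minimal sketch of such a record is shown below, with a check that no descriptive field has been left blank. The field list is an assumption based on the items named in the paragraph above, not a formal metadata standard, and the identifier is a placeholder rather than a real DOI.

```python
from dataclasses import asdict, dataclass, replace
from typing import List


@dataclass(frozen=True)
class DatasetMetadata:
    identifier: str       # persistent identifier, for example a DOI
    title: str
    measure: str          # what was measured
    case_definition: str  # the definition under which cases were counted
    population: str       # over which population
    period: str           # when
    source: str           # provenance of the data
    licence: str          # terms of reuse


def missing_fields(meta: DatasetMetadata) -> List[str]:
    """Names of fields that are empty or contain only whitespace."""
    return [name for name, value in asdict(meta).items() if not value.strip()]


record = DatasetMetadata(
    identifier="doi:10.0000/placeholder",  # placeholder, not a real DOI
    title="Example regional incidence dataset",
    measure="incidence rate per 1,000 person-years",
    case_definition="hypothetical case definition, version 2",
    population="adult residents of an example region",
    period="2020-01-01 to 2022-12-31",
    source="hypothetical disease register extract",
    licence="CC BY 4.0",
)
incomplete = replace(record, licence="", case_definition="  ")

print(missing_fields(record))      # []
print(missing_fields(incomplete))  # ['case_definition', 'licence']
```

    Recording the case definition, population and period next to the identifier is what lets a reported incidence or prevalence figure be traced back to the data and definitions that produced it.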

    Data management plans and the research lifecycle

    Standards-based sharing is most effective when it is planned from the start rather than bolted on at publication. A data management plan sets out, early in a project, how data will be collected, documented, stored, protected and ultimately shared, and many funders now require one. Building FAIR practice into the plan means deciding in advance which repository will hold the data, what metadata standard will describe it, how identifiers will be assigned, and what governance will control access. This lifecycle view avoids a common failure mode in which data that could have been shared are effectively lost because they were never documented well enough to be understood later. It also makes de-identification and licensing decisions deliberate rather than rushed, and it connects neatly to reproducibility, because a dataset that is well managed throughout its life is far easier for others to verify and reuse.

    Governance and reuse

    Open does not mean unmanaged. Governance frameworks set out who may access data, under what conditions, and for what purposes, balancing the goal of transparency against privacy, consent and legal obligations. For sensitive data, this often means controlled access through an application and approval process rather than unrestricted download; data shared in this way can still be FAIR. Clear governance, combined with transparent reporting guidelines such as STROBE, supports trustworthy reproducibility and responsible reuse, because it lets others understand both how the data were produced and how they may legitimately be used. Researchers can consult the guidance for authors to align their data-sharing practice with these standards from the start of a project.

    Frequently asked questions

    Does FAIR mean the same thing as open?

    No. FAIR concerns how well data are described and structured for discovery and reuse, including by machines. Data can be FAIR while access remains controlled, which is common for sensitive public-health datasets where full openness is constrained by privacy and legal duties.

    Why are persistent identifiers important?

    A persistent identifier gives a dataset a stable, citable reference that does not break over time even if the data move. It supports findability, enables proper data citation and credit, and lets analyses be traced back to the exact source data used.

    How are privacy and openness reconciled?

    Through de-identification or anonymisation applied before release, combined with governance frameworks that define access conditions. The aim is to maximise responsible reuse while keeping the risk of re-identification within legally and ethically acceptable limits.