Definition · Plain-language

Missing data

Missing data occurs when no value is stored for a variable in an observation, requiring researchers to diagnose the cause and apply appropriate handling methods.

The three mechanisms of missingness

To address missing data, researchers must first understand why the values are missing. Missing Completely at Random (MCAR) means the probability of missingness is unrelated to any values in the dataset, observed or unobserved. Missing at Random (MAR) means the probability of missingness depends on other observed variables but, once those are accounted for, not on the missing values themselves. Missing Not at Random (MNAR) means the missingness depends on the unobserved value itself, which can introduce severe bias. Exploratory data analysis, such as comparing cases with and without missing values on the observed variables, can reveal evidence against MCAR, but the observed data alone cannot distinguish MAR from MNAR, so that judgement also rests on knowledge of how the data was collected. Whether values are missing randomly or systematically determines which handling methods are valid and, with them, the validity of every inference drawn from the model.
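The difference between the three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables (age, income), the missingness probabilities and the cut-offs are hypothetical choices, not taken from any real study. It compares the true mean of a variable with the mean of the values that remain after each mechanism has removed some of them.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return (true_mean, complete_case_mean) of income under one mechanism."""
    rng = random.Random(seed)
    incomes, observed = [], []
    for _ in range(n):
        age = rng.uniform(20, 70)                    # covariate, always observed
        income = 20 + 0.8 * age + rng.gauss(0, 10)   # variable that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                          # same chance for everyone
        elif mechanism == "MAR":
            p_missing = 0.1 if age < 45 else 0.5     # depends only on observed age
        elif mechanism == "MNAR":
            p_missing = 0.1 if income < 56 else 0.5  # depends on income itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        incomes.append(income)
        if rng.random() >= p_missing:                # value survives and is recorded
            observed.append(income)
    return statistics.fmean(incomes), statistics.fmean(observed)

results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for mechanism, (true_mean, obs_mean) in results.items():
    print(f"{mechanism}: true mean {true_mean:.1f}, complete-case mean {obs_mean:.1f}")
```

In this toy setup the complete-case mean stays close to the truth under MCAR and drifts downward under both MAR and MNAR. The practical difference is that the MAR bias can in principle be corrected using the observed age, whereas the MNAR bias depends on values that were never recorded.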

Traditional deletion methods

Historically, researchers used deletion methods to handle missing values. Listwise deletion (complete-case analysis) removes any participant with a missing value on any variable in the analysis. While simple, this reduces sample size and statistical power, and can introduce bias if the data is not MCAR. Pairwise deletion uses, for each analysis, every observation in which the variables involved are present. Because each statistic is then computed on a different subset of cases, it can produce mathematically inconsistent correlation matrices, for example ones that are not positive semi-definite. Because of these limitations, deletion methods are generally discouraged in modern research unless the proportion of missing data is very small, typically under five percent of the overall sample. Whichever strategy is chosen, researchers should report the percentage of missing values and justify the approach in their published methodology.
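As a minimal sketch of the two deletion strategies, the example below uses a made-up six-row dataset (all names and values are hypothetical) and only the Python standard library. It reports the share of missing values per variable, the cases that survive listwise deletion, and the number of cases available to each pair of variables under pairwise deletion.

```python
from itertools import combinations

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 34, "income": 52, "score": 7},
    {"age": 41, "income": None, "score": 6},
    {"age": None, "income": 48, "score": 9},
    {"age": 29, "income": 39, "score": None},
    {"age": 55, "income": 71, "score": 5},
    {"age": 47, "income": 60, "score": 8},
]
variables = ["age", "income", "score"]

def missing_share(rows, var):
    """Fraction of rows with no value for `var`."""
    return sum(r[var] is None for r in rows) / len(rows)

def listwise(rows, variables):
    """Keep only rows that are complete on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise_counts(rows, variables):
    """Number of usable rows for each pair of variables."""
    return {
        (a, b): sum(r[a] is not None and r[b] is not None for r in rows)
        for a, b in combinations(variables, 2)
    }

complete = listwise(rows, variables)
pair_n = pairwise_counts(rows, variables)

for v in variables:
    print(f"{v}: {missing_share(rows, v):.0%} missing")
print("listwise n =", len(complete))
print("pairwise n =", pair_n)
```

Each variable here is missing in only one row, yet listwise deletion discards half the sample, while every pairwise statistic rests on a different set of four rows. With pandas, the rough equivalents would be `DataFrame.dropna()` for listwise deletion and `DataFrame.corr()`, which excludes missing values pair by pair.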

Modern imputation techniques

Modern statistical practice favours imputation, which replaces missing values with estimated ones. Single imputation fills each gap with one value, such as a mean or a model prediction, and then treats it as if it had been observed, which understates uncertainty. Multiple Imputation (MI) addresses this by creating several filled-in datasets, analysing each separately, and combining the results, conventionally using Rubin's rules. This preserves sample size and accounts for the uncertainty of the missing values, provided the data is MAR. Researchers use packages such as mice in R or scikit-learn in Python to implement these techniques. Because the pooled standard errors include the variation between imputed datasets, multiple imputation keeps researchers from overstating the precision of their findings.
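The combining step is conventionally done with Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and the pooled variance adds the average within-dataset variance to an inflated between-dataset variance. The sketch below shows only this pooling step, using the standard library. The function name `pool_rubin` and the slopes and standard errors are hypothetical, standing in for results from five imputed datasets.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputed datasets")
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1 + 1 / m) * between     # total variance
    return q_bar, math.sqrt(total)

# Hypothetical regression slopes and standard errors from m = 5 imputed datasets.
slopes = [0.52, 0.47, 0.55, 0.50, 0.46]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.11]
squared_errors = [se ** 2 for se in std_errors]

pooled_slope, pooled_se = pool_rubin(slopes, squared_errors)
within_only_se = math.sqrt(statistics.fmean(squared_errors))

print(f"pooled slope: {pooled_slope:.3f}")
print(f"pooled SE: {pooled_se:.3f}")
print(f"SE ignoring between-imputation variance: {within_only_se:.3f}")
```

The pooled standard error is larger than the one obtained by averaging the within-dataset variances alone. That extra term is the uncertainty due to the missing values, which single imputation leaves out.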

Key facts

At a glance

Refers to the absence of recorded data points in a research dataset
Categorised into MCAR, MAR, and MNAR statistical mechanisms
Listwise deletion removes entire cases, reducing sample size and power
Single imputation replaces gaps but underestimates statistical uncertainty
Multiple Imputation (MI) is the gold standard for handling MAR data
Incorrect handling of missing values can introduce significant systematic bias

Common misconceptions

What people often get wrong

Often heard: Missing Completely at Random means that the missing data is not a problem at all.

Actually: Even if data is MCAR, missingness still reduces your sample size and statistical power, making it harder to detect real effects in your analysis.

Often heard: Replacing all missing values with the mean of the column is always a safe approach.

Actually: Mean imputation reduces the variance of the variable, distorts relationships with other variables, and underrepresents the true uncertainty of the dataset.
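The variance shrinkage is easy to see in a small simulation. This is a minimal sketch with simulated values and a hypothetical 30% missingness rate, not real data: it removes values completely at random, fills the gaps with the observed mean, and compares the variance before and after.

```python
import random
import statistics

rng = random.Random(42)
values = [rng.gauss(50, 10) for _ in range(1000)]

# Remove roughly 30% of the values completely at random (hypothetical rate).
observed = [v if rng.random() >= 0.3 else None for v in values]
present = [v for v in observed if v is not None]

# Mean imputation: every gap receives the mean of the observed values.
fill = statistics.fmean(present)
imputed = [fill if v is None else v for v in observed]

var_observed = statistics.variance(present)
var_imputed = statistics.variance(imputed)

print(f"variance of observed values: {var_observed:.1f}")
print(f"variance after mean imputation: {var_imputed:.1f}")
```

The mean is unchanged, but every imputed value sits exactly on it and contributes nothing to the spread, so the variance of the filled-in column falls roughly in proportion to the share of values that were imputed.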

Common questions

FAQ

What is the risk of Missing Not at Random (MNAR) data?

MNAR data is highly problematic because the reason the data is missing is tied to the values themselves (e.g., participants with high incomes refusing to report them). This creates systematic bias that standard imputation methods, which assume MAR, cannot correct without explicitly modelling the missingness mechanism.

When is listwise deletion acceptable?

Listwise deletion is generally acceptable only when the proportion of missing data is very small (typically under 5%) and the missingness is confirmed to be MCAR, as it is unlikely to introduce meaningful bias under these conditions.
