Tag: data analysis

  • Principal Component Analysis (PCA) in Research

    Principal component analysis (PCA) is a statistical technique that reduces the dimensionality of a dataset by transforming a set of possibly correlated variables into a smaller set of uncorrelated variables called principal components. Each principal component is a direction in the data along which variance is maximised, and the components are ordered so that the first captures the most variance, the second the next most, and so on. PCA lets researchers summarise high-dimensional data with fewer variables while retaining as much of the original variation as possible.

    The method traces to Karl Pearson (1901), who framed it geometrically as fitting lines and planes to data, and to Harold Hotelling (1933), who developed it independently in the statistical form widely taught today.

    Principal components as directions of maximum variance

    Imagine a cloud of data points scattered in many dimensions. PCA finds the direction along which the points are most spread out (the direction of maximum variance) and calls it the first principal component. The second principal component is the direction of greatest remaining variance that is orthogonal (perpendicular) to the first, and the process continues. The resulting directions are mutually orthogonal, and the data’s coordinates along them (the component scores) are uncorrelated, which is what makes them a convenient new coordinate system for the data.

    The result is a re-expression of the same data in new axes ordered by importance. Often the first few components account for most of the variance, so the remaining ones can be discarded with little information loss — that is the dimensionality reduction.

    Eigenvectors and eigenvalues, conceptually

    Mathematically, the principal components are the eigenvectors of the data’s covariance (or correlation) matrix, and their associated eigenvalues measure how much variance each component captures. You do not need the linear algebra to grasp the idea: eigenvectors point along the directions of the new axes, and the eigenvalue for each tells you how much of the data’s total variation lies along that axis. A larger eigenvalue means the component accounts for more of the variance.
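
    As a rough illustration of the eigenvector view, the sketch below carries out PCA by hand with NumPy. The data values, the function name `pca_eig` and the use of `numpy.linalg.eigh` are assumptions made for this example rather than a prescribed implementation.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix.

    X is an (n_samples, n_variables) array. Returns the eigenvalues in
    descending order, the matching eigenvectors as columns, and the data
    projected onto those eigenvectors (the component scores).
    """
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)                     # PCA works on mean-centred data
    cov = np.cov(centred, rowvar=False)              # sample covariance (divisor n - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]            # re-order so the largest comes first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = centred @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Hypothetical data: 6 observations of 3 correlated variables.
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.0],
])

eigenvalues, eigenvectors, scores = pca_eig(X)
explained = eigenvalues / eigenvalues.sum()   # proportion of variance per component
print(explained)
```

    The eigenvalues sum to the total variance of the original variables, and the covariance matrix of the scores is diagonal, which is the sense in which the components are uncorrelated.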

    A practical consequence: PCA is sensitive to the scale of variables. Because it works on variance, a variable measured in large units can dominate purely because its numbers are bigger. For this reason researchers usually standardise variables (mean-centre and scale to unit variance) before applying PCA, which is equivalent to using the correlation matrix.
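
    The equivalence between standardising and using the correlation matrix can be checked numerically. The following is a minimal sketch with NumPy; the height and mass values are invented for illustration.

```python
import numpy as np

# Hypothetical measurements: height in millimetres and mass in kilograms.
X = np.array([
    [1620.0, 58.0],
    [1750.0, 72.0],
    [1680.0, 65.0],
    [1810.0, 80.0],
    [1700.0, 77.0],
])

# Raw covariance: the millimetre column has a far larger variance purely
# because of its units, so it would dominate an unscaled PCA.
raw_cov = np.cov(X, rowvar=False)

# Standardise: subtract each column mean and divide by its sample SD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance of the standardised data equals the correlation matrix
# of the original data.
standardised_cov = np.cov(Z, rowvar=False)
correlation = np.corrcoef(X, rowvar=False)

print(raw_cov.diagonal())                           # variances in original units
print(np.allclose(standardised_cov, correlation))   # True
```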

    Scree plots and choosing components

    A common question is how many components to keep. A scree plot graphs each component’s eigenvalue (or proportion of variance explained) in descending order. Analysts look for an “elbow” where the curve flattens, suggesting that later components add little. Other heuristics include retaining enough components to reach a target cumulative variance (say 80–90%) or keeping components with eigenvalues above a chosen threshold. None is definitive; the choice should be justified and reported.
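
    The heuristics above are simple to compute once the eigenvalues are known. This sketch uses only the Python standard library; the eigenvalues, the function names and the default threshold of 1.0 are assumptions chosen for illustration.

```python
from itertools import accumulate

def explained_variance_ratio(eigenvalues):
    """Share of total variance attributed to each component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

def components_for_target(eigenvalues, target=0.9):
    """Smallest number of leading components whose cumulative share of
    variance reaches `target` (eigenvalues must be sorted descending)."""
    cumulative = accumulate(explained_variance_ratio(eigenvalues))
    for k, share in enumerate(cumulative, start=1):
        if share >= target - 1e-12:   # small tolerance for float rounding
            return k
    return len(eigenvalues)

def components_above_threshold(eigenvalues, threshold=1.0):
    """Count components whose eigenvalue exceeds `threshold`.
    The default of 1.0 is an assumption for this sketch."""
    return sum(1 for value in eigenvalues if value > threshold)

# Hypothetical eigenvalues from a PCA on five standardised variables.
eigenvalues = [2.5, 1.5, 0.5, 0.3, 0.2]

ratios = explained_variance_ratio(eigenvalues)   # the values a scree plot would show
k_80 = components_for_target(eigenvalues, target=0.8)
k_90 = components_for_target(eigenvalues, target=0.9)
k_threshold = components_above_threshold(eigenvalues)
print(ratios, k_80, k_90, k_threshold)
```

    With these made-up eigenvalues the three heuristics do not all agree: the 80% target and the threshold each keep two components, while the 90% target keeps three. That disagreement is why the choice needs to be justified and reported.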

    Decision            | Common approach                          | Caution
    Scaling variables   | Standardise before PCA                   | Skipping it lets large-unit variables dominate
    How many components | Scree-plot elbow or cumulative variance  | Heuristics differ; justify the choice
    Interpretation      | Inspect component loadings               | Components need not have a clean real-world meaning

    Proper use and common misuse

    PCA is well suited to exploratory analysis, visualisation, noise reduction and pre-processing before other methods. It is an unsupervised technique: it ignores any outcome labels, which places it on the unsupervised side of the distinction between supervised and unsupervised learning methods. It is frequently confused with related but distinct techniques such as factor analysis, which has a different statistical model.

    Misuse is common. PCA captures linear structure only, so it can miss non-linear relationships. Components are defined by variance, not by relevance to a research question, so the first component is not necessarily the “most important” for prediction. Reading substantive meaning into components requires care, and performing PCA on a full dataset before splitting into training and test sets can leak information. Because these choices affect results, documenting the scaling, the number of components and the variance explained is essential for reproducible analysis. Standardised reporting of methods, supported by a shared vocabulary, helps reviewers assess such decisions; our guidance for authors covers documenting analytical steps.
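
    One way to avoid the leakage described above is to split first and learn every PCA parameter from the training rows alone. The sketch below assumes NumPy; the random data, the 30/10 split and the helper names `fit_pca` and `transform` are hypothetical.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn centring, scaling and component directions from training data only."""
    mean = X_train.mean(axis=0)
    scale = X_train.std(axis=0, ddof=1)
    Z = (X_train - mean) / scale
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigenvalues)[::-1][:n_components]   # largest eigenvalues first
    return {"mean": mean, "scale": scale, "components": eigenvectors[:, order]}

def transform(model, X):
    """Project any data using the parameters learned from the training set."""
    return ((X - model["mean"]) / model["scale"]) @ model["components"]

# Hypothetical data: 40 observations of 4 variables, generated with a fixed seed.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
X_train, X_test = X[:30], X[30:]            # split first ...

model = fit_pca(X_train, n_components=2)    # ... then fit on the training rows only
train_scores = transform(model, X_train)
test_scores = transform(model, X_test)      # test rows never influence the fit
```

    The test rows are centred and scaled with the training mean and standard deviation, so nothing about the test set shapes the components.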

    Frequently asked questions

    What does PCA actually do?

    PCA re-expresses correlated variables as a smaller set of uncorrelated principal components, each a direction of maximum remaining variance. By keeping the first few components, you reduce dimensionality while retaining most of the data’s variation, simplifying visualisation and downstream analysis.

    Do I need to standardise variables before PCA?

    Usually yes. PCA is driven by variance, so variables with large numerical ranges can dominate simply because of their units. Mean-centring and scaling to unit variance (equivalent to using the correlation matrix) prevents this, unless all variables are already on the same meaningful scale.

    How many principal components should I keep?

    There is no single rule. Common approaches are finding the elbow in a scree plot, retaining enough components to explain a target percentage of variance, or applying an eigenvalue threshold. Whatever you choose, report it and the variance explained so others can evaluate it.

    Is PCA a form of machine learning?

    PCA is an unsupervised dimensionality-reduction method and is widely used in machine-learning workflows for pre-processing and visualisation. For the broader context, see our overview of machine learning concepts and methods.

  • Standard Deviation in Research: A Clear Statistical Definition

    Standard deviation is a measure of how spread out a set of values is around its mean. It expresses, in the original units of the data, the typical distance of an observation from the average. A small standard deviation means values cluster tightly around the mean; a large standard deviation means they are widely dispersed. It is one of the most widely reported summary statistics in quantitative research because it captures variability that a mean alone conceals.

    Standard deviation and the mean

    Two datasets can share an identical mean yet behave very differently. Consider two classes whose mean test score is 70. In the first, scores fall between 68 and 72; in the second, they range from 40 to 100. Both means are 70, but the second class is far more variable. The standard deviation quantifies that difference, which is why reporting a mean without a measure of spread is incomplete.

    Standard deviation is the square root of the variance. Variance is the average of the squared deviations of each value from the mean. Squaring removes negative signs and emphasises larger departures, but it also leaves variance in squared units. Taking the square root returns the figure to the original units, making standard deviation the more interpretable companion to the mean.
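
    The two-class comparison can be made concrete with Python's standard `statistics` module. The individual scores below are invented to match the description (both classes average 70), and each class is treated as a complete population for simplicity.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical scores chosen so that both classes average 70.
class_a = [68, 69, 70, 71, 72]
class_b = [40, 55, 70, 85, 100]

for name, scores in (("A", class_a), ("B", class_b)):
    variance = pvariance(scores)   # average squared deviation (squared units)
    sd = pstdev(scores)            # square root of the variance (original units)
    print(name, mean(scores), variance, round(sd, 2))
```

    Both classes report the same mean, but the standard deviation of class B is many times that of class A.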

    Population versus sample

    The formula differs depending on whether the data represent an entire population or a sample drawn from one. The population standard deviation divides the sum of squared deviations by N, the number of values, and then takes the square root. The sample standard deviation divides by n minus 1 rather than n before taking the square root. This adjustment, known as Bessel’s correction, compensates for the tendency of a sample to underestimate the spread of the population it came from. Because most research analyses a sample and infers something about a wider population, the sample formula with n minus 1 is the one most often applied.

    Quantity                      | Divisor | Used when
    Population standard deviation | N       | Every member of the population is measured
    Sample standard deviation     | n − 1   | A sample is used to estimate the population
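
    Written out, the two versions look as follows. The symbols are conventional notation rather than anything defined in this article: σ and μ for the population standard deviation and mean, s and x̄ for their sample counterparts.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```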

    The 68-95-99.7 rule

    When data follow a normal (bell-shaped) distribution, standard deviation maps onto predictable proportions of the data. This is the empirical rule, often called the 68-95-99.7 rule. Approximately 68% of values fall within one standard deviation of the mean, about 95% fall within two standard deviations, and roughly 99.7% fall within three. These figures hold only for a normal distribution and are approximations for real data that merely resemble one; skewed or heavy-tailed distributions will not obey them.

    Range from the mean    | Approximate share of data (normal distribution)
    ±1 standard deviation  | 68%
    ±2 standard deviations | 95%
    ±3 standard deviations | 99.7%
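
    These proportions can be reproduced from the normal distribution itself. A minimal sketch using `statistics.NormalDist` from the Python standard library (the helper name `share_within` is made up for this example):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.4f}")   # 0.6827, 0.9545, 0.9973
```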

    A worked conceptual example

    Suppose adult resting heart rates in a sample have a mean of 70 beats per minute and a standard deviation of 8. If the distribution is roughly normal, then about 68% of people in that sample have a resting rate between 62 and 78 (the mean plus or minus one standard deviation). About 95% fall between 54 and 86 (two standard deviations), and almost everyone, around 99.7%, falls between 46 and 94 (three standard deviations). A reading of 100 would lie more than three standard deviations above the mean and would therefore be unusual relative to this sample. Examining such extreme values links directly to outlier detection, a related step in data quality assessment.
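
    The arithmetic in this example is easy to verify. The short sketch below recomputes the three ranges and expresses the reading of 100 as a number of standard deviations from the mean (a z-score); the function names are illustrative.

```python
mean_rate = 70   # beats per minute, from the example above
sd_rate = 8

def interval(k):
    """Range covering the mean plus or minus k standard deviations."""
    return (mean_rate - k * sd_rate, mean_rate + k * sd_rate)

def z_score(value):
    """How many standard deviations a value lies from the mean."""
    return (value - mean_rate) / sd_rate

print(interval(1), interval(2), interval(3))   # (62, 78) (54, 86) (46, 94)
print(z_score(100))                            # 3.75
```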

    Standard deviation versus standard error

    A frequent source of confusion is the difference between standard deviation and standard error. Standard deviation describes the variability of individual observations in the data. The standard error of the mean describes the variability of the sample mean itself as an estimate of the population mean, and it equals the standard deviation divided by the square root of the sample size. Because dividing by the square root of n shrinks it, the standard error is smaller than the standard deviation for any sample of more than one observation, and it shrinks further as the sample grows.
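
    To see how the standard error depends on sample size, the sketch below reuses the standard deviation of 8 from the heart-rate example. The sample sizes are hypothetical, as is the function name.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)

sd = 8  # SD from the heart-rate example; the sample sizes below are hypothetical
for n in (1, 16, 64, 256):
    print(n, standard_error(sd, n))   # 8.0, 2.0, 1.0, 0.5
```

    Quadrupling the sample size halves the standard error, while the standard deviation of the observations themselves is unchanged.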

    The choice between them depends on what is being communicated. To describe how much individuals differ from one another, report the standard deviation. To express how precisely the mean has been estimated, report the standard error or, more informatively, a confidence interval. Reporting a standard error where a standard deviation is meant can mislead readers into thinking data are far less variable than they are. For practical reporting conventions, see the CASRAI author guidance and the CASRAI dictionary.

    Frequently asked questions

    Why divide by n minus 1 for a sample?

    Dividing by n minus 1 corrects a bias: using the sample mean to centre the data slightly reduces the spread, so dividing by the smaller divisor produces an unbiased estimate of the population variance. This is Bessel’s correction.
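
    The bias can be demonstrated without simulation by listing every possible sample from a tiny population. The sketch below uses a made-up population of five values and all ordered samples of size 2 drawn with replacement; both choices are assumptions made to keep the enumeration small.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population of five values; its variance (divisor N) is 4.
population = [4, 8, 6, 5, 2]

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_divisor_n = mean(pvariance(s) for s in samples)          # divides by n
avg_divisor_n_minus_1 = mean(variance(s) for s in samples)   # divides by n - 1

print(pvariance(population), avg_divisor_n, avg_divisor_n_minus_1)
```

    Averaged over all 25 samples, the n − 1 version recovers the population variance of 4, whereas the divide-by-n version averages only 2.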

    Can standard deviation be negative?

    No. It is a square root of an average of squared quantities, so it is always zero or positive. A standard deviation of zero means every value is identical to the mean.

    Should I report standard deviation or standard error?

    Report the standard deviation to describe variability among observations, and the standard error or a confidence interval to describe the precision of the mean. For wider context on variability and uncertainty, see our guide to confidence intervals and the reproducibility news category.

  • Variance in Statistics: Definition and Formula

    Variance is a measure of how spread out a set of values is, defined as the average of the squared deviations of each value from the mean. A large variance means the data points are widely dispersed; a small variance means they cluster tightly around the mean. Because the deviations are squared, variance is always non-negative and is expressed in squared units of the original measurement.

    The definition of variance

    To calculate variance, you first find the mean of the data, then subtract the mean from each value to get the deviations. Squaring each deviation removes the sign (so positive and negative deviations do not cancel) and gives greater weight to values far from the mean. The average of these squared deviations is the variance.

    Variance is the foundation of many statistical methods, including the analysis of variance (ANOVA), regression diagnostics and the construction of confidence intervals. Reporting it transparently supports the goals set out in our reproducibility coverage.

    Population variance versus sample variance

    The formula depends on whether your data are the entire population or a sample drawn from it. For a population, you divide the sum of squared deviations by the number of values, N. For a sample, you divide by n − 1 instead of n. This adjustment, known as Bessel’s correction, produces an unbiased estimate of the population variance: deviations are measured from the sample mean, which sits closer to the sample’s own values than the true population mean does, so dividing by n would underestimate the spread on average.

    Quantity            | Symbol | Divisor
    Population variance | σ²     | N
    Sample variance     | s²     | n − 1

    A worked conceptual example

    Suppose five replicate measurements give 4, 8, 6, 5 and 2. The mean is (4 + 8 + 6 + 5 + 2) / 5 = 5. The deviations from the mean are −1, 3, 1, 0 and −3. Squaring these gives 1, 9, 1, 0 and 9, which sum to 20. Treating the five values as a population, the variance is 20 / 5 = 4. Treating them as a sample, the variance is 20 / 4 = 5. The sample figure is slightly larger, reflecting Bessel’s correction.
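
    The same steps can be followed in code. This is a minimal sketch that mirrors the arithmetic above and cross-checks it against Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import pvariance, variance

values = [4, 8, 6, 5, 2]   # the five replicate measurements from the example

mean_value = sum(values) / len(values)            # 5.0
deviations = [v - mean_value for v in values]     # [-1.0, 3.0, 1.0, 0.0, -3.0]
squared = [d ** 2 for d in deviations]            # [1.0, 9.0, 1.0, 0.0, 9.0]
sum_of_squares = sum(squared)                     # 20.0

population_variance = sum_of_squares / len(values)      # 20 / 5 = 4.0
sample_variance = sum_of_squares / (len(values) - 1)    # 20 / 4 = 5.0

# Cross-check against the standard library.
assert population_variance == pvariance(values)
assert sample_variance == variance(values)
```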

    Variance and the standard deviation

    Variance and the standard deviation describe the same property of spread, but in different units. The standard deviation is simply the square root of the variance, which returns the measure to the original units of the data. In our worked example the population standard deviation is √4 = 2. Because the standard deviation is easier to interpret alongside the mean, it is often reported in papers; see our companion piece on the standard deviation for detail. Variance, however, has convenient mathematical properties, which is why it underlies so many statistical procedures.

    Interpreting variance correctly

    Because variance is in squared units, its absolute size is hard to interpret in isolation. A variance of 4 cm² is meaningful only relative to the scale of the measurement. Variance is also sensitive to outliers: squaring magnifies the effect of extreme values, so a single anomalous point can inflate the variance substantially. Always inspect your data distribution before reporting variance, and define the term consistently in your methods. The CASRAI dictionary and our author guidance encourage precise, reproducible statistical reporting.
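
    The effect of a single extreme value is easy to demonstrate. The sketch below takes the five measurements from the worked example and appends one anomalous reading of 30, which is a hypothetical value invented for this illustration.

```python
from statistics import variance

clean = [4, 8, 6, 5, 2]        # the worked-example measurements
with_outlier = clean + [30]    # one hypothetical anomalous reading

print(float(variance(clean)))            # 5.0
print(round(variance(with_outlier), 2))  # 108.17
```

    One added point raises the sample variance more than twentyfold, which is why inspecting the distribution before reporting matters.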

    Frequently asked questions

    Why is variance squared rather than absolute?

    Squaring the deviations keeps the measure mathematically tractable and differentiable, which makes it the natural basis for least squares estimation and many other techniques. The absolute deviation is an alternative but lacks these convenient properties.

    When should I divide by n − 1 instead of n?

    Divide by n − 1 whenever your data are a sample used to estimate the variance of a wider population. Divide by N only when your data genuinely represent the entire population of interest.

    Is a high variance bad?

    Not inherently. High variance simply means greater spread. Whether that is good or bad depends on context: high variance in measurement error is undesirable, but natural biological variation may be expected and informative.