  • Principal Component Analysis (PCA) in Research

    Principal component analysis (PCA) is a statistical technique that reduces the dimensionality of a dataset by transforming a set of possibly correlated variables into a smaller set of uncorrelated variables called principal components. Each principal component is a direction in the data along which variance is maximised, and the components are ordered so that the first captures the most variance, the second the next most, and so on. PCA lets researchers summarise high-dimensional data with fewer variables while retaining as much of the original variation as possible.

    The method traces to Karl Pearson (1901), who framed it geometrically as fitting lines and planes to data, and to Harold Hotelling (1933), who developed it independently in the statistical form widely taught today.

    Principal components as directions of maximum variance

    Imagine a cloud of data points scattered in many dimensions. PCA finds the direction along which the points are most spread out (the direction of maximum variance) and calls it the first principal component. The second principal component is the direction of greatest remaining variance that is orthogonal (perpendicular) to the first, and the process continues. The resulting axes are mutually orthogonal, and the data’s coordinates along them (the component scores) are uncorrelated with one another, which is what makes them a convenient new coordinate system for the data.

    The result is a re-expression of the same data in new axes ordered by the variance they capture. Often the first few components account for most of the variance, so the remaining ones can be discarded with little loss of information. That is the dimensionality reduction.
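
    The idea can be sketched on a small synthetic example. The data, the seed and all variable names below are hypothetical and chosen only for illustration; the sketch uses NumPy’s singular value decomposition, which is one common way to obtain the principal directions, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical 2-D data: the second variable is the first plus a little noise,
# so the cloud of points is stretched along a diagonal.
x = rng.normal(size=500)
data = np.column_stack([x, x + 0.1 * rng.normal(size=500)])

# PCA works on mean-centred data.
centred = data - data.mean(axis=0)

# Rows of vt are the principal directions, ordered from most to least variance.
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
first_direction = vt[0]

# Variance of the data along each principal direction.
variances = singular_values**2 / (len(data) - 1)
share_first = variances[0] / variances.sum()

# Dimensionality reduction: keep one coordinate per point instead of two.
scores_1d = centred @ first_direction
```

    Here `share_first` is the fraction of total variance lying along the first component. Because the two variables are almost copies of each other, nearly all of the variance sits on that one axis, so replacing each point with its single entry in `scores_1d` loses very little.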

    Eigenvectors and eigenvalues, conceptually

    Mathematically, the principal components are the eigenvectors of the data’s covariance (or correlation) matrix, and their associated eigenvalues measure how much variance each component captures. You do not need the linear algebra to grasp the idea: eigenvectors point along the directions of the new axes, and the eigenvalue for each tells you how much of the data’s total variation lies along that axis. A larger eigenvalue means the component accounts for more of the variance.
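
    For readers who do want to see the linear algebra at work, the following is a minimal sketch, assuming NumPy and a made-up dataset (the `mixing` matrix and the seed are hypothetical). It checks two claims from the paragraph above: each eigenvalue equals the variance of the data along its eigenvector, and the data expressed in the new axes are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 300 observations of 3 correlated variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
data = rng.normal(size=(300, 3)) @ mixing.T

centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so reorder to put the largest-variance component first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

# Scores: the same data expressed in the new axes.
scores = centred @ eigenvectors
score_cov = np.cov(scores, rowvar=False)

# The diagonal of score_cov holds the eigenvalues; off-diagonals are ~0.
off_diagonal = score_cov - np.diag(np.diag(score_cov))
```

    The eigenvalues also sum to the trace of the covariance matrix, which is why dividing each by the total gives a proportion of variance explained.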

    A practical consequence: PCA is sensitive to the scale of variables. Because it works on variance, a variable recorded in units that produce large numbers (millimetres rather than metres, say) can dominate purely because its numbers are bigger. For this reason researchers usually standardise variables (mean-centre and scale to unit variance) before applying PCA, which is equivalent to running PCA on the correlation matrix rather than the covariance matrix.
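
    The effect of scale can be illustrated with a short sketch. The two variables below are hypothetical and unrelated to each other; they differ only in that one produces numbers a thousand times larger. The helper `variance_shares` is an illustrative name, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two unrelated hypothetical variables with the same underlying spread,
# except that the second is recorded in numbers 1000 times larger.
small = rng.normal(0.0, 1.0, size=400)
large = 1000.0 * rng.normal(0.0, 1.0, size=400)
data = np.column_stack([small, large])

def variance_shares(matrix):
    """Share of total variance per principal component, largest first."""
    eigenvalues = np.linalg.eigvalsh(matrix)[::-1]  # eigvalsh returns ascending
    return eigenvalues / eigenvalues.sum()

# Without scaling, the large-valued variable takes almost all the variance.
raw_shares = variance_shares(np.cov(data, rowvar=False))

# Standardise: mean-centre and scale each variable to unit variance.
standardised = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
std_cov = np.cov(standardised, rowvar=False)
std_shares = variance_shares(std_cov)

# The covariance of standardised data is the correlation matrix of the original.
corr = np.corrcoef(data, rowvar=False)
```

    On the raw data the first component is essentially just the large-valued variable. After standardising, the two components share the variance roughly equally, as expected for unrelated variables, and `std_cov` matches `corr`.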

    Scree plots and choosing components

    A common question is how many components to keep. A scree plot graphs each component’s eigenvalue (or proportion of variance explained) in descending order. Analysts look for an “elbow” where the curve flattens, suggesting that later components add little. Other heuristics include retaining enough components to reach a target cumulative variance (say 80–90%) or keeping components with eigenvalues above a chosen threshold (commonly 1 when the correlation matrix is used, known as the Kaiser criterion). None is definitive; the choice should be justified and reported.

    Decision            | Common approach                         | Caution
    Scaling variables   | Standardise before PCA                  | Skipping it lets large-unit variables dominate
    How many components | Scree-plot elbow or cumulative variance | Heuristics differ; justify the choice
    Interpretation      | Inspect component loadings              | Components need not have a clean real-world meaning
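
    The cumulative-variance heuristic is simple to compute. The sketch below assumes NumPy; the function names, the synthetic dataset (six variables driven by two hidden factors) and the 90% target are all hypothetical choices made for illustration. The values in `ratios` are also what a scree plot would display.

```python
import numpy as np

def explained_variance_ratio(data):
    """Proportion of variance per component, largest first, after standardising."""
    standardised = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    eigenvalues = np.linalg.eigvalsh(np.cov(standardised, rowvar=False))[::-1]
    return eigenvalues / eigenvalues.sum()

def components_for_target(ratios, target=0.9):
    """Smallest number of components whose cumulative variance reaches target."""
    cumulative = np.cumsum(ratios)
    # searchsorted gives the first index where cumulative >= target.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(ratios))  # guard against rounding when target is 1.0

rng = np.random.default_rng(3)

# Hypothetical data: two hidden factors, each driving three observed variables,
# plus a small amount of independent noise.
latent = rng.normal(size=(500, 2))
loadings = np.array([[1.0, 0.8, 0.6, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 0.9, 0.7]])
data = latent @ loadings + 0.1 * rng.normal(size=(500, 6))

ratios = explained_variance_ratio(data)
k = components_for_target(ratios, target=0.9)
```

    With this construction the first two components carry nearly all the variance and the remaining four carry only noise, so the heuristic settles on two. Real data rarely separate this cleanly, which is why the chosen target and the resulting variance explained should be reported alongside the result.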

    Proper use and common misuse

    PCA is well suited to exploratory analysis, visualisation, noise reduction and pre-processing before other methods. It is an unsupervised technique: it ignores any outcome labels, which places it on the unsupervised side of the supervised versus unsupervised learning distinction. It is frequently confused with related but distinct techniques such as factor analysis, which rests on a different statistical model.

    Misuse is common. PCA captures linear structure only, so it can miss non-linear relationships. Components are defined by variance, not by relevance to a research question, so the first component is not necessarily the “most important” for prediction. Reading substantive meaning into components requires care. Performing PCA on a full dataset before splitting it into training and test sets can leak information from the test set into the fitted components, so the means, scaling and components should be estimated on the training data alone. Because these choices affect results, documenting the scaling, the number of components and the variance explained is essential for reproducible analysis. Standardised reporting of methods, supported by a shared vocabulary, helps reviewers assess such decisions; our guidance for authors covers documenting analytical steps.
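
    One way to avoid the leakage problem is to split first and learn every PCA parameter from the training rows only. The following is a minimal sketch under stated assumptions: the functions `fit_pca` and `transform`, the dataset and the 150/50 split are hypothetical, and the code assumes no variable is constant in the training set (a constant column would cause division by zero when scaling).

```python
import numpy as np

def fit_pca(train, n_components):
    """Learn centring, scaling and principal directions from training data only."""
    mean = train.mean(axis=0)
    scale = train.std(axis=0, ddof=1)
    standardised = (train - mean) / scale
    # Rows of vt are the principal directions, largest variance first.
    _, _, vt = np.linalg.svd(standardised, full_matrices=False)
    return mean, scale, vt[:n_components].T

def transform(data, mean, scale, directions):
    """Project data using parameters learned earlier; nothing is re-estimated."""
    return ((data - mean) / scale) @ directions

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # hypothetical dataset

# Split first, then fit on the training rows only.
train, test = data[:150], data[150:]
mean, scale, directions = fit_pca(train, n_components=2)

train_scores = transform(train, mean, scale, directions)
test_scores = transform(test, mean, scale, directions)  # reuses training parameters
```

    The test rows never influence `mean`, `scale` or `directions`, so any model evaluated on `test_scores` sees data prepared exactly as unseen data would be. Libraries that offer separate fit and transform steps support the same pattern.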

    Frequently asked questions

    What does PCA actually do?

    PCA re-expresses correlated variables as a smaller set of uncorrelated principal components, each a direction of maximum remaining variance. By keeping the first few components, you reduce dimensionality while retaining most of the data’s variation, simplifying visualisation and downstream analysis.

    Do I need to standardise variables before PCA?

    Usually yes. PCA is driven by variance, so variables with large numerical ranges can dominate simply because of their units. Mean-centring and scaling to unit variance (equivalent to using the correlation matrix) prevents this, unless all variables are already on the same meaningful scale.

    How many principal components should I keep?

    There is no single rule. Common approaches are finding the elbow in a scree plot, retaining enough components to explain a target percentage of variance, or applying an eigenvalue threshold. Whatever you choose, report it and the variance explained so others can evaluate it.

    Is PCA a form of machine learning?

    PCA is an unsupervised dimensionality-reduction method and is widely used in machine-learning workflows for pre-processing and visualisation. For the broader context, see our overview of machine learning concepts and methods.