What is principal component analysis?

Principal component analysis is a statistical technique that reduces the dimensionality of a dataset by transforming its variables into a smaller set of uncorrelated components ordered by how much variance they capture.

What PCA does

PCA finds new axes, the principal components, that are linear combinations of the original variables. The first component points in the direction of greatest variance in the data; the second captures the most remaining variance while being orthogonal to the first, so that the resulting component scores are uncorrelated; and so on. By keeping only the first few components, an analyst can often represent high-dimensional data in two or three dimensions while preserving much of its variation. PCA is an unsupervised technique: it uses the structure of the data itself, not any labels.

How it works

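Mathematically, PCA is computed from the covariance matrix of the mean-centred data (or from the correlation matrix, which is equivalent to standardising each variable first). The eigenvectors of that matrix give the directions of the principal components, and the corresponding eigenvalues give the amount of variance each component explains. Equivalently, PCA can be computed via the singular value decomposition of the centred data matrix: the right singular vectors are the component directions, and the squared singular values are proportional to the eigenvalues.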
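The two routes described above can be compared in a minimal NumPy sketch. The function names `pca_eig` and `pca_svd` and the synthetic data are illustrative assumptions, not part of any library; the sketch also assumes the sample covariance with an n - 1 denominator.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigen-decomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    cov = np.cov(Xc, rowvar=False)           # (n_features, n_features), n - 1 denominator
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort by descending variance
    return eigvals[order], eigvecs[:, order]

def pca_svd(X):
    """PCA via SVD of the centred data matrix."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)      # squared singular values -> variances
    return variances, Vt.T                   # columns are component directions

# Hypothetical data: 200 samples of 3 correlated variables.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

var_eig, dirs_eig = pca_eig(X)
var_svd, dirs_svd = pca_svd(X)
```

Both routes should give the same variances, and the same directions up to a sign flip, since an eigenvector multiplied by -1 is still an eigenvector. In practice the SVD route is often preferred because it avoids forming the covariance matrix explicitly.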

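Because PCA is sensitive to the scale of variables, data is usually standardised first (each variable centred and scaled to unit variance), particularly when the variables are measured in different units. The technique was introduced by Karl Pearson in 1901 and developed by Harold Hotelling in 1933.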
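The sensitivity to scale noted above can be illustrated with a small sketch. The two variables (a height in metres and a weight in grams), their distributions, and the helper `first_component` are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two correlated variables recorded in very different units (synthetic data).
latent = rng.normal(size=n)
height_m = 1.7 + 0.1 * latent + 0.05 * rng.normal(size=n)       # metres
weight_g = 70000 + 10000 * latent + 5000 * rng.normal(size=n)   # grams
X = np.column_stack([height_m, weight_g])

def first_component(data):
    """Direction of the first principal component (unit vector)."""
    centred = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return Vt[0]

# Without standardising, the variable with the largest numeric spread dominates.
raw_pc1 = first_component(X)

# After standardising, each variable has unit variance and contributes comparably.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_pc1 = first_component(Z)

print("PC1 on raw data:         ", np.round(raw_pc1, 4))
print("PC1 on standardised data:", np.round(std_pc1, 4))
```

On the raw data the first component is almost entirely the weight variable, simply because grams produce larger numbers than metres. After standardising, both variables load on the first component with equal magnitude, which is always the case for two standardised, correlated variables.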

What PCA is used for

PCA is widely used to visualise high-dimensional data, to compress it, to reduce noise, and to remove redundancy before further modelling. In research it helps reveal the dominant patterns of variation in a dataset — for example, separating samples by their largest sources of difference. It is a standard exploratory tool across genomics, image analysis, finance research, and the social sciences.
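As a rough sketch of the compression and noise-reduction use, the example below projects synthetic data onto its first k components and reconstructs it. The data (two latent factors observed through ten noisy variables) and the choice k = 2 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 2 latent factors observed through 10 noisy variables.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

mean = X.mean(axis=0)
Xc = X - mean
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                          # number of components kept (assumed)
W = Vt[:k].T                   # (10, k) component directions
scores = Xc @ W                # (300, k) compressed representation
X_hat = scores @ W.T + mean    # reconstruction from k components

# Fraction of total variation lost by keeping only k components.
rel_error = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(Xc) ** 2
discarded = (s[k:] ** 2).sum() / (s ** 2).sum()

print(f"relative reconstruction error: {rel_error:.4f}")
print(f"share of variance discarded:   {discarded:.4f}")
```

The two printed quantities should agree: the squared reconstruction error equals the variance carried by the discarded components. Here those components hold mostly the added noise, which is why dropping them also acts as a simple denoising step.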

Interpretation and limitations

PCA components are mathematical constructs, not necessarily meaningful real-world quantities, so interpreting them requires care. The method captures only linear structure and is sensitive to scaling and outliers; non-linear methods exist for data whose structure linear components cannot capture. Choosing how many components to retain — often guided by the proportion of variance explained — is a judgement call that should be reported, since discarding components inevitably loses some information.
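One common way to apply the variance-explained guide is to keep the smallest number of components whose cumulative share of variance reaches a chosen threshold. The sketch below assumes a 90% threshold, which is an arbitrary illustrative choice; the function name `n_components_for` and the data are hypothetical.

```python
import numpy as np

def n_components_for(X, threshold=0.90):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    ratio = s**2 / np.sum(s**2)                 # share of variance per component
    cumulative = np.cumsum(ratio)
    # First index where the cumulative share is >= threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratio)), ratio            # guard against rounding at threshold = 1.0

# Hypothetical data: five independent variables with very different spreads.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

k, ratio = n_components_for(X, threshold=0.90)
print("explained-variance ratios:", np.round(ratio, 3))
print("components retained:", k)
```

Whatever threshold is used, reporting it alongside the retained ratios lets readers judge how much variation was set aside.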

Key facts

Definition: dimensionality reduction via uncorrelated components
Components: ordered by variance explained
Orthogonality: component directions are mutually orthogonal; component scores are uncorrelated
Computed from: eigen-decomposition of the covariance (or correlation) matrix, or SVD of the centred data
Introduced: Karl Pearson, 1901; Harold Hotelling, 1933
Type: unsupervised, linear technique

Common questions

What is PCA used for?

PCA is used to reduce the number of dimensions in a dataset while keeping most of its variation. This helps with visualising high-dimensional data, compressing it, reducing noise, and removing redundancy before further analysis or modelling.

What is a principal component?

A principal component is a new variable formed as a linear combination of the original variables, chosen to capture as much variance as possible while being uncorrelated with the components before it. The first component captures the most variance, the second the next most, and so on.

What are the limitations of PCA?

PCA captures only linear structure, is sensitive to scaling and outliers, and produces components that may not have a clear real-world meaning. For data with non-linear structure, other dimensionality-reduction methods may be more appropriate.
