What is unsupervised learning?

Unsupervised learning is the branch of machine learning that finds structure in data without labelled outputs, discovering patterns such as clusters or a lower-dimensional representation directly from the inputs.

Finding structure without labels

Unsupervised learning is given inputs but no correct answers. Its task is to find structure that is present in the data itself: regularities, groupings, or a more compact representation. Because there is no label acting as a teacher, success is harder to define and to measure than in supervised learning; results are often judged by how useful or interpretable the discovered structure is. The approach is especially valuable when labels are expensive or impossible to obtain, which is common with large, raw datasets.

Clustering and dimensionality reduction

The two most common families of unsupervised methods are clustering and dimensionality reduction. Clustering groups similar data points together; k-means and hierarchical clustering, for example, both segment observations into natural groups.
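To make the clustering idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) using only the Python standard library. The function name `kmeans`, the toy `points`, and the deterministic farthest-first starting centroids are assumptions chosen for illustration; real projects would normally rely on an established library and a randomised start.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    # Deterministic farthest-first initialisation (an assumption made here so
    # the sketch is reproducible; random or k-means++ starts are more usual).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[j])  # leave an empty cluster in place
        if updated == centroids:
            break  # centroids stopped moving, so assignments are final
        centroids = updated
    return labels, centroids


# Toy data (assumed): two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, k=2)
```

No labels are supplied at any point: the grouping comes only from distances between the inputs. The number of clusters `k` is still a choice made by the analyst, which is one reason results need checking.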

Dimensionality reduction compresses data into fewer variables while preserving as much structure as possible; principal component analysis is the classic example. Other unsupervised tasks include anomaly detection and density estimation.
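As an illustration of dimensionality reduction, the sketch below finds the first principal component of a small two-variable dataset and compresses each row to a single coordinate along it. It uses power iteration on the sample covariance matrix, which is one simple way to obtain the leading direction, not how production PCA routines are typically implemented. The function names and the toy `rows` are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) of the first principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to its leading eigenvector (assumes non-constant data).
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Compress each row to one coordinate along the given direction."""
    return [
        sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
        for r in rows
    ]


# Toy data (assumed): points lying close to the line y = 2x.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, direction = first_principal_component(rows)
scores = project(rows, mean, direction)
```

Here two variables are replaced by one score per row, and because the points lie almost on a line, very little of the variation is lost.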

Where unsupervised learning fits

Unsupervised learning sits alongside supervised and reinforcement learning as one of the main paradigms of machine learning. In practice the boundaries blur: self-supervised learning, which generates its own training signal from unlabelled data, underpins much modern pre-training of large models, and semi-supervised methods combine a little labelled data with much unlabelled data. The defining feature of the unsupervised setting remains the absence of explicit, human-provided output labels.
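The idea of generating a training signal from unlabelled data can be shown with a toy sketch: an unlabelled word sequence is turned into (context, target) pairs, where the target is simply the next word. The function name `next_token_pairs` and the example sentence are hypothetical, and real pre-training pipelines are far more elaborate.

```python
def next_token_pairs(tokens, context_size=3):
    """Turn an unlabelled sequence into (context, target) training pairs."""
    return [
        (tuple(tokens[i - context_size:i]), tokens[i])
        for i in range(context_size, len(tokens))
    ]


# Unlabelled input (assumed): nobody has annotated this sentence.
text = "the cat sat on the mat".split()
pairs = next_token_pairs(text, context_size=2)
```

Every target in `pairs` comes from the data itself, so a model can be trained with a supervised-style objective without any human-provided labels.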

Unsupervised learning in research

In research, unsupervised methods are mainly exploratory: they generate hypotheses, reveal hidden groupings, and reduce complexity before further analysis. Their findings need care, because clusters can be artefacts of the chosen algorithm or parameters rather than real structure, and there is no ground truth to validate against directly. Sound practice tests stability across methods and settings, reports the choices made, and treats discovered structure as a hypothesis to be confirmed, not a conclusion.
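One simple way to test stability is to compare the cluster assignments produced by different runs or methods. The sketch below uses the Rand index, the fraction of point pairs on which two clusterings agree, which ignores how the clusters happen to be numbered. The three labelings are hypothetical stand-ins for real runs; a chance-corrected variant, the adjusted Rand index, is often preferred in practice.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    if len(labels_a) < 2:
        raise ValueError("need at least two points to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    # A pair counts as agreement when both clusterings put the two points
    # together, or both put them apart.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical labelings of the same eight points from three runs.
run_1 = [0, 0, 0, 0, 1, 1, 1, 1]
run_2 = [1, 1, 1, 1, 0, 0, 0, 0]  # same grouping, cluster ids swapped
run_3 = [0, 0, 1, 1, 1, 1, 0, 0]  # a genuinely different grouping

same = rand_index(run_1, run_2)
different = rand_index(run_1, run_3)
```

A value near 1 across seeds, parameter settings, or algorithms suggests the grouping is not an artefact of one particular choice; a low value is a warning that the structure may not be real.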

Key facts

At a glance

Field: subtype of machine learning
Core idea: find structure in unlabelled data
Clustering: groups similar data points
Dimensionality reduction: compresses to fewer variables (e.g. PCA)
Evaluation: no ground-truth labels to evaluate against directly
When to use: labels are scarce or costly

Common questions

FAQ

What is unsupervised learning used for?

It is used to explore data and find hidden structure: grouping similar items through clustering, compressing data through dimensionality reduction, detecting anomalies, and estimating distributions. It is valuable when labelled data is unavailable or expensive.

How is unsupervised learning evaluated?

Because there are no labels, there is no direct measure of correctness. Results are judged by internal measures of cluster quality (such as the silhouette score), stability across settings, and how useful or interpretable the discovered structure proves in downstream tasks.
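As an example of an internal measure, the sketch below computes the mean silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. It needs only the points and the assigned cluster labels, not ground truth. The toy `points` and both labelings are assumptions for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed): two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])   # matches the groups
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # cuts across them
```

Values close to 1 indicate compact, well-separated clusters. A high score shows internal consistency only; it does not show that the clusters are meaningful for the question being asked.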

Is unsupervised learning the same as self-supervised learning?

Not quite, though they are closely related. Self-supervised learning also uses unlabelled data, but it creates its own training targets from the data itself and then learns to predict them; it is widely used to pre-train large models. Both avoid human-provided output labels.
