What is clustering?

Clustering is an unsupervised machine-learning task that groups a set of data points so that those in the same group are more similar to each other than to those in other groups, revealing natural structure without labels.

Grouping by similarity

Clustering takes a set of data points and a notion of similarity (or distance) between them, and arranges the points into groups so that members of a group are alike and different groups are distinct. It is a core unsupervised learning task: there are no labels telling the algorithm what the correct groups are, so it must find structure that is intrinsic to the data. The choice of measure, such as Euclidean distance or cosine similarity, strongly shapes the clusters that emerge, because two points that are close under one measure can be far apart under another.
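
As a small illustration of how the measure changes what counts as "similar", the sketch below compares Euclidean distance and cosine similarity on three made-up vectors. The vectors and variable names are hypothetical, chosen only to make the contrast visible.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors for three documents.
short_doc = [1.0, 2.0]
long_doc = [10.0, 20.0]   # same proportions as short_doc, ten times longer
other_doc = [2.0, 1.0]    # similar length to short_doc, different mix

# Under Euclidean distance, short_doc is nearer to other_doc.
d_long = euclidean_distance(short_doc, long_doc)
d_other = euclidean_distance(short_doc, other_doc)

# Under cosine similarity, short_doc is nearer to long_doc.
s_long = cosine_similarity(short_doc, long_doc)
s_other = cosine_similarity(short_doc, other_doc)

print(f"Euclidean: to long_doc={d_long:.2f}, to other_doc={d_other:.2f}")
print(f"Cosine:    to long_doc={s_long:.2f}, to other_doc={s_other:.2f}")
```

Here the two measures disagree about which document is the nearest neighbour of short_doc, so a clustering built on one could group the documents differently from a clustering built on the other.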

Families of clustering method

Clustering methods differ in how they define a cluster. Partitional methods such as k-means divide the data into a number of non-overlapping groups that is fixed in advance (the k in k-means), each represented by a central point, typically the mean of its members.
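
The idea behind k-means can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the toy points, and the deterministic farthest-point initialisation are all assumptions made here so that the example is reproducible. Random or k-means++ initialisation is more common in practice.

```python
import math

def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the
    point farthest from those already chosen (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iterations=100):
    """Alternate assignment and update steps until labels stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this toy data the first three points end up in one group and the last three in the other. With less clearly separated data, the result can depend on the starting centroids.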

Hierarchical methods build a nested tree of clusters, which can be cut at any level to give a chosen number of groups. Density-based methods, such as DBSCAN, define clusters as dense regions separated by sparse ones, and can find irregularly shaped clusters and label outliers as noise.
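
To make the density-based idea concrete, here is a minimal DBSCAN-style sketch in plain Python. The parameter names eps (neighbourhood radius) and min_pts (neighbours needed for a dense region), the toy points, and the brute-force neighbour search are assumptions for illustration; library implementations use spatial indexes and differ in detail.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, with -1 marking noise."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical data: an elongated chain, a compact square, and one outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The chain is recovered as a single cluster even though its two ends are far apart, and the isolated point is labelled as noise instead of being forced into a group. Both outcomes depend on the chosen eps and min_pts.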

Choosing and judging clusters

A recurring difficulty is deciding how many clusters there are, since the "right" number is often not obvious and depends on the purpose. Heuristics such as the elbow method and internal measures of cluster quality such as the silhouette score help, but none is definitive. A deeper challenge is that clustering will always return groups, whether or not meaningful structure exists, and different algorithms or parameters can produce quite different clusterings of the same data. Results are therefore best treated as hypotheses about structure rather than definitive groupings.
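
One widely used internal measure is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is a plain-Python illustration on hypothetical points and hand-written labelings. It shows how such a score can rank candidate groupings, not how any particular library computes it.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient (range -1 to 1); needs two or more clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(mean_dist(p, m) for other, m in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two tight groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
two_groups = [0, 0, 0, 1, 1, 1]
three_groups = [0, 0, 1, 2, 2, 2]  # splits the first group arbitrarily

score_two = silhouette_score(points, two_groups)
score_three = silhouette_score(points, three_groups)
print(f"two groups: {score_two:.2f}, three groups: {score_three:.2f}")
```

Both labelings are valid outputs of some clustering procedure. The score favours the two-group version here, but a higher score is evidence about compactness and separation only, not proof that the groups are meaningful.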

Clustering in research

Clustering is widely used in research to explore data, segment populations, and generate hypotheses, for example by grouping genes, documents, or survey respondents. Because there is no ground truth, findings need careful validation: testing stability across methods and parameters, checking that clusters are interpretable, and confirming them against independent evidence. Reporting the algorithm, distance measure, and number of clusters chosen is essential for the analysis to be reproducible.
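
One simple way to quantify stability is to compare the labelings produced by two runs using the Rand index, the fraction of point pairs on which the two clusterings agree. The sketch below uses hypothetical label lists. In practice a chance-corrected variant such as the adjusted Rand index is often preferred.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree (0.0 to 1.0)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    # A pair counts as agreement if both clusterings put the two points
    # together, or both put them apart. Cluster ids themselves are ignored.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical labels from three runs on the same six points.
run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [1, 1, 1, 0, 0, 0]   # same grouping, cluster ids swapped
run_3 = [0, 0, 1, 1, 2, 2]   # a different grouping

same = rand_index(run_1, run_2)
different = rand_index(run_1, run_3)
print(f"run_1 vs run_2: {same:.2f}, run_1 vs run_3: {different:.2f}")
```

Because the comparison works on pairs of points, it is unaffected by how each run happens to number its clusters, which makes it suitable for comparing different algorithms or parameter settings on the same data.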

Key facts

At a glance

Definition: grouping similar data points without labels
Type: unsupervised learning task
Depends on: a similarity or distance measure
Partitional methods: e.g. k-means
Hierarchical methods: build a nested tree of clusters
Density-based methods: e.g. DBSCAN (find irregular clusters, noise)

Common questions

FAQ

What is clustering used for?

Clustering is used to discover natural groupings in unlabelled data: segmenting customers, grouping documents or genes, detecting anomalies, and exploring structure before further analysis. It is a common first step in understanding a new dataset.

What are the main types of clustering?

The main families are partitional methods such as k-means, hierarchical methods that build a tree of nested clusters, and density-based methods such as DBSCAN that find dense regions and can detect outliers. They differ in how they define a cluster.

How do you decide the number of clusters?

There is no single correct answer; the number depends on the data and purpose. Heuristics and internal quality measures can guide the choice, but results should be checked for stability and interpretability rather than taken as definitive.
