What is k-means clustering?

K-means clustering is a popular unsupervised algorithm that partitions data into k groups by repeatedly assigning each point to its nearest cluster centre and then recomputing those centres until the assignment stabilises.

The k-means algorithm

K-means is among the best-known clustering methods. The user fixes the number of clusters, k, in advance. The algorithm then places k centroids, often at random, and alternates two steps: an assignment step, which assigns each point to its nearest centroid, and an update step, which moves each centroid to the mean position of the points assigned to it. These steps repeat until assignments no longer change. The procedure seeks to minimise the total squared distance between each point and the centroid of its cluster.
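The two alternating steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the toy data, the iteration cap, and the rule that an empty cluster keeps its previous centroid are all assumptions made for the example.

```python
import math
import random


def kmeans(points, init_centroids, max_iter=100):
    """Basic k-means on a list of equal-length numeric tuples.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Random start: pick k = 2 distinct data points as the initial centroids.
start = random.Random(0).sample(points, 2)
centroids, labels = kmeans(points, start)
print(labels)
print(centroids)
```

Passing the starting centroids in explicitly keeps the initialisation choice separate from the iteration itself, so different starts can be compared on the same data.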

Choosing k and starting points

Two choices strongly affect the outcome. The number of clusters k must be set beforehand; heuristics such as the "elbow method", which tracks the within-cluster squared distance as k increases and looks for the point of diminishing returns, can guide it, but no rule is definitive.
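As a rough sketch of the elbow heuristic, the code below runs a basic k-means for several values of k and records the within-cluster sum of squares (WCSS) for each. The helper names `kmeans_wcss` and `elbow_curve`, the toy data, and the choice of ten random starts per k are assumptions for illustration only.

```python
import math
import random


def kmeans_wcss(points, k, seed):
    """Run basic k-means from one random start; return the within-cluster sum of squares."""
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(100):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def elbow_curve(points, k_values, restarts=10):
    """Map each k to the lowest WCSS found over several random starts."""
    return {
        k: min(kmeans_wcss(points, k, seed) for seed in range(restarts))
        for k in k_values
    }


# Hypothetical toy data: three tight groups of three points each.
points = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.3),
    (0.0, 9.0), (0.2, 9.3), (0.4, 8.8),
]
curve = elbow_curve(points, range(1, 6))
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

WCSS generally keeps falling as k grows, so the quantity of interest is where the drop flattens out, not where the value is smallest. Reading that point off the curve remains a judgement call.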

The initial placement of centroids also matters, because k-means can settle into a poor local solution. Improved initialisation schemes such as k-means++, and running the algorithm several times from different starts and keeping the best result, are standard ways to mitigate this.
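One widely used improved initialisation is k-means++, which spreads the starting centroids out by favouring points far from those already chosen. The sketch below shows only the seeding step, under the assumption that the data contain at least k distinct points; the function name `kmeanspp_seeds` and the toy data are hypothetical.

```python
import math
import random


def kmeanspp_seeds(points, k, rng):
    """Pick k starting centroids in the style of k-means++.

    Assumes `points` contains at least k distinct points.
    """
    seeds = [rng.choice(points)]  # first seed: uniform at random
    while len(seeds) < k:
        # Squared distance from every point to its nearest seed so far.
        d2 = [min(math.dist(p, s) ** 2 for s in seeds) for p in points]
        # Far-away points are proportionally more likely to be picked next;
        # already-chosen points have weight zero and are not picked again.
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


# Hypothetical toy data: three tight groups of three points each.
points = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.3),
    (0.0, 9.0), (0.2, 9.3), (0.4, 8.8),
]
seeds = kmeanspp_seeds(points, 3, random.Random(0))
print(seeds)
```

The seeds would then be handed to the usual assignment and update loop. Seeding of this kind makes a poor start less likely but does not rule it out, so repeating the run from several seeds is still worthwhile.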

Strengths and limitations

K-means is popular because it is simple, fast, and scales well to large datasets. Its limitations follow from its assumptions: it tends to find roughly spherical, similarly sized clusters and uses straight-line (Euclidean) distance, so it struggles with elongated or irregular shapes and is sensitive to outliers and to feature scaling. It also requires k to be chosen in advance. Because of the sensitivity to scale, features are often standardised before clustering. Where the shape assumptions do not hold, density-based or hierarchical methods may suit the data better.
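The sensitivity to feature scaling is easy to see with a small example. The sketch below standardises each feature to mean 0 and standard deviation 1 (z-scores) and compares distances before and after; the function name `standardise` and the income and age figures are made up for illustration.

```python
import math
import statistics


def standardise(rows):
    """Rescale each feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant feature (zero spread) is mapped to 0.0 to avoid dividing by zero.
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical records: (annual income, age in years).
rows = [(30000, 25), (31000, 60), (60000, 26)]
scaled = standardise(rows)

# Raw distances are dominated by income, the feature with the larger numbers.
print(math.dist(rows[0], rows[1]), math.dist(rows[0], rows[2]))
# After standardising, a 35-year age gap counts about as much as a 30000 income gap.
print(math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```

Whether every feature should carry equal weight is itself a modelling decision, so standardising is a common default rather than a rule.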

K-means in research

K-means is a common first tool for segmenting data and exploring structure, valued for its speed and simplicity. Because its results depend on k, the initialisation, and feature scaling, sound practice reports all three and checks that clusters are stable across runs and interpretable in context. As an unsupervised method it produces groupings, not ground truth, so clusters should be validated rather than assumed meaningful.
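One possible way to check stability across runs is to compare the labelings from two runs with a pair-counting agreement score. The sketch below uses the Rand index, chosen here for illustration and not prescribed by the text above; the function name `rand_index` and the example labelings are assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees if both labelings put the two points in the same cluster,
    or both put them in different clusters. Needs at least two points.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical labelings from two runs on the same four points.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same grouping, cluster numbers swapped
run_3 = [0, 1, 0, 1]  # a different grouping

print(rand_index(run_1, run_2))
print(rand_index(run_1, run_3))
```

Because the score looks only at which points are grouped together, it is unaffected by the arbitrary cluster numbers that differ from run to run. High agreement shows that the clustering is repeatable, not that it is meaningful.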

Key facts

Definition: partitions data into k clusters by nearest centroid
Type: unsupervised, partitional clustering
k: number of clusters, chosen in advance
Iterates: assignment step then centroid-update step
Objective: minimise within-cluster squared distance
Assumes: roughly spherical, similar-sized clusters
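
The objective in the list above can be written out explicitly. The notation is an assumption introduced here for illustration: C_j is the set of points assigned to cluster j, mu_j is its centroid, and x_i is a data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The assignment step cannot increase J for fixed centroids, and the update step cannot increase it for fixed assignments, which is why the alternation settles down.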

Common questions

How does k-means clustering work?

It places k cluster centres, assigns each data point to the nearest centre, then moves each centre to the mean of its assigned points. These two steps repeat until the assignments stop changing, minimising the total squared distance within clusters.

How do you choose the value of k?

The number of clusters k is set in advance. Heuristics such as the elbow method, which looks for the point of diminishing returns as k grows, can help, but the choice ultimately depends on the data and the purpose of the analysis.

What are the limitations of k-means?

K-means assumes roughly spherical, similarly sized clusters and uses straight-line distance, so it struggles with irregular shapes, is sensitive to outliers and feature scaling, and requires k to be chosen beforehand. Other methods may suit such cases better.
