K-Means & Clustering

Discover hidden groups in unlabelled data. Learn how k-means iterates to convergence, how to choose k, and when DBSCAN or hierarchical clustering is the better tool.

Learn K-Means & Clustering in our free AI & Machine Learning course — a beginner-friendly interactive lesson with worked examples, a practice exercise and a…

Part of the free AI & Machine Learning course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

Imagine guests arriving at a party with no name tags. You want to seat people who get along together. You place a few empty tables (the centroids ), send each guest to the nearest table, then nudge each table toward the middle of the guests who chose it. Repeat, and the tables drift to the natural social groups.

That is exactly k-means: assign each point to its nearest centroid, move each centroid to the average of its points, and repeat until nobody changes tables.

In unsupervised learning there are no labels — only features. The goal is to find structure. Clustering is the most common task: group points so that members of a group are similar to each other and different from other groups.

k-means is the classic clustering algorithm. You pick k , the number of clusters, then repeat two steps until convergence:

The smart k-means++ initialisation spreads the starting centroids far apart, which avoids bad local optima you can get from naive random seeds.

Let's run a single k-means iteration by hand: assign each point to its nearest centroid, then move each centroid to the mean of its points. Watch how the centroids jump toward the data.

The hardest part of k-means is choosing k . Two standard tools help:

Here's k-means with scikit-learn's KMeans , after scaling. Study it (it isn't runnable in the in-browser sandbox). Notice there is no y — only features — and that silhouette_score gives you a quality number to compare different k.

Cluster label numbers are arbitrary — only the grouping matters, so the same clusters might be labelled 0/1 or 1/0 on different runs.

k-means assumes round, similarly-sized blobs and forces you to pick k. Two alternatives lift those limits:

Fill in the blank so nearest() returns the index of the closest centroid. The expected output is in the comments.

A centroid is just the mean position of its points. Fill in the blank to compute it.

Given pretend inertia values for several k, compute the drop at each step and print where the elbow is. Only a comment outline is provided.

These are the classic clustering pitfalls. Watch for them.

An unscaled large-range feature dominates the Euclidean distance and warps the clusters.

Inertia always falls as k rises, so this just pushes k toward the number of rows.

✅ Fix: use the elbow or the silhouette score:

k-means assumes round blobs, so it splits rings, crescents, and elongated shapes badly.

✅ Fix: use DBSCAN (or spectral clustering) for odd shapes:

You can now run k-means by hand, choose k with the elbow and silhouette, and reach for DBSCAN or hierarchical clustering when the clusters aren't simple blobs.

🚀 Up next: Dimensionality Reduction with PCA — compress many features into a few while keeping the signal.

Practice quiz

What makes clustering an 'unsupervised' learning task?

  • It uses labelled data
  • It groups data without any labels
  • It always needs a test set
  • It predicts a continuous value

Answer: It groups data without any labels. Clustering finds structure in unlabelled data — there are no target labels to learn from, just the features.

What does the 'k' in k-means represent?

  • The number of features
  • The number of clusters you choose in advance
  • The number of iterations
  • The learning rate

Answer: The number of clusters you choose in advance. k is the number of clusters; you must pick it before running k-means, which is one of its main limitations.

What does one iteration of the k-means algorithm do?

  • Assign points to nearest centroid, then move each centroid to its points' mean
  • Build a decision tree
  • Compute a kernel matrix
  • Sort the data

Answer: Assign points to nearest centroid, then move each centroid to its points' mean. k-means alternates two steps: assign each point to the nearest centroid, then recompute each centroid as the mean of its points.

What is the 'elbow method' used for?

  • Scaling features
  • Choosing a good value of k
  • Removing outliers
  • Initialising centroids

Answer: Choosing a good value of k. The elbow method plots within-cluster error against k and looks for the 'elbow' where adding clusters stops helping much.

What does the silhouette score measure?

  • Training time
  • The number of features
  • How well each point fits its cluster vs the nearest other cluster
  • The depth of a tree

Answer: How well each point fits its cluster vs the nearest other cluster. Silhouette compares how close a point is to its own cluster versus the nearest neighbour cluster; higher (near 1) is better.

What is a key advantage of DBSCAN over k-means?

  • It requires choosing k in advance
  • It can find arbitrarily-shaped clusters and label noise as outliers
  • It is always faster
  • It needs labelled data

Answer: It can find arbitrarily-shaped clusters and label noise as outliers. DBSCAN groups dense regions, finds non-spherical clusters, marks sparse points as noise, and does not need k up front.

Why should you usually scale features before k-means?

  • k-means ignores scale
  • Distances dominate, so a large-range feature would control the clusters
  • Scaling chooses k for you
  • Only DBSCAN needs scaling

Answer: Distances dominate, so a large-range feature would control the clusters. k-means uses Euclidean distance, so an unscaled large-range feature dominates; standardise so features contribute fairly.

What does hierarchical (agglomerative) clustering produce?

  • A single flat partition only
  • A regression line
  • A set of support vectors
  • A tree of nested clusters (dendrogram) you can cut at any level

Answer: A tree of nested clusters (dendrogram) you can cut at any level. Agglomerative clustering merges the closest groups step by step, building a dendrogram you can cut to get any number of clusters.

A known weakness of basic k-means is that it assumes clusters are:

  • Roughly spherical and similar in size
  • Always non-convex
  • Always overlapping
  • Labelled

Answer: Roughly spherical and similar in size. k-means uses distance to a centroid, so it favours round, similarly-sized blobs and struggles with elongated or nested shapes.

What are k-means centroids initialised as, in the smarter k-means++ scheme?

  • Always the first k rows
  • Spread-out points chosen to be far apart
  • The global mean repeated k times
  • Random labels

Answer: Spread-out points chosen to be far apart. k-means++ seeds centroids that are spread apart, which speeds convergence and avoids poor local optima from naive random starts.