MLFSMLFS

Part 2 · Chapter 09

Clustering: Group Therapy for Data

K-Means, the elbow method, finding cliques in chaos.

So far, we've lived in the cozy, well-lit world of supervised learning. We've always had an answer key. Our data was neatly labeled, and our job was to teach a model to predict those labels.

Now, we're turning off the lights.

Welcome to unsupervised learning. Here, we have no labels. No right answers. We just have a chaotic mess of data points, and our job is to find the natural structure within them. We're not predicting; we're discovering.

The most fundamental task in unsupervised learning is clustering: finding the natural "cliques" or "groups" in your data. And the king of clustering algorithms is K-Means.

The Goal: Finding the Cliques

Imagine you're a high school principal looking at a map of where all your students hang out during lunch. You don't have pre-defined labels like "jocks" or "nerds." You just see a scatter plot of students. Your goal is to identify the main social groups. You'd probably circle the dense clumps of students, right?

That's exactly what K-Means does. It's an algorithm designed to partition your data into a pre-defined number (k) of clusters.

The K-Means Algorithm: A Ghosting Story

The K-Means algorithm is a beautiful, iterative dance. It's like forcing your data points into group therapy sessions until they find where they truly belong.

  • Step 1: Initialize (The Awkward Mixer). You decide you want to find k groups (let's say k=3). To start, you pluck k random data points from your dataset and call them centroids.
  • Step 2: Assign (Find Your Clique). Go through every single data point. For each point, calculate its distance to all three centroids. Assign the data point to the group of the centroid it's closest to.
  • Step 3: Update (The Centroid Ghosts). Now, the centroids look at the new group of friends they've attracted. They say, "I should be in the middle of my new friend group." So, each centroid calculates the mean (the average position) of all the data points in its cluster and moves there.
  • Step 4: Repeat. Now that the centroids have moved, some data points might be closer to a different centroid. So, you repeat Step 2, then Step 3.

You keep repeating this Assign-Update loop until the centroids stop moving. When a full loop happens and no data points change their cluster assignment, the algorithm has converged. The groups are stable. The therapy session is over.

Live · k-means dance · iter 0
k (number of clusters)3

From-Scratch K-Means: Let's Make Some Clusters

The core of K-Means is a loop that contains our two main steps.

kmeans.py

Visualizing this process is incredibly satisfying. You can plot the data points and watch the centroids dance around with each iteration until they settle into the heart of their clusters.

The Elbow Method: Finding the Sweet Spot (Unlike Your Ex)

There's a huge question we've been ignoring: how do you pick k? Why 3 clusters and not 2, or 10?

This is a common problem in unsupervised learning. Since there's no "right" answer, we have to use a heuristic. The most popular one is the Elbow Method.

Analogy: The Law of Diminishing Returns

Imagine you're cleaning a messy room.

  • Adding one trash can (k=1) helps a lot. Huge improvement.
  • Adding a second trash can (k=2) for recycling also helps a lot.
  • Adding a third (k=3) might still be useful.
  • Adding a tenth (k=10)? At this point, you're just adding more trash cans for the sake of it.

The Elbow Method does the same for clusters. We calculate a score called the Within-Cluster Sum of Squares (WCSS). This is just the sum of the squared distances between each point and its centroid. It's a measure of how tight and compact the clusters are. A lower WCSS is better.

We run K-Means for a range of k values and plot the WCSS for each. The graph will look like an arm bending downwards.

The "elbow" of the arm is the point where the WCSS stops decreasing so rapidly. It's the point of diminishing returns, where adding another cluster doesn't give you much bang for your buck. This is our best guess for the optimal value of k.

It's important to recognize that the random initialization of centroids can be a significant weakness. A bad start can cause the algorithm to converge to a local minimum—a solution that is good, but not the best possible one. This is the same "stuck in a shallow valley" problem we saw with gradient descent. To combat this, standard implementations of K-Means (like the one in sklearn) run the entire algorithm multiple times with different random starting points (n_init) and then pick the best final result.