K-Means: The maths behind it, how it works and an example
K-means is an unsupervised clustering algorithm that groups unlabelled data into a predefined (or yet-to-be-determined) number of clusters. A cluster is a group of data points that share similar properties while being distinct from the points in other clusters.
Let’s take a step back, though, and tackle the concept of unsupervised learning. While supervised learning deals with data that has a clear label, class, or regression target, unsupervised machine learning deals with unlabelled data, or data without classes. Consider a dataset and how it can fit either setting: in a supervised machine learning dataset you would have any number of independent variables and a dependent variable. Here, however, there is no dependent variable, otherwise known as a class. To quote Yann LeCun: “If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.” So you definitely have the data, but it’s unlabelled. We can’t tell whether there are two classes or any at all. This is exactly what unsupervised machine learning is for.
An important concept to cover before we dive into how K-means works is clustering. Clustering is the idea of grouping a number of data points into a cluster, hence the name. This is only an introduction to clustering and there is much more to be said about it, but for now it should suffice.
K-means was first conceived by Stuart Lloyd at Bell Labs as early as 1957, although it was mostly kept internal and was only published in 1982. Coincidentally, Edward W. Forgy published essentially the same algorithm in 1965, which is why it is sometimes referred to as Lloyd-Forgy.
Now let’s go step by step on how K-Means works:
1. Pick a number called k; this represents the number of clusters we defined above.
Now, how do we pick that magic number? There are a few ways to choose a good k, and later I will describe a technique to determine whether it is the optimal number or not. K is chosen by deciding how many groups you want the data to be split into. For example, if you have a dataset of users you wish to group and you believe those users fit into three groups, then your k is equal to 3. Once k is chosen, k points (the initial centroids) are placed at random positions among the data.
2. Calculate the distance between the data points and the new k points.
We calculate the distance between each existing data point in the dataset and each of the randomly assigned k points. Here is the equation for the Euclidean distance.
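For two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ the distance is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$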
3. Assign each data point to its nearest k point.
After step 2 you have a list of distances between every data point and each of the k points. Each data point is then assigned to the k point with the smallest distance.
4. Recalculate the centroids.
Calculate the mean of each newly formed cluster and move that cluster’s centroid to the mean. Thus each k point effectively changes position.
5. Repeat steps 2–4 for a number of iterations.
So, after we have grouped the data points, we repeat the process with the new centroids until they no longer change. The missing piece of information is how the initial centroids are chosen. One approach is to pre-define the centroids yourself and set the n_init hyperparameter to 1. In Scikit-Learn the default value of n_init is 10, meaning the whole algorithm is run 10 times, each time starting from a different random centroid initialisation. Note: when the algorithm is run n times, only the best solution produced is kept.
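Here is a minimal sketch of this with Scikit-Learn; the toy dataset and the choice of k = 3 are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset: three loose blobs (purely illustrative)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# n_init=10: run K-Means 10 times with different random centroid
# initialisations and keep the run with the lowest inertia.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroids, one row per cluster
print(kmeans.inertia_)          # sum of squared distances to the centroids
```

If you prefer to pre-define the centroids, you can pass them as an array to the `init` parameter and set `n_init=1` so the algorithm does not try to re-initialise them.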
A metric one could use to score how well our model does is something called inertia. It is the sum of the squared distances between each data point and its final, best centroid. The lower the inertia, the better the model.
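More formally, for $m$ data points $x_i$ and final centroids $\mu_1, \dots, \mu_k$, the inertia is

$$\text{inertia} = \sum_{i=1}^{m} \min_{j \in \{1, \dots, k\}} \lVert x_i - \mu_j \rVert^2$$

Scikit-Learn exposes this value as the `inertia_` attribute after fitting, as shown in the snippet above.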
From what we learned about inertia we can go ahead and determine the best value of k, or in other words the best number of clusters to group the dataset into. The method we use for that is called the elbow method: we plot the inertia against the number of clusters and look for the “elbow” of the curve. Let’s take an example here and see what I mean.
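Here is a rough sketch of the elbow method, assuming `X` is the same toy dataset from the earlier snippet:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for a range of k values and record the inertia of each fit.
ks = range(1, 10)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Plot inertia against the number of clusters and look for the "elbow".
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```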
Now, as we can see, the plot drops steeply and then plateaus, in this case around k = 3 on the x-axis. That is considered the best value when it comes to choosing the number of clusters.