# Machine Learning: K-Means Clustering

K-means clustering is the first unsupervised learning algorithm in this series. Unsupervised means that the answer is not available to the learning algorithm beforehand, just the cost of a potential solution. To me, unsupervised learning algorithms are more exciting than supervised learning algorithms because they seem to transcend human intelligence in a way. An unsupervised learning algorithm will seek out patterns in data without any (or with few) hints. This seems especially important when, as the human, we don't know what the hints could possibly be.

The Google "Visual Cortex" project shows how powerful unsupervised learning algorithms can be: from millions of unlabeled images, the algorithm found generalized categories such as human faces and cat faces. It is easy to see that if the same thing could be done with an audio stream or a text stream, the streams could be combined at a high enough level for association to produce sounds and text for images, images for text and sounds, and at a high enough level, reasoning.

The K-means clustering algorithm treats data as if it were in clusters centered around some number of points k, one cluster per point. Conceptually, the algorithm picks k centroid points, assigns each point in the data to a cluster based on how close it is to which cluster's centroid, moves each centroid to the center of its cluster, and repeats. The result is a set of centroids which minimizes the distances between each point and its associated cluster's centroid.

The cost function is:

$J = \frac{1}{m}\sum_i^k \sum_{j \in C_i} (x_j - \mu_i)^2$

where Ci is cluster i, and μi is the centroid for cluster i.

There are a few methods for picking the initial centroids. One method, the Forgy method, involves picking k random points from the data set to be the initial centroids. Another method, the Random Partition method, assigns each data point to a random cluster, then produces the initial centroids for each cluster. Regardless of the initial method, the algorithm proceeds by repeating the following two steps:

First, produce the clusters by assigning each data point to one cluster. This means comparing the distance of a point to each centroid, and assigning the point to the cluster whose centroid yields the lowest distance.

Second, calculate the centroid of each resulting cluster.

Repeat these steps until the total cost does not change.

### A Concrete Example

I implemented the above algorithm in Java and ran it on the usual concrete strength data set. As usual, I set aside 20% of the data set as a cross-validation. But a problem quickly became apparent: how many clusters should I use? Clearly the more clusters, the less the overall cost would be simply because there would be more centroids.

One solution is to try different numbers of centroids and ask if there is an obvious point where there is not a lot of improvement in the cost. Here is what I found from k=2 through 10:

Blue is the training cost, while red is the cross-validation cost. Interestingly, the cross-validation cost was always below the training cost, indicating that the cross-validation points represent well the training points. There is clearly no overfitting because there is no large gap between costs. However, there is no obvious point at which increasing the number of clusters doesn't help much.

The other solution to the number of clusters relies on evaluating different numbers of clusters only after later processing. If downstream processing works better with a certain number of clusters, then that number of clusters should be chosen. So, for example, if I put each cluster's data through a neural network, how good is the error for each number of clusters?

I trained an 8:10:10:1 neural network on each cluster of points, so k=2 had 2 networks, and k=10 had 10 networks. I used fewer hidden neurons than before, on the theory that each cluster has less data, meaning that I can probably get away with a smaller parameter space. Here are the results:

Clearly the more clusters and the more networks, the better the output. Perhaps because more networks means smaller clusters, which in turn means less variation to account for. Interestingly, 8 clusters works about as well as 5 clusters, and it's only with 9 and 10 clusters that more advantage is found. In any case, choosing k=10, here are the errors:

Compared to training a single 8:20:20:1 network on the entire data set, clustering has definitely reduced the errors. Most errors in the training set are now under 5% (down from 10% before), and even the one troubling point from before (error 100%) has been knocked down to an error of 83%. The low errors in the cross-validation points -- which, remember, the network has never seen -- all lead us to believe that the networks trained are not overfit.

I would still want to look at those high-error points, perhaps even asking for the experimental data for those points to be rechecked or even rerun. But for now, I would be happy with this artificially intelligent concrete master.

For the next article, I'm going to go off the syllabus of the Machine Learning course, and talk about one of my favorite unsupervised learning algorithms, the autoencoder.