The dataset

The MNIST dataset consists of thousands of images of handwritten digits. (Hover the mouse over the points to see what they look like.) The objective is to cluster them by similarity, a step prior to classifying them.

Preprocessing using TensorFlow

Clustering the raw data gives poor results (due to 4’s and 9’s looking similar, for example), so we first learn meaningful features and then cluster the data in feature space. We used the first example from the TensorFlow tutorial: we train a simple one-layer softmax neural network on images from the training set, and then run the trained network on the first 1000 elements of the test set. As a result, each image gets mapped to a 10-dimensional vector representing the probability that it is each digit. Since the entries of this vector sum to 1, the feature space is actually 9-dimensional. Note that spectral clustering would be useless in this case, since all its ‘denoising’ comes from projecting the points onto the low-dimensional space spanned by the top eigenvectors of the Laplacian.
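For concreteness, here is a minimal sketch of this preprocessing step. It uses the modern tf.keras API rather than the original tutorial code, and the hyperparameters (epochs, batch size, optimizer) are illustrative choices, not the ones used in the post:

```python
import tensorflow as tf

# Load MNIST: 60k training and 10k test images of 28x28 grayscale digits.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# One-layer softmax network, in the spirit of the original TensorFlow tutorial.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=5, batch_size=100)

# Map the first 1000 test images to 10-dimensional probability vectors.
features = model.predict(x_test[:1000])
```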

These feature vectors are the ones we represent in our plot. Since 10 dimensions are too many to plot, we depict the feature vectors projected onto a random plane.
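One standard way to sample a random plane is to orthonormalize a Gaussian matrix; a minimal NumPy sketch (the `features` array here is a stand-in for the softmax vectors computed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 1000 softmax probability vectors computed above.
features = rng.dirichlet(np.ones(10), size=1000)

G = rng.normal(size=(10, 2))   # random Gaussian 10x2 matrix
Q, _ = np.linalg.qr(G)         # orthonormal basis for a random plane
coords_2d = features @ Q       # 2-D coordinates used for plotting
```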

The k-means semidefinite relaxation

Given a set of points $x_1, \dots, x_n \in \mathbb{R}^d$, the $k$-means objective is to find a partition $A_1, \dots, A_k$ into $k$ clusters that minimizes the sum of the squared distances of the points to the centroid of their respective cluster. In other words,

$$\min_{A_1, \dots, A_k} \ \sum_{t=1}^{k} \sum_{i \in A_t} \left\| x_i - \frac{1}{|A_t|} \sum_{j \in A_t} x_j \right\|^2.$$

The k-means SDP is based on the following observation:

$$\sum_{i \in A_t} \left\| x_i - \frac{1}{|A_t|} \sum_{j \in A_t} x_j \right\|^2 = \frac{1}{2|A_t|} \sum_{i, j \in A_t} \| x_i - x_j \|^2.$$
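(For completeness, the identity follows by expanding both sides: writing $c_t = \frac{1}{|A_t|} \sum_{j \in A_t} x_j$, each side equals $\sum_{i \in A_t} \|x_i\|^2 - |A_t| \|c_t\|^2$.)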

Therefore, if $D$ is the matrix such that $D_{ij} = \|x_i - x_j\|^2$, the $k$-means problem can be written as

$$\min_{A_1, \dots, A_k} \ \frac{1}{2} \operatorname{Tr}(DX), \qquad X = \sum_{t=1}^{k} \frac{1}{|A_t|} \mathbf{1}_{A_t} \mathbf{1}_{A_t}^\top,$$

where $\mathbf{1}_{A_t}$ is the indicator vector of $A_t$, i.e. the vector such that $(\mathbf{1}_{A_t})_i = 1$ if $i \in A_t$ and $0$ otherwise.

Since optimizing over all possible partitions is NP-hard, we are going to relax the constraints to convex ones. In particular, the semidefinite relaxation:

$$\min_{X} \ \frac{1}{2} \operatorname{Tr}(DX) \quad \text{subject to} \quad \operatorname{Tr}(X) = k, \ \ X \mathbf{1} = \mathbf{1}, \ \ X \geq 0, \ \ X \succeq 0,$$

where $X \geq 0$ means that all entries of $X$ are non-negative, and $X \succeq 0$ means that $X$ is symmetric and positive semidefinite. Now $X$ may not represent a partition anymore, so we need a rounding step.
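The author's code (linked at the end of the post) solves this SDP with SDPNAL+ in MATLAB; as a hedged sketch, the same program can be written in Python with CVXPY. The function name is my own, and generic solvers only scale to a few hundred points:

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(points, k):
    """Solve the k-means SDP relaxation; returns the optimal matrix X."""
    n = points.shape[0]
    # D[i, j] = ||x_i - x_j||^2, computed from the Gram matrix.
    sq = np.sum(points ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * points @ points.T

    X = cp.Variable((n, n), PSD=True)       # X symmetric positive semidefinite
    constraints = [
        cp.trace(X) == k,                   # Tr(X) = k
        X @ np.ones(n) == np.ones(n),       # X 1 = 1
        X >= 0,                             # all entries non-negative
    ]
    prob = cp.Problem(cp.Minimize(0.5 * cp.trace(D @ X)), constraints)
    prob.solve()
    return X.value
```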

Interpreting the SDP result

Let $P$ be the $n \times d$ matrix whose $i$-th row contains the coordinates (in $\mathbb{R}^d$) of $x_i$. Note that when $X = \sum_{t=1}^{k} \frac{1}{|A_t|} \mathbf{1}_{A_t} \mathbf{1}_{A_t}^\top$ for some partition (i.e.: the relaxation is tight), the $i$-th row of the matrix $XP$ contains the coordinates of the corresponding center (i.e.: the $i$-th row of $XP$ is $\frac{1}{|A_t|} \sum_{j \in A_t} x_j$ whenever $i \in A_t$).

Let’s arrange the points such that the first points are the zeros, the next points are the ones, and so on. Then the matrix $X$ that corresponds to the ground-truth clustering is a block-diagonal matrix whose $t$-th block is constant, with all entries equal to $\frac{1}{|A_t|}$:

$$X = \begin{pmatrix} \frac{1}{|A_1|} \mathbf{1}\mathbf{1}^\top & & \\ & \ddots & \\ & & \frac{1}{|A_k|} \mathbf{1}\mathbf{1}^\top \end{pmatrix}.$$

The SDP, however, finds a low-rank matrix $X$ with many repeated columns and rows instead:

[Image: heatmap of the matrix $X$ returned by the SDP]

So when we apply it we get:

[Interactive Bokeh plot: the clustered feature vectors]

We exploit the fact that $XP$ has many repeated rows in our rounding step.
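The post doesn't spell out the rounding procedure, but a minimal sketch consistent with this observation would group points whose rows of $XP$ nearly coincide (the tolerance below is an assumption):

```python
import numpy as np

def round_solution(X, P, tol=1e-3):
    """Assign cluster labels by grouping near-identical rows of X @ P."""
    centers = X @ P                      # row i ~ centroid of x_i's cluster
    n = centers.shape[0]
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] >= 0:
            continue                     # already assigned to a group
        close = np.linalg.norm(centers - centers[i], axis=1) < tol
        labels[close & (labels < 0)] = next_label
        next_label += 1
    return labels
```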

Approximation guarantees and tightness

A future post will focus on the mathematical analysis of this algorithm. In particular, the SDP is known to exactly recover the clusters when the points come from a data model called the stochastic ball model (see this and this). And it is known to approximate the centers when the points come from a mixture of subgaussian distributions.

About the interactive plots and clustering implementation

An implementation of the $k$-means SDP is available here. The implementation uses SDPNAL+, which is an amazing SDP solver for problems with non-negativity constraints.

I wrote the first plot in JavaScript using D3, and the second interactive plot in Python using Bokeh. If someone asks, I can write a tutorial on how to make these plots, but many tutorials and examples are already available on the internet.

Note that the fact that $X$ has many repeated columns implies that many points are mapped to the same point after computing $XP$. If you hover the mouse over the points in the second plot you’ll see many images. But if you do it in the first one you’ll see only one (I added a little bit of noise so you can move the mouse and see a different image). That’s because I haven’t figured out how to replicate that behavior in JavaScript. I’ll be happy to hear suggestions.