K-centers and clustering

The clustering problem often rears its head when dealing with enormous data sets of protein configurations. A typical set of MD trajectories can spit out millions of data points and coordinates, and it is essential that the scientist is able to “cluster” them into relevant states.

The k-centers algorithm is what is most frequently used in generating Markov State Models. The idea behind the algorithm is as follows:

Suppose we have N data points, and we ultimately want K centers. We arbitrarily pick one point i from N as the first center. Then we evaluate the distance of all N-1 points to i and store the distances in an array A, and pick out the point j that is furthest away to i. We now set j as the second next cluster center. Again, we re-evaluate the distance between all the N-2 points to j. For each point, if it is closer to j than i, we then re assign it to j and update the array A. We then go through A again and pick out the point furthest away from both i and j, and set it as the next center.

In the case of proteins, the distance function is an RMSD calculation of D dimensions (which can be very computationally expensive).

K-centers essentially tries to minimize the maximum distance between any point and a cluster center. The algorithm has a running time of O(n*k), but it is unlikely to find the optimum solution. However, what ever point set our first cluster center to will be at most a factor of 2 from the optimum. For a proof, please refer to:

We frequently use the k-centers routine in MSM-modelling, and quite often, it can take overnight just to cluster one set of data. I am currently working on a GPU-powered k-centers algorithm that should greatly reduce the number of necessary computations.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>