The clustering problem often rears its head when dealing with enormous data sets of protein configurations. A typical set of MD trajectories can spit out millions of data points and coordinates, and it is essential that the scientist is able to “cluster” them into relevant states.
The k-centers algorithm is what is most frequently used in generating Markov State Models. The idea behind the algorithm is as follows:
Suppose we have N data points, and we ultimately want K centers. We arbitrarily pick one point i from N as the first center. Then we evaluate the distance of all N-1 points to i and store the distances in an array A, and pick out the point j that is furthest away to i. We now set j as the second next cluster center. Again, we re-evaluate the distance between all the N-2 points to j. For each point, if it is closer to j than i, we then re assign it to j and update the array A. We then go through A again and pick out the point furthest away from both i and j, and set it as the next center.
In the case of proteins, the distance function is an RMSD calculation of D dimensions (which can be very computationally expensive).
K-centers essentially tries to minimize the maximum distance between any point and a cluster center. The algorithm has a running time of
, but it is unlikely to find the optimum solution. However, what ever point set our first cluster center to will be at most a factor of 2 from the optimum. For a proof, please refer to:
We frequently use the k-centers routine in MSM-modelling, and quite often, it can take overnight just to cluster one set of data. I am currently working on a GPU-powered k-centers algorithm that should greatly reduce the number of necessary computations.