Search code examples
vectorbinarycluster-analysishamming-distance

Fast way of doing k means clustering on binary vectors in c++


I want to cluster binary vectors (millions of them) into k clusters.I am using hamming distance for finding the nearest neighbors to initial clusters (which is very slow as well). I think K-means clustering does not really fit here. The problem is in calculating mean of the nearest neighbors (which are binary vectors) to some initial cluster center, to update the centroid.

A second option is to use K-medoids in which the new cluster center is chosen from one of the nearest neighbors ( the one which is closest to all neighbors for a particular cluster center). But finding that is another problem because numbers of nearest neighbors are also quite large.

Can someone please guide me?


Solution

  • Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.

    Why do you need exactly k clusters? This will likely mean that some vectors won't fit to their clusters very well.

    Some stuff you could look into for clustering: minhash, locality sensitive hashing.