vector binary cluster-analysis hamming-distance

Fast way of doing k means clustering on binary vectors in c++

I want to cluster binary vectors (millions of them) into k clusters.I am using hamming distance for finding the nearest neighbors to initial clusters (which is very slow as well). I think K-means clustering does not really fit here. The problem is in calculating mean of the nearest neighbors (which are binary vectors) to some initial cluster center, to update the centroid.

A second option is to use K-medoids in which the new cluster center is chosen from one of the nearest neighbors ( the one which is closest to all neighbors for a particular cluster center). But finding that is another problem because numbers of nearest neighbors are also quite large.

Can someone please guide me?

Solution

Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.

Why do you need exactly k clusters? This will likely mean that some vectors won't fit to their clusters very well.

Some stuff you could look into for clustering: minhash, locality sensitive hashing.