I want to cluster binary vectors (millions of them) into k clusters.I am using hamming distance for finding the nearest neighbors to initial clusters (which is very slow as well). I think K-means clustering does not really fit here. The problem is in calculating mean of the nearest neighbors (which are binary vectors) to some initial cluster center, to update the centroid.
A second option is to use K-medoids in which the new cluster center is chosen from one of the nearest neighbors ( the one which is closest to all neighbors for a particular cluster center). But finding that is another problem because numbers of nearest neighbors are also quite large.
Can someone please guide me?
Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.
Why do you need exactly k
clusters? This will likely mean that some vectors won't fit to their clusters very well.
Some stuff you could look into for clustering: minhash, locality sensitive hashing.