Tags: tensorflow, machine-learning, pyspark, cluster-analysis, k-means

Faster Kmeans Clustering on High-dimensional Data with GPU Support


We've been using KMeans to cluster our logs. A typical dataset has 10 million samples with 100k+ features.

To find the optimal k, we run multiple KMeans fits in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. We are currently using scikit-learn KMeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB of RAM.
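
Roughly, the current setup looks like the sketch below (the helper names `fit_and_score` and `best_k` and the subsampled silhouette are just for illustration; the input is a scipy sparse CSR matrix):

    from joblib import Parallel, delayed
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def fit_and_score(X, k, sample_size=10_000, seed=0):
        """Fit one KMeans model and score it with a subsampled silhouette."""
        model = KMeans(n_clusters=k, random_state=seed).fit(X)
        # Scoring on a subsample keeps the O(n^2) silhouette computation tractable.
        score = silhouette_score(X, model.labels_,
                                 sample_size=sample_size, random_state=seed)
        return k, score, model

    def best_k(X, candidate_ks, n_jobs=-1):
        """Run several KMeans fits in parallel, return the best (k, score, model)."""
        results = Parallel(n_jobs=n_jobs)(
            delayed(fit_and_score)(X, k) for k in candidate_ks)
        return max(results, key=lambda r: r[1])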

I am currently researching a faster solution.

What I have already tested:

  1. KMeans + Mean Shift combination - slightly better (for k=1024 --> ~13h), but still slow.

  2. Kmcuda library - no support for sparse matrix representation. It would take roughly 3-4 TB of RAM to hold this dataset as a dense matrix in memory (see the rough estimate after this list).

  3. TensorFlow (tf.contrib.factorization.python.ops.KmeansClustering()) - I only started investigating this today, but either I am doing something wrong or I don't know how to configure it properly. In my first test with 20k samples and 500 features, clustering on a single GPU was slower than on a CPU with 1 thread.

  4. Facebook FAISS - no support for sparse representation.
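
For context, a back-of-the-envelope estimate of the dense vs. sparse footprint (the 0.1% density is an assumed value, purely for illustration):

    n_samples, n_features = 10_000_000, 100_000
    dense_bytes = n_samples * n_features * 4            # float32
    print(f"dense: {dense_bytes / 1e12:.1f} TB")        # ~4 TB

    density = 0.001                                      # assumed 0.1% non-zeros
    nnz = int(n_samples * n_features * density)
    # CSR: float32 data + int32 column indices + int64 row pointers
    csr_bytes = nnz * (4 + 4) + (n_samples + 1) * 8
    print(f"CSR:   {csr_bytes / 1e9:.0f} GB")            # ~8 GB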

PySpark MLlib KMeans is next on my list, but would it make sense on a single node?
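
What I have in mind there is something along these lines (a minimal sketch with toy sparse vectors standing in for the real log features; k and seed are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Toy sparse vectors standing in for the real 100k-dimensional log features.
    df = spark.createDataFrame(
        [(Vectors.sparse(100_000, [0, 3], [1.0, 7.0]),),
         (Vectors.sparse(100_000, [1, 4], [2.0, 3.0]),),
         (Vectors.sparse(100_000, [0, 4], [4.0, 1.0]),)],
        ["features"])

    kmeans = KMeans(k=2, seed=42, featuresCol="features")
    model = kmeans.fit(df)
    print(model.clusterCenters()[0][:5])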

Would training for my use case be faster on multiple GPUs, e.g., TensorFlow with 8 Tesla V100s?

Is there any magical library that I haven't heard of?

Or should we simply scale vertically?


Solution

  • Thanks to @desertnaut for his suggestion of the RAPIDS cuML library.

    The follow-up can be found here.
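
    For reference, a minimal sketch of the cuML route (the 256-dimensional dense input and the parameter values are placeholders; cuML's KMeans operates on dense GPU arrays, so this assumes the sparse features have first been reduced, e.g. with TruncatedSVD, to something that fits in GPU memory):

        import cupy as cp
        from cuml.cluster import KMeans

        # Placeholder data: dense, reduced representation of the log features.
        X = cp.random.rand(100_000, 256, dtype=cp.float32)

        kmeans = KMeans(n_clusters=50, max_iter=300, random_state=0)
        labels = kmeans.fit_predict(X)
        print(kmeans.cluster_centers_.shape)   # (50, 256)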