c++ large-data

How to perform k-means on data larger than RAM?


I have implemented the k-means algorithm to cluster data. The data I'm working with can be larger than the amount of RAM I have available. Is there a common way (in C++) to handle this kind of problem?


Solution

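  • There are incremental k-means algorithms that update the cluster centers from a small batch of points at a time, so the full data set never has to fit in memory. One example is the mini-batch k-means described in D. Sculley's paper: http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

    C++ source code (sofia-ml): https://code.google.com/p/sofia-ml/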
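
To make the incremental idea concrete, here is a minimal sketch of a mini-batch style update that streams points from disk one chunk at a time. It is not the sofia-ml code. The file layout (raw binary floats, `dim` values per point, no header) and the names `nearest` and `minibatch_pass` are assumptions made for illustration. The paper samples its mini-batches at random; this sketch simply reads consecutive chunks, which assumes the points are stored in roughly random order.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <limits>
#include <string>
#include <vector>

using Centers = std::vector<std::vector<float>>;

// Index of the center closest to point p (squared Euclidean distance).
std::size_t nearest(const Centers& centers, const float* p, std::size_t dim) {
    std::size_t best = 0;
    float best_d = std::numeric_limits<float>::max();
    for (std::size_t c = 0; c < centers.size(); ++c) {
        float d = 0.0f;
        for (std::size_t j = 0; j < dim; ++j) {
            const float diff = centers[c][j] - p[j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}

// One pass over the file, holding at most `batch` points in memory.
// Assumed layout (hypothetical): raw binary floats, `dim` per point, no header.
// `counts` must persist across passes so each center's step size 1/count
// keeps shrinking as it absorbs more points.
void minibatch_pass(const std::string& path, std::size_t dim, std::size_t batch,
                    Centers& centers, std::vector<std::size_t>& counts) {
    assert(dim > 0 && batch > 0 && centers.size() == counts.size());
    std::ifstream in(path, std::ios::binary);
    std::vector<float> buf(batch * dim);
    std::vector<std::size_t> assign(batch);
    while (in) {
        in.read(reinterpret_cast<char*>(buf.data()),
                static_cast<std::streamsize>(buf.size() * sizeof(float)));
        // Number of complete points actually read (the last chunk may be short).
        const std::size_t n =
            static_cast<std::size_t>(in.gcount()) / (dim * sizeof(float));
        // Step 1: assign every point in the chunk with the centers held fixed.
        for (std::size_t i = 0; i < n; ++i)
            assign[i] = nearest(centers, &buf[i * dim], dim);
        // Step 2: move each assigned center toward its point.
        for (std::size_t i = 0; i < n; ++i) {
            std::vector<float>& c = centers[assign[i]];
            const float eta = 1.0f / static_cast<float>(++counts[assign[i]]);
            for (std::size_t j = 0; j < dim; ++j)
                c[j] += eta * (buf[i * dim + j] - c[j]);  // (1 - eta) * c + eta * x
        }
    }
}
```

A caller would initialize `centers` (for example from the first k points), set `counts` to zeros, and call `minibatch_pass` a few times over the same file. Memory use is bounded by `batch * dim` floats plus the centers, regardless of the file size.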