cluster-analysis mahout k-means

Using Mahout for clustering one point


I know that Mahout is meant for batch processing, but I would like to know whether, and how, I can use its KMeans to cluster individual points.

Let's say we have the following situation:

  • Global clustering, which runs as a batch job over all the data and produces centroids as its result
  • One-point clustering, which uses the centroids from the global clustering to assign a single point to a cluster. It does not require recomputing the centroids, only assigning the point to an existing cluster

Can I do this using Mahout, or do I have to implement it myself? I thought of setting the number of iterations to 1 and assigning the point that way, but KMeans recomputes the cluster centroids, and if the new point is an outlier, it makes a new cluster from it. I don't want that; I just want the distance to the closest centroid.

For now, it seems that KMeans is not well suited to this and that it should be implemented separately. Is that correct?

Thanks


Solution

  • You don't need to use Mahout for this.

    K-means assigns points to the nearest center.

    So just load all the centers (they should easily fit into RAM), compute the squared Euclidean distance from the point to each center, and pick the smallest.

    This takes only a few CPU cycles; there is no benefit in trying to do it on Mahout, because the overhead would be far too large for just k distance computations.
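
The assignment step described in this answer can be sketched in a few lines of plain Python. This is only an illustration, not Mahout code: the function name `nearest_centroid` and the sample centroids are made up, and it assumes the centroids from the batch run have already been exported into an in-memory list and that Euclidean distance is the measure used for clustering.

```python
import math
from typing import Sequence, Tuple


def nearest_centroid(
    point: Sequence[float],
    centroids: Sequence[Sequence[float]],
) -> Tuple[int, float]:
    """Return (index, Euclidean distance) of the centroid closest to point.

    The centroids are only read, never updated, so an outlier cannot
    create a new cluster or shift an existing one.
    """
    if not centroids:
        raise ValueError("at least one centroid is required")

    best_index = -1
    best_squared = math.inf
    for index, centroid in enumerate(centroids):
        if len(centroid) != len(point):
            raise ValueError("point and centroid dimensions differ")
        # Squared distance is enough for comparison; take the root once at the end.
        squared = sum((p - c) ** 2 for p, c in zip(point, centroid))
        if squared < best_squared:
            best_index, best_squared = index, squared

    return best_index, math.sqrt(best_squared)


# Hypothetical centroids, standing in for the output of the batch clustering job.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

index, distance = nearest_centroid((9.0, 8.0), centroids)
print(index, distance)
```

Because the function returns the distance as well as the index, the caller can compare it against a threshold of their own choosing to flag outliers instead of letting them form a cluster.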