cluster-analysis, mahout, k-means

K-means clustering using Mahout


I am trying to run the k-means algorithm on my data using Mahout. One of the options that has to be passed when running it is a path to the initial clusters. Can anyone tell me how we can have initial clusters before the algorithm has even started?

bin/mahout kmeans \
    -i <input vectors directory> \
    -c <input clusters directory> \
    -o <output working directory> \
    -k <optional number of initial clusters to sample from input vectors> \
    -dm <DistanceMeasure> \
    -x <maximum number of iterations> \
    -cd <optional convergence delta. Default is 0.5> \
    -ow <overwrite output directory if present> \
    -cl <run input vector clustering after computing Canopies> \
    -xm <execution method: sequential or mapreduce>

Solution

  • A) Mahout is slooooow. If your data fits into main memory, use other tools such as ELKI; for me, they outperformed Mahout by far. If your data doesn't fit into main memory, are you sure k-means makes any sense on your data anyway? There is no point in running a computation that doesn't solve your problem. Start with a sample to check whether it works at all, then scale up. Mahout is a last-resort choice: use it only if you absolutely need this computed on all of your data and everything else has failed.

    B) Read all of the documentation. The very next line in the Mahout k-means documentation says:

    Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.

    In other words: if you know the initial cluster centers, supply them via -c and do not set -k. Otherwise, an empty -c folder is fine as long as you provide -k, the number of cluster centers to sample.
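
To make the two cases concrete, here is a minimal sketch of both invocations. The directory names, the choice of 5 clusters, the iteration limit of 10 and the `kmeans_sampled` / `kmeans_seeded` helper names are all hypothetical and not taken from the question; the distance measure is assumed to be Mahout's `EuclideanDistanceMeasure`. By default the script only prints the commands (a dry run), so check the flags against your Mahout version before running it for real.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set MAHOUT=bin/mahout in the environment to actually run it.
MAHOUT="${MAHOUT:-echo bin/mahout}"

# Hypothetical directories; substitute your own.
INPUT=data/vectors
CLUSTERS=data/initial-clusters
OUTPUT=data/kmeans-output
DM=org.apache.mahout.common.distance.EuclideanDistanceMeasure

# Case 1: no known centers. With -k, that many input vectors are sampled as
# initial centers, and anything already in the -c directory is overwritten.
kmeans_sampled() {
    $MAHOUT kmeans -i "$INPUT" -c "$CLUSTERS" -o "$OUTPUT" \
        -k "$1" -dm "$DM" -x 10 -ow -cl
}

# Case 2: initial centers already sit in the -c directory.
# -k is left out so that they are not overwritten.
kmeans_seeded() {
    $MAHOUT kmeans -i "$INPUT" -c "$CLUSTERS" -o "$OUTPUT" \
        -dm "$DM" -x 10 -ow -cl
}

kmeans_sampled 5
kmeans_seeded
```

The only difference between the two calls is the presence of `-k`: with it, the `-c` directory is just a place where Mahout writes the sampled centers; without it, the directory has to contain the centers you want to start from.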