java cluster-analysis mahout

Mahout: Spearman's correlation in Java


I'm using Mahout's KMeansDriver to build clusters, and I want to use Spearman correlation as the DistanceMeasure.

Is there an existing Java implementation of this, or do I need to write it myself?

I didn't find any examples of this on the web.


Solution

  • Do not use k-means with distance measures other than squared Euclidean.

    The algorithm may no longer converge.

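    K-means is designed to minimize variance: recomputing each cluster mean minimizes the within-cluster sum of squared deviations. The distance function used to assign points must minimize the same quantity, otherwise you lose the convergence guarantee. For guaranteed convergence with other distances, see partitioning around medoids (PAM), also known as k-medoids.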

    Correlation measures are a good example of distances that do not work with k-means:

    Consider the following two vectors and the absolute Spearman correlation distance, dist = 1 - |r|:

    1 2 3 4 5
    5 4 3 2 1
    

    Obviously, the Spearman correlation is -1, so the distance is 0 and these two vectors are considered "identical".

    However, k-means will now compute the mean of these two, which yields the constant vector

    3 3 3 3 3
    

    which is about as dissimilar to these two as it can be (in fact, its correlation with anything isn't even well defined, because a constant vector has zero variance). In other words: the mean does not minimize the absolute correlation distance, and you shouldn't use this distance function with k-means.
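The example above can be checked with a few lines of plain Java. This is only a minimal sketch: the class `SpearmanDemo` and its helper methods are names made up for illustration, not part of Mahout, and Spearman correlation is computed here as the Pearson correlation of average ranks. If you would rather not write it yourself, Apache Commons Math ships a `SpearmansCorrelation` class, but plugging it into k-means runs into the same problem.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Mahout.
class SpearmanDemo {

    // 1-based ranks; tied values share the average of their positions.
    static double[] ranks(double[] v) {
        int n = v.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));
        double[] r = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && v[idx[j + 1]] == v[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) r[idx[k]] = avg;
            i = j + 1;
        }
        return r;
    }

    // Pearson correlation; yields NaN when either input has zero variance.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    // Absolute correlation distance: dist = 1 - |r|.
    static double distance(double[] x, double[] y) {
        return 1.0 - Math.abs(spearman(x, y));
    }

    // Component-wise mean of two vectors, as the k-means update step would compute it.
    static double[] mean(double[] x, double[] y) {
        double[] m = new double[x.length];
        for (int i = 0; i < x.length; i++) m[i] = (x[i] + y[i]) / 2.0;
        return m;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {5, 4, 3, 2, 1};
        double[] m = mean(a, b);
        System.out.println("spearman(a, b) = " + spearman(a, b));
        System.out.println("dist(a, b)     = " + distance(a, b));
        System.out.println("mean           = " + Arrays.toString(m));
        System.out.println("dist(a, mean)  = " + distance(a, m));
    }
}
```

In this sketch the two vectors come out at distance 0 from each other, while the distance from either one to their mean is NaN, so the "center" that k-means computes cannot even be compared with its own cluster members.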

    Variance = squared Euclidean

    This is why you should be using k-means only with squared Euclidean distance.
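To make the connection between the mean and squared Euclidean distance concrete, here is a small sketch with a made-up three-point cluster. The class name `MeanMinimizesSse` and the sample points are assumptions chosen for illustration. It compares the sum of squared distances at the mean with the same sum at another candidate center.

```java
// Hypothetical demo class with made-up sample data.
class MeanMinimizesSse {

    // Sum of squared Euclidean distances from every point to center c.
    static double sse(double[][] points, double[] c) {
        double total = 0.0;
        for (double[] p : points) {
            for (int d = 0; d < c.length; d++) {
                double diff = p[d] - c[d];
                total += diff * diff;
            }
        }
        return total;
    }

    // Component-wise arithmetic mean of the points.
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) m[d] += p[d];
        }
        for (int d = 0; d < m.length; d++) m[d] /= points.length;
        return m;
    }

    public static void main(String[] args) {
        double[][] cluster = {{1, 2}, {3, 4}, {5, 0}};
        double[] m = mean(cluster);   // (3, 2)
        double[] other = {4, 3};      // some other candidate center
        System.out.println("SSE at mean  = " + sse(cluster, m));
        System.out.println("SSE at other = " + sse(cluster, other));
    }
}
```

For these points the sum is 16.0 at the mean and 22.0 at the other candidate. The gap of 6.0 equals the number of points (3) times the squared distance between the candidate and the mean (2), which is why no other center can do better under squared Euclidean distance.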

    On L2-normalized vectors: Variance ~ Cosine

    This is easy to see when looking at the definition of cosine similarity, and the reason why spherical k-means also works.
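The relationship can be written out in one line. The symbols a and b are just labels chosen here for two vectors of unit length:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2 - 2\cos(a, b)
\qquad \text{for } \|a\| = \|b\| = 1
```

So on the unit sphere, a smaller squared Euclidean distance always means a larger cosine similarity, and the two give the same ordering of neighbors.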