I'm using Mahout's KMeansDriver to build clusters, and want to use Spearman as the DistanceMeasure.
Is there an existing implementation of this in Java, or do I need to write it myself?
I didn't find any examples of this on the web.
K-means is designed to minimize variance. Your distance function must also minimize variance, otherwise you lose the convergence property. For guaranteed convergence with other distances, see partitioning around medoids (PAM) aka k-medoids.
Correlation measures are a good example of distances that do not work with k-means:
Consider the following two vectors and the absolute Spearman correlation distance, dist = 1 - |r|:
1 2 3 4 5
5 4 3 2 1
Obviously, the Spearman correlation is -1, so dist = 1 - |-1| = 0 and these two vectors are considered "identical".
However, k-means will now compute the mean of these two, which yields the constant vector
3 3 3 3 3
which is as dissimilar to these two as it can get (in fact, its correlation with anything isn't even well defined, because a constant vector has zero variance). In other words: the mean does not minimize the absolute correlation distance, and you shouldn't use this distance function with k-means.
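The example above can be reproduced in a few lines of plain Java. This is only a minimal sketch and does not use Mahout's API: the class SpearmanMeanDemo and its helper methods are hypothetical names, and Spearman is computed here as the Pearson correlation of average ranks.

```java
import java.util.Arrays;

public class SpearmanMeanDemo {

    // Average ranks (1-based); tied values share the mean of their positions.
    static double[] ranks(double[] v) {
        int n = v.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));
        double[] r = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && v[idx[j + 1]] == v[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) r[idx[k]] = avg;
            i = j + 1;
        }
        return r;
    }

    // Pearson correlation; yields NaN when either input has zero variance.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Spearman correlation = Pearson correlation of the rank vectors.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    // Absolute correlation distance from the text: dist = 1 - |r|.
    static double distance(double[] x, double[] y) {
        return 1.0 - Math.abs(spearman(x, y));
    }

    // Element-wise mean of two vectors, i.e. the k-means centroid of the pair.
    static double[] mean(double[] x, double[] y) {
        double[] m = new double[x.length];
        for (int i = 0; i < x.length; i++) m[i] = (x[i] + y[i]) / 2.0;
        return m;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {5, 4, 3, 2, 1};
        double[] m = mean(a, b);
        System.out.println("spearman(a, b) = " + spearman(a, b));
        System.out.println("dist(a, b)     = " + distance(a, b));
        System.out.println("mean           = " + Arrays.toString(m));
        System.out.println("dist(mean, a)  = " + distance(m, a)); // undefined: NaN
    }
}
```

The two vectors are at distance 0 from each other, yet their centroid has no defined distance to either of them, so the k-means update step does not produce a useful cluster center under this measure.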
This is why you should be using k-means only with squared Euclidean distance.
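Cosine similarity is a notable exception: on vectors normalized to unit length, squared Euclidean distance is a linear function of cosine similarity. This is easy to see when looking at the definition of cosine similarity, and it is the reason why spherical k-means also works.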
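The link between cosine similarity and squared Euclidean distance can be checked numerically. The following is a minimal sketch, assuming the vectors are L2-normalized first; in that case ||a - b||^2 = 2 * (1 - cos(a, b)). The class name UnitVectorDistanceDemo, its helpers, and the sample vectors are made up for illustration.

```java
public class UnitVectorDistanceDemo {

    // Dot product of two vectors of equal length.
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    // Scale a vector to unit Euclidean length.
    static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] u = new double[v.length];
        for (int i = 0; i < v.length; i++) u[i] = v[i] / norm;
        return u;
    }

    // Squared Euclidean distance, the quantity k-means minimizes.
    static double squaredEuclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            s += d * d;
        }
        return s;
    }

    public static void main(String[] args) {
        double[] a = normalize(new double[]{1, 2, 3});
        double[] b = normalize(new double[]{-2, 0.5, 4});
        double cos = dot(a, b); // cosine similarity of unit vectors is their dot product
        System.out.println("squared Euclidean = " + squaredEuclidean(a, b));
        System.out.println("2 * (1 - cos)     = " + 2 * (1 - cos));
    }
}
```

Both printed values agree up to floating-point rounding, so ranking unit vectors by cosine similarity is the same as ranking them by squared Euclidean distance.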