algorithm math cluster-analysis weka distance

euclidean vs manhattan distance for text clustering

I am trying different clustering algorithms using Weka. When I try SimpleKMeans algorithm using euclidean distance I get less incorrectly classified instances, then when I try with Manhattan distance I get more incorrectly classified instances. What is the best distance metrics for text clustering and why? Why I get very different results ? I am using classes to cluster evaluation cluster mode.

Solution

Assuming a Bag of Words approach, the Manhattan distance is more suited for document comparison (the cosine distance is usually the best approach though), but the K-Means is a kind of gradient descent algorithm which assumes the cost function is differentiable, which is the case with the Euclidean distance but not in general with the Manhattan distance. So even though the Euclidean metric is not the best one for comparison, the K-Means might converge to a better solution with the Euclidean distance than with the Manhattan distance.