machine-learning cluster-analysis mahout

Mahout: RowSimilarity vs Clustering

I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.

So I started searching for calculating distance between documents and stumbled upon the RowSimilarity approach which returns 10 most similar documents to each document, ordered by distance. Now this approach relies on a similarity metric like LogLikelihood etc to calculate the distance between documents.

Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?

What I'm trying to achieve is that I'm trying to cluster products on the basis of their titles and other text properties to recommend similar products. Any help is appreciated.

Solution

Clustering is not just another variant of classification or recommendation. It is a different discipline.

When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.

Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.

For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) are usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".

And I havn't started on the need to find an appropriate value of k, on the Mahout implementation likely just being an approximation of Lloyds k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results. For example, almost all clusters containing 1 or 0 elements, and a mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...

Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!