
ClusterDump in Mahout 0.9


I have a question about cluster dump in Mahout 0.9 when doing text clustering:

https://mahout.apache.org/users/clustering/clusteringyourdata.html

One case of cluster dump is to output the top k terms; for that you don't specify the parameter p (pointsDir).

The second case is where you do specify the parameter p (pointsDir), and you get the points associated with each cluster.

Both outputs have exactly the same cluster IDs, but the number of records shown in Case 1 (where the top terms are displayed) differs from the number of records appearing in Case 2 (where you get the points associated with a cluster).

Why does this happen? It's bizarre to see a different number of points associated with the same cluster, and I'm not sure which one is correct.

Has anyone seen this happening?

Thank you in advance!


Solution

  • After a lot of searching on the web about this issue, I finally found a link discussing the problem:

    http://qnalist.com/questions/4874723/mahout-clusterdump-output

    What caught my attention was the explanation below:

    I think the discrepancy between the number (n=) of vectors reported by the cluster and the number of points actually clustered by the -cl option is normal.

    * In the final iteration, points are assigned to (observed by) (classified as) each cluster based upon the distance measure and the cluster center computed from the previous iteration. The (n=) value records the number of points "observed by" the cluster in that iteration.
    * After the final iteration, a new cluster center is calculated for each cluster. This moves the center by some amount, less than the convergence threshold, but it moves.
    * During the subsequent classification (-cl) step, these new centers are used to classify the points for output. This will inevitably cause some points to be assigned to (observed by) (classified as) a different cluster and so the output clusteredPoints will reflect this final assignment.

    In small, contrived examples, the clustering will likely be more stable between the final iteration and the output of clustered points.
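
To make the quoted explanation concrete, here is a toy sketch in plain Python (not Mahout code) of one final k-means pass on 1-D data. The points, the previous centers, and the `convergence_delta` value are all made up for illustration, and the helpers `nearest` and `assign` are hypothetical names. It counts the points observed with the previous centers (the analogue of n=), recomputes the centers, and then classifies the points again with the new centers (the analogue of the -cl step).

```python
def nearest(point, centers):
    """Return the index of the center closest to a 1-D point."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def assign(points, centers):
    """Group points by their nearest center; one list of points per center."""
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    return groups


# Made-up data: seven 1-D points and the centers left by the previous iteration.
points = [0.0, 1.0, 2.0, 4.0, 5.0, 9.0, 11.0]
prev_centers = [1.0, 6.0]
convergence_delta = 1.5  # assumed threshold, chosen only for this toy example

# Final iteration: points are "observed" using the previous centers.
observed = assign(points, prev_centers)
n_values = [len(group) for group in observed]  # analogue of the n= value

# After the final iteration the centers are recomputed and move slightly.
new_centers = [sum(group) / len(group) for group in observed]
max_shift = max(abs(a - b) for a, b in zip(prev_centers, new_centers))
converged = max_shift < convergence_delta

# Classification step: the new centers are used to assign points for output.
classified = assign(points, new_centers)
classified_counts = [len(group) for group in classified]

print("n= per cluster (final iteration):", n_values)
print("new centers:", new_centers, "converged:", converged)
print("points per cluster (classification):", classified_counts)
```

With these made-up numbers the counts are [3, 4] in the final iteration and [4, 3] after classification: the point 4.0 sits near the boundary and switches clusters once the second center moves from 6.0 to 7.25. Following the quoted explanation, the clusteredPoints output corresponds to the second set of counts, since it reflects the final assignment.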