So I'm trying to figure out how to interpret/analyse this clustering output I have. I have 50 folders, called clusters-0, clusters-1, clusters-2 and so on. This is because I said '-k 50' in my command. I thought these folders each contained one cluster, but now I'm not sure.
Running kmeans with '--help' says that the '-cl' switch will: "If present, run clustering after the iterations have taken place."
So, does that mean that you need to use '-cl' for the clustering to actually happen?
If '-cl' is not used, are all those fifty folders just the output of successive iterations of the k-means algorithm, with no output that actually contains the clusters?
Does each of those folders contain fifty clusters, and the final one is the best, most refined set of clusters?
About the folder structure that Mahout k-means generates:
/clusters - contains the initial centroids of the clusters; the distance from each individual data point to these centroids is what gets measured.
/output/clusteredPoints - contains the sequence file that holds the cluster ID and the data used for clustering in (key, value) format. As far as I know, this folder is only written when '-cl' is given.
/output/clusters-* - each of these folders contains the newly computed cluster centroids for one iteration.
/output/clusters-*-final - contains the final cluster details.
Here's what I have in it:
VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
VL-376{n=1607 c=[-0.068, 0.184, 0.787] r=[0.152, 0.020, 0.113]}
VL-3492{n=375 c=[0.616, 0.111, 0.803] r=[0.289, 0.068, 0.227]}
VL-347{n=507 c=[-0.496, 0.166, 0.574] r=[0.169, 0.078, 0.196]}
VL-992{n=595 c=[0.154, 0.267, -0.394] r=[0.212, 0.083, 0.282]}
VL-2468{n=189 c=[-0.696, -0.008, -0.494] r=[0.247, 0.213, 0.372]}
Here I have 6 clusters, and each line gives the cluster ID (1123), the number of records in the cluster (n=615), the cluster centroid (c) and the radius (r).
Also, the VL prefix means the cluster has converged, which is a good thing. Hope it helps!!
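If you want to work with those lines programmatically, here is a minimal Python sketch that parses one line of the dump shown above. It assumes the dense text layout exactly as printed here (prefix, ID, n, c, r); parse_cluster_line is a hypothetical helper name, not part of Mahout, and sparse or named vectors would need a different pattern.

```python
import re

# Pattern for one dump line in the dense layout shown above, e.g.
# VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
LINE_RE = re.compile(
    r"^(?P<prefix>[A-Z]+)-(?P<id>\d+)\{n=(?P<n>\d+) "
    r"c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)


def _to_floats(text):
    """Turn '0.655, 0.175, -1.042' into a list of floats."""
    return [float(part) for part in text.split(",")]


def parse_cluster_line(line):
    """Parse one cluster line into a dict (hypothetical helper, not a Mahout API)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised cluster line: {line!r}")
    return {
        # Per the answer above, a VL prefix marks a converged cluster.
        "converged": match.group("prefix") == "VL",
        "id": int(match.group("id")),
        "n": int(match.group("n")),
        "centroid": _to_floats(match.group("c")),
        "radius": _to_floats(match.group("r")),
    }


cluster = parse_cluster_line(
    "VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}"
)
print(cluster["id"], cluster["n"], cluster["centroid"])
```

Running this over every line of the final folder's dump gives you one dict per cluster, so you can sort by n or compare centroids directly.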