
Mahout Lucene document clustering howto?


I've read that I can create Mahout vectors from a Lucene index, and that those vectors can then be used with the Mahout clustering algorithms: http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text

I would like to apply the K-means clustering algorithm to the documents in my Lucene index, but it is not clear how to apply it (or hierarchical clustering) to extract meaningful clusters from these documents.

The page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How do I "declare" that these are my documents (or their vectors) so I can simply take them and do the clustering?

Sorry in advance for my poor grammar.

Thank you


Solution

  • If you have vectors, you can run KMeansDriver. Here is its help output; a sketch of the full pipeline follows after it.

    Usage:
     [--input <input> --clusters <clusters> --output <output> --distance <distance>
    --convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
    --vectorClass <vectorClass> --overwrite --help]
    Options
      --input (-i) input                The Path for input Vectors. Must be a
                                        SequenceFile of Writable, Vector
      --clusters (-c) clusters          The input centroids, as Vectors.  Must be a
                                        SequenceFile of Writable, Cluster/Canopy.
                                        If k is also specified, then a random set
                                        of vectors will be selected and written out
                                        to this path first
      --output (-o) output              The Path to put the output in
      --distance (-m) distance          The Distance Measure to use.  Default is
                                        SquaredEuclidean
      --convergence (-d) convergence    The threshold below which the clusters are
                                        considered to be converged.  Default is 0.5
      --max (-x) max                    The maximum number of iterations to
                                        perform.  Default is 20
      --numReduce (-r) numReduce        The number of reduce tasks
      --k (-k) k                        The k in k-Means.  If specified, then a
                                        random selection of k Vectors will be
                                        chosen as the Centroid and written to the
                                        clusters output path.
      --vectorClass (-v) vectorClass    The Vector implementation class name.
                                        Default is SparseVector.class
      --overwrite (-w)                  If set, overwrite the output directory
      --help (-h)                       Print out help
    
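    For the Lucene case specifically, a minimal end-to-end sketch might look like the following. The lucene.vector step comes from the "Creating Vectors from Text" page linked in the question; flag names and output layout vary between Mahout versions, so treat the paths, the field names (body, id), and k=10 as placeholders and check --help on your install. Depending on the version, the kmeans shortcut may not exist and you may need to invoke org.apache.mahout.clustering.kmeans.KMeansDriver directly.

      # 1. Turn the Lucene index into Mahout vectors
      #    (a SequenceFile of Writable, Vector -- the format KMeansDriver expects).
      #    'body' and 'id' are assumed field names; adjust them to your index schema.
      $MAHOUT_HOME/bin/mahout lucene.vector \
        --dir /path/to/lucene/index \
        --field body \
        --idField id \
        --dictOut /tmp/dict.txt \
        --output /tmp/vectors

      # 2. Run k-means on those vectors. Because --k is given, KMeansDriver selects
      #    k random vectors as the initial centroids and writes them to --clusters,
      #    so no hand-made initial clusters are needed.
      $MAHOUT_HOME/bin/mahout kmeans \
        --input /tmp/vectors \
        --clusters /tmp/initial-clusters \
        --output /tmp/kmeans-out \
        --k 10 \
        --max 20 \
        --overwrite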

    Update: copy the result directory from HDFS to the local filesystem, then use the ClusterDumper utility to get each cluster and the list of documents in it.
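    A sketch of that dump step, under the same assumptions as above: the clusterdump flags and the clusters-<n>/clusteredPoints directory names differ between Mahout versions (clusters-<n> stands for whatever the final iteration directory is on your run), so list the output directory and check clusterdump --help first.

      # Dump each cluster's top terms (resolved through the dictionary)
      # and the documents assigned to it.
      $MAHOUT_HOME/bin/mahout clusterdump \
        --seqFileDir /tmp/kmeans-out/clusters-<n> \
        --pointsDir /tmp/kmeans-out/clusteredPoints \
        --dictionary /tmp/dict.txt \
        --dictionaryType text \
        --output clusters.txt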