machine-learning, cluster-computing, mahout

Mahout Clustering not reading input


Hi guys, I am trying to run a cluster dump for a k-means clustering job, but it's not working. Any idea why? This is the example from Mahout in Action, run on a pseudo-distributed cluster.

Also, is there any tool or means to visualize the output of the cluster dump or of the k-means run?

[186946@01HW534064 bin]$ ./mahout clusterdump -dt sequencefile -d /home/186946/reuters-vectors/dictionary.file-0-i reuters-fkmeans-clusters/clusters-3 -o /home/186946/clusters.txt -b 10 -n 10
Running on hadoop, using HADOOP_HOME=/home/186946/hadoop-0.20.2-cdh3u5
No HADOOP_CONF_DIR set, using /home/186946/hadoop-0.20.2-cdh3u5/src/conf 
MAHOUT-JOB: /home/186946/mahout-0.5-cdh3u5/mahout-examples-0.5-cdh3u5-job.jar
MAHOUT-JOB: /home/186946/mahout-0.5-cdh3u5/mahout-examples-0.5-cdh3u5-job.jar
13/03/08 17:26:11 ERROR common.AbstractJob: Unexpected reuters-fkmeans-clusters/clusters-3 while processing Job-Specific Options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Unexpected reuters-fkmeans-clusters/clusters-3 while processing Job-Specific    
Options:                                                                        
Usage:                                                                          
 [--seqFileDir <seqFileDir> --output <output> --substring <substring>           
--numWords <numWords> --pointsDir <pointsDir> --dictionary <dictionary>         
--dictionaryType <dictionaryType> --help --tempDir <tempDir> --startPhase       
<startPhase> --endPhase <endPhase>]                                             
Job-Specific Options:                                                           
  --seqFileDir (-s) seqFileDir             The directory containing Sequence    
                                           Files for the Clusters               
  --output (-o) output                     Optional output directory. Default   
                                           is to output to the console.         
  --substring (-b) substring               The number of chars of the           
                                           asFormatString() to print            
  --numWords (-n) numWords                 The number of top terms to print     
  --pointsDir (-p) pointsDir               The directory containing points      
                                           sequence files mapping input vectors 
                                           to their cluster.  If specified,     
                                           then the program will output the     
                                           points associated with a cluster     
  --dictionary (-d) dictionary             The dictionary file                  
  --dictionaryType (-dt) dictionaryType    The dictionary file type             
                                           (text|sequencefile)                  
  --help (-h)                              Print out help                       
  --tempDir tempDir                        Intermediate output directory        
  --startPhase startPhase                  First phase to run                   
  --endPhase endPhase                      Last phase to run                    
13/03/08 17:26:11 INFO driver.MahoutDriver: Program took 133 ms

Thanks


Solution

  • mahout clusterdump \
    -d output/vectors/dictionary.file-0 \
    -dt sequencefile \
    -i output/clusters/clusters-2-final/part-00000 \
    -n 20 \
    -b 100 \
    -o cdump.txt \
    -p output/clusters/clusteredPoints/
    

    Just copy and paste the lines above into a text editor and carefully substitute your own paths for -d, -dt, -i and -p, keeping the same structure as mine. Note that in your original command there is no space between the dictionary path and the next flag (`dictionary.file-0-i`), so `reuters-fkmeans-clusters/clusters-3` is not attached to any option, which is why the parser reports it as unexpected.

    P.S. The paths are HDFS paths.
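
    Not part of the original answer, but a quick sanity check before rerunning: confirm that the HDFS inputs actually exist, then skim the dump. This sketch reuses the example paths from the command above and assumes `hadoop` is on your PATH, that the -o file lands on the local filesystem, and that it contains a "Top Terms:" section per cluster (as in the book's Reuters example).

        # Check that the inputs clusterdump reads are really in HDFS
        hadoop fs -ls output/vectors/dictionary.file-0
        hadoop fs -ls output/clusters/clusters-2-final/part-00000
        hadoop fs -ls output/clusters/clusteredPoints/

        # Skim the locally written dump, a few clusters at a time
        grep -A 10 "Top Terms:" cdump.txt | less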