I have followed many online tutorials to run the k-means example that ships with Mahout, but I have not yet managed to get meaningful output. The main problem I am facing is the conversion from text files to sequence files and back.
When I followed the steps on the Mahout wiki for "Clustering of synthetic control data" (https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html), I could run the clustering process (using $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job), and that produced some readable console output. However, because the output is large, I want the clustering process to write its results to files. The output files that Mahout clustering generates are all sequence files, and I can't convert them to readable files. When I tried "clusterdump" ($MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10...), I got errors. First it complains that the "seqFileDir" option is unexpected, so I guess either clusterdump has no "seqFileDir" option or I am missing something.
Using Mahout the way "Mahout in Action" does seems tricky. I am not sure which classes are required ("import ??") to compile that code.
Can you please suggest the steps to successfully run k-means on Mahout? In particular, how do I get readable output from sequence files?
Regarding the second question: you can obtain the source code for the book from the repository. The code in the master branch is for Mahout 0.5, while the code in the mahout-0.6 and mahout-0.7 branches is for the corresponding Mahout versions.
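To make the branch choice concrete, here is a minimal shell sketch of picking the branch that matches your Mahout version. The function name branch_for_version, the REPO_URL and MAHOUT_VERSION variables, and the book-examples directory are all made up for illustration; REPO_URL has to be set to the address of the book's repository, and nothing is cloned unless it is set.

```shell
#!/bin/sh
# Map a Mahout version to the matching branch of the book's code:
# master is for 0.5, mahout-0.6 and mahout-0.7 for the later versions.
# (branch_for_version is a hypothetical helper, not part of Mahout.)
branch_for_version() {
  case "$1" in
    0.5) echo "master" ;;
    0.6) echo "mahout-0.6" ;;
    0.7) echo "mahout-0.7" ;;
    *)   echo "no branch known for Mahout $1" >&2; return 1 ;;
  esac
}

# REPO_URL is a placeholder for the book repository's address.
# MAHOUT_VERSION defaults to 0.5, the version the book code was checked against.
if [ -n "${REPO_URL:-}" ]; then
  git clone "$REPO_URL" book-examples
  cd book-examples && git checkout "$(branch_for_version "${MAHOUT_VERSION:-0.5}")"
fi
```

For example, running it with REPO_URL set and MAHOUT_VERSION=0.6 would check out the mahout-0.6 branch.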
The source code is also posted on the book's site, so you can download it there (but that version is only for Mahout 0.5).
P.S. If you are reading the book right now, I recommend using Mahout 0.5 or 0.6: all the code was checked against version 0.5, and for other versions it will differ. This is especially true for the clustering code in Mahout 0.7.