I'm running the canopy clustering algorithm using Mahout.
This is the command I'm running through the Mahout command line:
mahout canopy -i /mahout/o_seqsparse/tfidf-vectors -o /mahout/o_canopy -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -ow -t1 100 -t2 50
Below is the number of map and reduce tasks running:
No. of map tasks running --> 6
No. of reduce tasks running --> 1
But this takes too long because there is only one reducer. I think that if I could increase the number of reduce tasks, I would get better performance.
I also tried increasing the number of map and reduce tasks by setting mapred.map.tasks and mapred.reduce.tasks in mapred-site.xml.
But this has no effect; the job still runs with 1 reducer.
You didn't specify the version of Mahout you are using, but looking at the source code of 0.4 here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.4/org/apache/mahout/clustering/canopy/CanopyDriver.java
you can see that 1 reducer is hard-coded. I don't think you can override it.
For version 0.9, as you specified, check here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.9/org/apache/mahout/clustering/canopy/CanopyDriver.java/ at line no. 354:
job.setNumReduceTasks(1);
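You can modify this and build again. However, the map output must be sent to a single reducer, because that reducer has to see the canopies from every mapper in order to merge them. For this kind of clustering, I don't believe you will benefit much from increasing the number of reducers.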
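To illustrate why the map output ends up in one reducer, here is a minimal plain-Java sketch of the two-phase idea. It is not Mahout's code: the class CanopySketch and its methods are hypothetical names, it ignores T1, and it uses raw points as canopy centers where the real job works with centroids. Each "mapper" finds canopy centers in its own split, and a single "reducer" then runs the same pass over all emitted centers to merge duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
final class CanopySketch {

    // Squared Euclidean distance, the measure chosen with -dm in the question.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Greedy pass: a point starts a new canopy unless it lies
    // within t2 of a center that already exists.
    static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (double[] c : centers) {
                if (squaredDistance(p, c) < t2) {
                    stronglyBound = true;
                    break;
                }
            }
            if (!stronglyBound) {
                centers.add(p);
            }
        }
        return centers;
    }

    // "Map": each split is processed on its own. "Reduce": one pass over
    // the centers from all splits, so overlapping canopies get merged.
    static List<double[]> mapThenReduce(List<List<double[]>> splits, double t2) {
        List<double[]> emitted = new ArrayList<>();
        for (List<double[]> split : splits) {
            emitted.addAll(canopyCenters(split, t2));
        }
        return canopyCenters(emitted, t2);
    }

    public static void main(String[] args) {
        List<double[]> splitA = Arrays.asList(
                new double[]{0, 0}, new double[]{1, 1}, new double[]{100, 100});
        List<double[]> splitB = Arrays.asList(
                new double[]{2, 2}, new double[]{101, 101});
        double t2 = 50.0;

        int emitted = canopyCenters(splitA, t2).size() + canopyCenters(splitB, t2).size();
        int merged = mapThenReduce(Arrays.asList(splitA, splitB), t2).size();
        System.out.println("centers emitted by mappers: " + emitted);
        System.out.println("centers after the single reducer: " + merged);
    }
}
```

In this toy input both splits contain points from the same two groups, so each mapper reports both groups. If the emitted centers were spread over several reducers, no reducer would see all of them and the duplicates could survive. The mappers emit only a handful of centers, so the reduce input is small compared with the map input.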