Tags: hadoop, mapreduce, mahout, canopy

How to increase the number of reducers in the canopy clustering algorithm


I'm running the canopy clustering algorithm with Mahout.

This is the command I'm running from the Mahout command line:

mahout canopy -i /mahout/o_seqsparse/tfidf-vectors -o /mahout/o_canopy -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -ow -t1 100 -t2 50

Below is the number of map and reduce tasks running:

No. of map tasks running --> 6

No. of reduce tasks running --> 1

But this takes too long because there is only one reducer. I think that if I can increase the number of reduce tasks, I will get better performance.

I also tried increasing the number of map and reduce tasks by setting mapred.map.tasks and mapred.reduce.tasks in mapred-site.xml, but this has no effect: the job still runs with 1 reducer.


Solution

  • You didn't specify the version of Mahout you are using, but looking at the source code of 0.4 here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.4/org/apache/mahout/clustering/canopy/CanopyDriver.java

    You can see that the number of reducers is hard-coded to 1. I don't think you can override it through configuration.

    EDIT

    For version 0.9, as you specified, check line 354 here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.9/org/apache/mahout/clustering/canopy/CanopyDriver.java/

    job.setNumReduceTasks(1);
    

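    Modify this and build again. However, the map output must still be sent to a single reducer, because the reducer merges the canopy centers produced by all mappers into the final set of canopies. For this clustering job, I don't believe you will benefit much from increasing the number of reducers.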
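To illustrate why the map output has to end up in one place, here is a minimal sketch in plain Java (no Hadoop or Mahout dependencies). It is not Mahout's implementation: the class name `CanopySketch`, the method names, and the sample points are hypothetical, and the canopy pass is simplified to use only the T2 threshold with raw points as centers. Each "mapper" finds canopy centers on its own split, and a single "reducer" pass merges them.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    // Squared Euclidean distance, the measure chosen with -dm in the question.
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Simplified canopy pass (assumption: only T2 is used, and the first
    // uncovered point becomes the center). A point starts a new canopy
    // unless it lies within t2 of an existing center.
    public static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean covered = false;
            for (double[] c : centers) {
                if (squaredDistance(p, c) < t2) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double t2 = 50.0;

        // Two hypothetical input splits, one per map task.
        List<double[]> split1 = List.of(
                new double[]{0, 0}, new double[]{1, 1}, new double[]{20, 20});
        List<double[]> split2 = List.of(
                new double[]{0.5, 0.5}, new double[]{21, 19});

        // "Map" phase: each split produces its own local canopy centers.
        List<double[]> localCenters = new ArrayList<>();
        localCenters.addAll(canopyCenters(split1, t2));
        localCenters.addAll(canopyCenters(split2, t2));

        // "Reduce" phase: one pass over ALL local centers removes the
        // near-duplicates that different splits found independently.
        List<double[]> finalCenters = canopyCenters(localCenters, t2);

        System.out.println("local centers: " + localCenters.size());
        System.out.println("final centers: " + finalCenters.size());
    }
}
```

In this sketch, both splits find a center near the origin and another near (20, 20). If those local centers were divided between two reducers, each reducer could keep its own copy of what is really the same canopy, which is why the merge step needs to see all of them together.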