
Mahout - Seq2Sparse Single Reducer


I've been running a seq2sparse job for several days now and it just doesn't finish. The main reason is that most of the "sub-jobs" run with only one reducer, while each of them has many mappers.

I specified --numReducers=n when invoking seq2sparse from the command line, but that option is only honored in some places, such as MakePartialVectors, and not in sub-jobs like Prune Vectors.

What could be the reason?


Solution

  • I looked at the code and realized that the numReducers value is not passed along to all of the sub-jobs, so those jobs are created with the default reducer count, i.e. 1.

    To work around this limitation, pass -Dmapred.reduce.tasks=n when invoking the job from the command line, in addition to the --numReducers=n option. The -D flag sets the Hadoop job property directly, so it reaches every sub-job regardless of whether Mahout forwards numReducers to it.

    It is still necessary to specify --numReducers as well, since the Mahout CLI defaults it to one.

    So an example command would be:

    ./mahout seq2sparse -Dmapred.reduce.tasks=10 -i seq-files -o vectors -nv -wt tfidf -ng 2 --numReducers 10 --maxDFPercent 90 --minDF 2 --norm 2 --minLLR 20
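One caveat worth noting: on Hadoop 2.x the property mapred.reduce.tasks was deprecated in favor of mapreduce.job.reduces. A hedged variant of the command above, setting both names so it works whichever Hadoop version backs the cluster (the property values and Mahout flags are otherwise unchanged from the example above), might look like:

```shell
# Sketch, assuming the cluster may run either Hadoop 1.x or 2.x:
# pass both the old and the new reducer-count property names.
# -D options are generic Hadoop options and must come right after
# the subcommand, before the Mahout-specific flags.
./mahout seq2sparse \
  -Dmapred.reduce.tasks=10 \
  -Dmapreduce.job.reduces=10 \
  -i seq-files -o vectors \
  -nv -wt tfidf -ng 2 \
  --numReducers 10 \
  --maxDFPercent 90 --minDF 2 --norm 2 --minLLR 20
```

Setting an already-deprecated property alongside its replacement is harmless; Hadoop maps the old name onto the new one, so the duplicate simply makes the intent explicit.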