hadoop mapreduce emr

Running multiple "light" MapReduce jobs or a single "heavy" MapReduce job


I am writing a MapReduce program that will run on AWS EMR.
My program calculates probabilities from the Google Ngram corpus.
I was wondering whether there is a difference between running a single MapReduce job that handles all the calculations at once and running multiple MapReduce jobs that each handle one calculation.
Neither approach uses any data structures (arrays, lists...).
Is there a difference in terms of efficiency or network communication?
Both do exactly the same work in the same manner; the only difference is that I split the reducer's calculations across separate jobs.


Solution

  • Yes, there will be a difference, but its magnitude depends on your MapReduce program.

    The difference comes from per-job overhead. When you run multiple light MapReduce jobs, each one has to start and execute its own map and reduce tasks. Every job needs containers allocated, which means the ApplicationMaster has to communicate back and forth with the ResourceManager and the NodeManagers; new log files are generated; network communication with the NameNode and DataNodes is required; and there are other overheads as well. So a single heavy MapReduce job is better than several light ones if your program is not that large.

    But if your single MapReduce program is so large and complex that it overloads the JVM and memory (which in my opinion is highly unlikely unless your cluster hardware is very minimal), then multiple small MapReduce jobs are more feasible.

    From your question, my intuition is that your MapReduce program is not that large, so I suggest you go ahead with a single heavy MapReduce job.
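
To make the overhead argument concrete, here is a rough back-of-the-envelope sketch in plain Java (no Hadoop dependency). It assumes every job pays the same fixed startup cost and that the total calculation time is the same whether the work is split or not. The class name, the method name, and the numbers (30 s of overhead per job, 600 s of work) are hypothetical placeholders, not measurements from EMR. The model also ignores that each light job would read the input again, which would only widen the gap.

```java
public class JobOverheadSketch {

    /**
     * Estimates wall-clock seconds for running `jobs` MapReduce jobs back to back.
     *
     * fixedOverheadSec: assumed per-job cost (container allocation, task startup, logs).
     * totalWorkSec:     assumed time for all calculations, identical in both setups.
     */
    public static double estimateSeconds(int jobs, double fixedOverheadSec, double totalWorkSec) {
        if (jobs < 1) {
            throw new IllegalArgumentException("jobs must be >= 1");
        }
        // Each job pays the fixed overhead once; the useful work is only paid once in total.
        return jobs * fixedOverheadSec + totalWorkSec;
    }

    public static void main(String[] args) {
        double overheadSec = 30.0; // hypothetical value, not measured
        double workSec = 600.0;    // hypothetical value, not measured

        System.out.println("single heavy job: " + estimateSeconds(1, overheadSec, workSec) + " s");
        System.out.println("four light jobs:  " + estimateSeconds(4, overheadSec, workSec) + " s");
    }
}
```

Under these assumptions the extra cost grows linearly with the number of jobs, so it matters most when the calculations themselves are short. To get real numbers, time both variants on your own cluster.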