java performance hadoop mapreduce elastic-map-reduce

Mapper vs. Reducer computation time and effect on network performance in Hadoop


I have to generate n*(n-1)/2 candidate pairs, from a list of n candidates.

This can be done in every mapper instance or in every reducer instance.

But I observed that when this operation was done in the Reduce phase, it was much faster than when it was done in the Map phase. What is the reason?

Can Mappers not support heavy computation?

What is the impact of a Mapper instance doing such a computation on the network?

Thanks!


Solution

  • The short answer: when the mapper generates the data, Hadoop has to copy that data from the mappers to the reducers, and this copy takes a lot of time.

    total size of the generated data

    The total data generated is O(n^2).
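
    As a minimal sketch of where the O(n^2) comes from, assuming the candidates are plain strings (the class and method names here are illustrative and not taken from the question):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class CandidatePairs {
        /** Returns every unordered pair (i < j) drawn from the candidate list. */
        static List<String[]> allPairs(List<String> candidates) {
            int n = candidates.size();
            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // Pair candidate i only with later candidates, so each pair appears once.
                for (int j = i + 1; j < n; j++) {
                    pairs.add(new String[] {candidates.get(i), candidates.get(j)});
                }
            }
            return pairs;
        }

        public static void main(String[] args) {
            List<String> candidates = List.of("a", "b", "c", "d");
            // n = 4, so n*(n-1)/2 = 6 pairs are produced.
            System.out.println(allPairs(candidates).size()); // prints 6
        }
    }
    ```

    The nested loop itself is cheap. What matters in Hadoop is where its output has to travel afterwards.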

    comparison of data generation by mapper vs. reducer

    If you generate the n*(n-1)/2 pairs in the mapper, the intermediate data has to be copied to the reducers. In Hadoop this step is called the shuffle phase. The reducer then still has to write the data to HDFS. In your case, the total data read from and written to hard disk during the shuffle phase can be 6 * sizeof(intermediate data), which is very large.

    If the data is generated by the reducer instead, the O(n^2) intermediate data transfer is unnecessary, so performance can be better.

    So your performance issue is mainly caused by data transfer, not computation. Without the disk access, the mapper and the reducer would perform about the same.
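
    To get a feel for the scale, here is a rough back-of-the-envelope sketch. It takes the 6x multiplier quoted above at face value as an upper-end figure, and the candidate count and bytes-per-pair values are hypothetical inputs, not measurements from the question:

    ```java
    public class ShuffleCostEstimate {
        /** Number of unordered pairs among n candidates: n*(n-1)/2. */
        static long pairCount(long n) {
            return n * (n - 1) / 2;
        }

        /** Bytes of intermediate data if every pair is emitted by a mapper. */
        static long intermediateBytes(long n, long bytesPerPair) {
            return pairCount(n) * bytesPerPair;
        }

        /** Upper-end shuffle disk traffic, using the 6x multiplier from the answer. */
        static long shuffleDiskBytes(long n, long bytesPerPair) {
            return 6 * intermediateBytes(n, bytesPerPair);
        }

        public static void main(String[] args) {
            long n = 100_000;        // hypothetical number of candidates
            long bytesPerPair = 16;  // hypothetical serialized size of one pair
            System.out.println("pairs:              " + pairCount(n));
            System.out.println("intermediate bytes: " + intermediateBytes(n, bytesPerPair));
            System.out.println("shuffle disk bytes: " + shuffleDiskBytes(n, bytesPerPair));
        }
    }
    ```

    With these assumed inputs the mapper-side strategy moves tens of gigabytes through the shuffle, while the reducer-side strategy only has to shuffle the n original candidates.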

    ways to improve performance of the mapper data generation strategy

    If you still want to use the mapper to generate the data, tuning io.sort.factor and turning on compression of the map output may help improve performance.
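
    One hedged way to try these settings is to pass them as -D generic options, assuming the job driver is launched through ToolRunner so that such options are parsed. The sketch below only builds the option string. It uses the newer property names mapreduce.task.io.sort.factor and mapreduce.map.output.compress (older releases use io.sort.factor and mapred.compress.map.output), and the value 50 is an illustrative guess rather than a recommendation:

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.StringJoiner;

    public class ShuffleTuning {
        /** Candidate shuffle settings; the values are illustrative assumptions. */
        static Map<String, String> shuffleSettings() {
            Map<String, String> settings = new LinkedHashMap<>();
            // Merge more spill streams at once during the sort (assumed value).
            settings.put("mapreduce.task.io.sort.factor", "50");
            // Compress map output before it is written to disk and shuffled.
            settings.put("mapreduce.map.output.compress", "true");
            return settings;
        }

        /** Renders the settings as "-D key=value" generic options. */
        static String toGenericOptions(Map<String, String> settings) {
            StringJoiner joiner = new StringJoiner(" ");
            for (Map.Entry<String, String> entry : settings.entrySet()) {
                joiner.add("-D").add(entry.getKey() + "=" + entry.getValue());
            }
            return joiner.toString();
        }

        public static void main(String[] args) {
            System.out.println(toGenericOptions(shuffleSettings()));
        }
    }
    ```

    Compression trades CPU time for less disk and network traffic, so it is worth measuring on your own job before keeping it.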