I have to generate n*(n-1)/2 candidate pairs from a list of n candidates.
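For concreteness, this is the kind of pair generation I mean (a minimal sketch; the class name is just illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PairGen {
    // Given a list of n candidates, emit every unordered pair: n*(n-1)/2 in total.
    static List<String[]> candidatePairs(List<String> candidates) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            for (int j = i + 1; j < candidates.size(); j++) {
                pairs.add(new String[] { candidates.get(i), candidates.get(j) });
            }
        }
        return pairs;
    }
}
```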
This can be done either in every mapper instance or in every reducer instance.
But I observed that when this operation was done in the Reduce phase, it was much faster than when done in the Map phase. What is the reason?
Can Mappers not support heavy computation?
What is the impact of a Mapper instance doing such a computation on the network?
Thanks!
The short answer is: when you use the mapper to generate the data, Hadoop has to copy that data from the mappers to the reducers, and that copy costs a lot of time.
The total data generated is O(n^2). If you generate the n*(n-1)/2 pairs in the mapper, that intermediate data has to be copied to the reducers. This step in Hadoop is called the Shuffle phase, and the reducers still need to write the data out to HDFS. The total data read from and written to disk in your case during the shuffle phase can be about 6 * sizeof(intermediate data), which is very large.
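To get a feel for the scale (purely illustrative numbers): with n = 100,000 candidates you get n*(n-1)/2 ≈ 5 * 10^9 pairs. At, say, 20 bytes per serialized pair, that is roughly 100 GB of intermediate data, so 6 * sizeof(intermediate data) comes to around 600 GB of disk traffic just for the shuffle.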
Whereas if the data is generated in the reducer, transferring the O(n^2) intermediate data is unnecessary, so it can perform much better.
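As a sketch of the reducer-side variant (assuming the mappers just route each candidate to some group key; the class and types here are illustrative, not your actual job):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: collects the candidates that share a key and only then
// expands them into n*(n-1)/2 pairs, so no pair ever crosses the network.
public class PairReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> candidates = new ArrayList<>();
        for (Text v : values) {
            candidates.add(v.toString()); // copy: Hadoop reuses the Text object
        }
        for (int i = 0; i < candidates.size(); i++) {
            for (int j = i + 1; j < candidates.size(); j++) {
                context.write(new Text(candidates.get(i)), new Text(candidates.get(j)));
            }
        }
    }
}
```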
So your performance issue is mainly caused by data transfer, not computation. If no disk access were involved, the mapper and the reducer would have essentially the same compute performance.
If you still want to generate the data in the mapper, tuning io.sort.factor and turning on compression of the map output may help improve performance.
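For example, with the newer MapReduce API these knobs can be set on the job configuration like this (a sketch; the values are illustrative, and using the Snappy codec assumes its native library is available on your cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Merge more spill files per pass during the map-side sort/merge
        conf.setInt("io.sort.factor", 100); // newer releases: mapreduce.task.io.sort.factor
        // Compress map output so the shuffle moves and spills fewer bytes
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        Job job = Job.getInstance(conf, "candidate-pairs");
        // ... set mapper/reducer classes and I/O paths, then job.waitForCompletion(true)
    }
}
```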