I have been reading some literature on Hadoop Map/Reduce, and a recurring theme is that Hadoop jobs are I/O intensive (example: sorting with Map/Reduce).
What makes these jobs I/O intensive, given that Hadoop pushes computation to the data?
Example: why is sorting in Hadoop I/O intensive?
My intuition: after the map phase, the intermediate pairs are sent to the reducers. Is this what causes the huge I/O?
Hadoop is used for performing computations over large amounts of data. Your jobs might be bound by I/O (what you call I/O intensive), CPU, or network. In the classic Hadoop use case you perform local computations over huge amounts of input data while returning a relatively small result set, which makes your task more I/O intensive than CPU or network intensive, but this depends heavily on the job itself. Here are some examples:
- I/O intensive job. You read a lot of data on the map side, but the result of your map task is small. Examples: counting the number of rows in a text input, computing the sum over a column of an RCFile, or running a Hive query over a single table with a GROUP BY on a column of relatively small cardinality. Such a job mostly reads data and does some simple processing over it (see the first sketch after this list).
- CPU intensive job. You need to perform complex computations on the map or reduce side. For instance, you are doing some kind of NLP (natural language processing) such as tokenization, part-of-speech tagging, or stemming. Also, if you store the data in formats with high compression ratios, decompression might become the bottleneck of the process (here's an example from Facebook where they were looking for a balance between CPU and I/O).
- Network intensive job. Usually, high network utilization on the cluster means someone has missed the point and implemented a job that transfers too much data over the network. Take wordcount as an example: imagine processing 1PB of input data with only a mapper and a reducer, without a combiner. The amount of data moved between the map and reduce tasks would be even bigger than the input data set, and all of it would be sent over the network. It might also mean that intermediate data compression is not enabled (mapred.compress.map.output and mapred.map.output.compression.codec) and the raw map output is sent over the network. Both mitigations appear in the second sketch after this list.
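To make the first case concrete, here is a minimal sketch (my own illustration, with hypothetical names like RowCount) of an I/O-bound job: counting rows in a text input. The mapper reads every byte of its split but emits a single tiny record per task, so disk reads dominate while CPU and network do almost nothing.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RowCount {

    public static class CountMapper
            extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private long rows = 0;

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            rows++;  // just count; emit nothing per row
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // One tiny record per map task: the shuffle is negligible.
            context.write(NullWritable.get(), new LongWritable(rows));
        }
    }

    public static class SumReducer
            extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<LongWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(NullWritable.get(), new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "row count");
        job.setJarByClass(RowCount.class);
        job.setMapperClass(CountMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(1);  // a single, tiny result set
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```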
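And for the network case, a sketch of the classic wordcount with the two mitigations applied: a combiner (reusing the reducer) and compressed map output. The property names are the old pre-Hadoop-2 ones quoted above; on newer releases they are mapreduce.map.output.compress and mapreduce.map.output.compress.codec.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);  // one record per occurrence
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it crosses the network.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.DefaultCodec");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner pre-aggregates on the map side, so each map task
        // ships at most one record per distinct word instead of one per
        // occurrence. This is what keeps the shuffle smaller than the input.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```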
You can refer to this guide for initial tuning of the cluster.
So why is sorting I/O intensive? First, you read the data from the disks. Next, since in a sort the mappers produce the same amount of data they read, the map output most likely won't fit in memory and has to be spilled to local disk. It then gets transferred to the reducers and spilled to disk again during the merge, and finally it gets processed by the reducers and flushed to disk once more as the job output, so data of roughly the input's size hits the disks about four times and crosses the network once. Meanwhile, the CPU work needed for sorting is relatively small, especially if the sort key is a number that can easily be parsed from the input data.
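To see how little computation a Hadoop sort actually contains, here is a minimal sketch of a sort job, assuming the input is a SequenceFile with Text keys and values (an assumption of mine, not something from the question). The mapper and reducer are the framework's identity base classes, so virtually all the remaining work is the I/O described above; a production sort such as TeraSort adds a sampling partitioner on top of this to get a globally sorted result.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class Sort {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort");
        job.setJarByClass(Sort.class);
        // Identity map and identity reduce: all the sorting is done by the
        // shuffle's on-disk spill and merge, not by any user code.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```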