In a mapreduce job consisting of select count(*) from products where id = 2
, where does the count(*)
operation take place, is it in the mapper or the reducer?
It can be both mapper and reducer or reducer only aggregation.
With map-side aggregation enabled:
hive.map.aggr=true;
data will be pre-aggregated (in the scope of split processed) on each mapper using Hash table. Reducer will do final aggregation of partial results received from mapper.
The mappers will output the pairs (#{token}, #{token_count})
. The Hadoop framework again sorts these pairs and the reducers sum the values to produce the total counts for each token. In this case, the mappers will each output one row for each token every time the map is flushed instead of one row for each occurrence of each token. The tradeoff is that they need to keep a map of all tokens in memory.
If map-side aggregation is switched-off: hive.map.aggr=false
, mapper will filter rows and send them to the reducer, reducer will do the aggregation, this can cause high network IO.
Read more details about Map-side Aggregation in Hive. See also related https://stackoverflow.com/a/61772631/2700344