
Hadoop map reduce job modelling


I am fairly new to Hadoop and I need help modelling a map-reduce job.

I have two groups of files: GroupA and GroupB. Both groups have the same structure: one key,value pair per line. Group A and Group B share the same set of keys, but their values represent different properties. The files are large enough that Hadoop is a reasonable option.

The task is to combine the properties from group A and group B for each individual key into a third property for that key, and then sum up that third property across all the keys.

At first glance, the model seems to be:

  • Map -> collect the key-value pairs from both groups of files.
  • Combine/partition/sort/shuffle -> group entries with the same key into the same partition so they reach the same reducer (handled by Hadoop internally).
  • Reduce -> combine the values for each key into the third property and write the results to the output files.
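To make the dataflow concrete, here is a minimal Python simulation of that first job (this is not actual Hadoop code, just the map/shuffle/reduce logic in plain Python; the combine rule shown, multiplying the A and B values, is an assumed placeholder for whatever the real "third property" computation is):

```python
from collections import defaultdict

def combine(a_value, b_value):
    # Hypothetical rule for the "third property": product of the
    # group A value and the group B value for the same key.
    return a_value * b_value

def first_job(group_a, group_b):
    """Simulate job 1: map both groups, shuffle by key, reduce per key."""
    # Map phase: tag each record with its source group so the reducer
    # can tell an A-value from a B-value.
    mapped = [(k, ("A", v)) for k, v in group_a] + \
             [(k, ("B", v)) for k, v in group_b]

    # Shuffle phase: Hadoop groups all values for the same key together.
    shuffled = defaultdict(dict)
    for key, (group, value) in mapped:
        shuffled[key][group] = value

    # Reduce phase: combine the A and B values into the third property.
    return {k: combine(vals["A"], vals["B"]) for k, vals in shuffled.items()}
```

For example, `first_job([("x", 2), ("y", 3)], [("x", 4), ("y", 5)])` would produce one combined value per key, which is exactly the intermediate output the question then wants to sum.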

I am not sure how to model the third step, summing the third property across all keys. One way I can think of is to run a second map-reduce job after this one, which takes these files and combines them through a single reducer into the result. Is this the right way to model it? Is there any other way I can model this? Is it possible to have consecutive reducers, along the lines of map -> reduce -> reduce?


Solution

  • The model in Hadoop would be two map-reduce jobs triggered one after the other: the first job produces the per-key third property, and the second job sends every record to a single reducer that computes the global sum. Hadoop does not support map -> reduce -> reduce within one job, so chaining jobs is the standard approach. If you use Spark instead of plain Hadoop, there are built-in actions such as count or sum that can be invoked after the map stage to get the final aggregate without writing a second job by hand.