Tags: java, hadoop, hadoop-streaming, hadoop-partitioning, bigdata

Gathering multiple mappers' results in sorted order at the Reducer in Hadoop


I have multiple very large files (nearly 500 MB each) as input to my MR program. I split these files into equal-sized partitions. Each Mapper gets a single partition of a file.

Mapper: Key = (filename, partition_number) and Value = (character stream of the partition)

I apply some computation to the value (the character stream) in the mapper. I want to gather the results for one input file (for all of its partitions) in a single reducer, so I thought of using 'filename' as the reducer input key. But the mapper outputs must arrive at the reducer in sequence, i.e. [partition1 output + partition2 output + ... + partitionN output].

Can you please suggest the logic? Thanks.


Solution

  • You need a secondary sort. For an example see https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/

    In this case:

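    • Primary (sort) Comparator compares on [filename, partition_number], so each file's partitions arrive in partition order
    • Group Comparator on filename only, so all partitions of a file are handed to a single reduce() call
    • Partitioner on filename only, so all partitions of a file are sent to the same reducer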
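
To make the three roles concrete, here is a minimal sketch in plain Java that simulates the shuffle without any Hadoop dependency. All names in it (SecondarySortSketch, PartKey, MapOutput, shuffleAndReduce) are hypothetical and only for illustration, and the string concatenation stands in for whatever your reducer actually does. In a real job you would, as far as I know, implement the same logic as a WritableComparable composite key plus classes registered through Job.setSortComparatorClass, Job.setGroupingComparatorClass and Job.setPartitionerClass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of secondary sort (no Hadoop classes involved).
class SecondarySortSketch {

    // Composite mapper output key: (filename, partition number). Hypothetical name.
    static final class PartKey {
        final String filename;
        final int partition;
        PartKey(String filename, int partition) {
            this.filename = filename;
            this.partition = partition;
        }
    }

    // One mapper output record: composite key plus the result computed for that partition.
    static final class MapOutput {
        final PartKey key;
        final String value;
        MapOutput(PartKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Primary (sort) comparator: filename first, then partition number.
    static final Comparator<PartKey> SORT =
        Comparator.comparing((PartKey k) -> k.filename).thenComparingInt(k -> k.partition);

    // Group comparator: keys with the same filename belong to one reduce call.
    static final Comparator<PartKey> GROUP = Comparator.comparing((PartKey k) -> k.filename);

    // Partitioner: choose the reducer from the filename only.
    static int partitionFor(PartKey key, int numReducers) {
        return (key.filename.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulates shuffle + reduce; the "reduce" here just concatenates values in arrival order.
    static Map<String, String> shuffleAndReduce(List<MapOutput> outputs, int numReducers) {
        List<List<MapOutput>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (MapOutput o : outputs) {
            buckets.get(partitionFor(o.key, numReducers)).add(o);
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (List<MapOutput> bucket : buckets) {
            // Each reducer sees its records sorted by the full composite key.
            bucket.sort((x, y) -> SORT.compare(x.key, y.key));
            PartKey current = null;
            StringBuilder merged = new StringBuilder();
            for (MapOutput o : bucket) {
                // A change under the group comparator starts a new reduce call.
                if (current != null && GROUP.compare(current, o.key) != 0) {
                    result.put(current.filename, merged.toString());
                    merged = new StringBuilder();
                }
                merged.append(o.value);
                current = o.key;
            }
            if (current != null) {
                result.put(current.filename, merged.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Mapper outputs arrive in arbitrary order.
        List<MapOutput> outputs = List.of(
            new MapOutput(new PartKey("a.txt", 3), "C"),
            new MapOutput(new PartKey("b.txt", 1), "X"),
            new MapOutput(new PartKey("a.txt", 1), "A"),
            new MapOutput(new PartKey("b.txt", 2), "Y"),
            new MapOutput(new PartKey("a.txt", 2), "B"));
        Map<String, String> merged = shuffleAndReduce(outputs, 2);
        System.out.println(merged.get("a.txt")); // prints ABC
        System.out.println(merged.get("b.txt")); // prints XY
    }
}
```

The point of the sketch is that the mapper keeps emitting (filename, partition_number) as its key, rather than filename alone: the framework's sort then does the ordering for you, and the reducer never has to buffer and sort the values itself.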