I have multiple very large files (nearly 500 MB) as input to my MapReduce program. I split these files into equal-size partitions, and each mapper gets a single partition of a file.
Mapper: key = (filename, partition_number), value = (character stream of the partition)
I apply some computation to the value (the character stream) in the mapper. I want to gather the results for one input file (for all of its partitions) in a single reducer, so I thought of using 'filename' as the reducer input key. But the mapper outputs must be gathered in partition order in the reducer, like [partition1 output + partition2 output + ... + partitionN output].
Can you please suggest the logic? Thanks.
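You need a secondary sort: emit a composite key (filename, partition_number) from the mapper, partition and group on the filename only, and sort on the full composite key, so that each reducer call sees all values for one file in partition order. For an example see https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/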
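To make the ordering idea concrete, here is a minimal plain-Java sketch that only simulates what the shuffle would do. It does not use the Hadoop API, and the names `SecondarySortSketch`, `MapOutput`, and `reduceInOrder` are made up for illustration. In a real job the same three roles would, as far as I know, be wired up with `Job.setPartitionerClass` (partition by filename), `Job.setSortComparatorClass` (sort by filename, then partition number), and `Job.setGroupingComparatorClass` (group by filename).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    // Hypothetical mapper output: composite key (filename, partition) plus the mapper's result.
    static final class MapOutput {
        final String filename;
        final int partition;
        final String result;

        MapOutput(String filename, int partition, String result) {
            this.filename = filename;
            this.partition = partition;
            this.result = result;
        }
    }

    // Sort order on the full composite key: filename first, then partition number.
    static final Comparator<MapOutput> SORT_ORDER =
            Comparator.comparing((MapOutput o) -> o.filename)
                      .thenComparingInt(o -> o.partition);

    // Simulates shuffle + reduce: sort by the composite key, group by filename only,
    // and concatenate each file's results in partition order.
    static Map<String, String> reduceInOrder(List<MapOutput> mapOutputs) {
        List<MapOutput> sorted = new ArrayList<>(mapOutputs);
        sorted.sort(SORT_ORDER);

        Map<String, StringBuilder> groups = new LinkedHashMap<>();
        for (MapOutput o : sorted) {
            groups.computeIfAbsent(o.filename, k -> new StringBuilder()).append(o.result);
        }

        Map<String, String> reduced = new LinkedHashMap<>();
        groups.forEach((file, joined) -> reduced.put(file, joined.toString()));
        return reduced;
    }

    public static void main(String[] args) {
        // Mapper outputs arrive in arbitrary order.
        List<MapOutput> outputs = Arrays.asList(
                new MapOutput("b.txt", 2, "B2"),
                new MapOutput("a.txt", 3, "A3"),
                new MapOutput("a.txt", 1, "A1"),
                new MapOutput("b.txt", 1, "B1"),
                new MapOutput("a.txt", 2, "A2"));
        System.out.println(reduceInOrder(outputs)); // {a.txt=A1A2A3, b.txt=B1B2}
    }
}
```

The point of grouping on the filename alone is that one reduce call receives every partition of a file, while the sort on the full key decides the order in which those values arrive.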
In this case"