I want to take a 20% sample of the input dataset.
I thought of 2 approaches:
1. Emit 20% of the data from each mapper (each mapper emits 20% of its input). The reducer then takes 20% of the mapper output after the shuffle and sort. (The same procedure is applied in both the map and reduce phases.)
2. Simply emit every line from the mapper, then take the 20% sample from the total data in the reducer. (All the processing is done in the reducer.)
Which is the better approach?
I would definitely go with your first option. I'm not sure why you need a reducer though. Just filter out 20% in the map phase and call it a day.
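To make that concrete, here is a minimal sketch of the map-side filter, written as plain Python of the kind you could run as a Hadoop Streaming mapper. The function name `sample_lines` and its parameters are my own, not part of any Hadoop API, and I'm assuming an approximate sample is acceptable: each line is kept independently with probability 0.2, so the output is close to 20% of the input but not exactly 20%.

```python
import random


def sample_lines(lines, fraction=0.2, rng=None):
    """Yield each input line independently with probability `fraction`.

    Hypothetical helper for illustration. The sample size is only
    approximately fraction * len(input), not an exact count.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    for line in lines:
        # random() returns a float in [0.0, 1.0), so this keeps
        # roughly `fraction` of the lines on average.
        if rng.random() < fraction:
            yield line
```

As a streaming mapper you would feed it `sys.stdin` and write each yielded line to `sys.stdout`, with the number of reducers set to zero so the job is map-only. Every line you drop in the mapper is a line that never goes through the shuffle, which is the main reason to prefer this over sampling in the reducer.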