java hadoop mapreduce sampling

Better way of sampling in Hadoop MapReduce


I want to draw a 20% sample from the input dataset.

I thought of 2 approaches:

  1. Each mapper emits 20% of its input. Then, after the shuffle and sort, the reducer takes 20% of the mapper output. (The same procedure is applied in both map and reduce.)

  2. Each mapper simply emits every line, and the reducer takes a 20% sample of the total data. (Processing is done only in the reducer.)

Which is the better approach?


Solution

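  • I would definitely go with your first option, although I'm not sure why you need a reducer at all. The second option shuffles the entire dataset to the reducer only to throw 80% of it away. Just filter out 20% in the map phase and call it a day.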
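
A minimal sketch of the map-side filter, written as plain Java so it runs without a Hadoop installation. The class name `MapSideSampler`, the `sample` helper, and the fixed seed are assumptions made for illustration, not part of the original answer. In a real job the same `rng.nextDouble() < rate` test would presumably sit inside `Mapper.map`, writing the record to the context only when it passes, with `job.setNumReduceTasks(0)` to make the job map-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class MapSideSampler {

    // Keeps each record independently with probability `rate`
    // (Bernoulli sampling). This is the per-record decision a mapper
    // would make; no reducer is involved.
    static List<String> sample(Iterable<String> records, double rate, long seed) {
        Random rng = new Random(seed); // seed is illustrative; fixed only for repeatability
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (rng.nextDouble() < rate) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the lines of an input split.
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            input.add("record-" + i);
        }
        List<String> out = sample(input, 0.20, 42L);
        System.out.printf("kept %d of %d records%n", out.size(), input.size());
    }
}
```

Because every record is kept or dropped independently, the result is about 20% of the input rather than exactly 20%. Note also that sampling 20% a second time in a reducer, as in the first approach, would leave roughly 4% of the original data.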