I'm new to Hadoop and am confused by the Mapper parameters.
Take the well-known WordCount as a sample:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private Text outputKey;
    private IntWritable outputVal;

    @Override
    public void setup(Context context) {
        // Create the output Writables once and reuse them across map() calls.
        outputKey = new Text();
        outputVal = new IntWritable(1);
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit (word, 1) for each token.
        StringTokenizer stk = new StringTokenizer(value.toString());
        while (stk.hasMoreTokens()) {
            outputKey.set(stk.nextToken());
            context.write(outputKey, outputVal);
        }
    }
}
See the map function: its parameters are Object key, Text value, and Context context. I'm confused about what the Object key looks like (you can see that key is never used in the map function).
Since the input file looks like this:
Deer
Beer
Bear
Beer
Deer
Deer
Bear
...
I know that value is each line in turn: Deer, Beer, and so on. The lines are processed one by one.
But what does key look like, and how do I decide which data type to use for it?
Everything here depends on the InputFormat class you use. It parses the input data source and hands you (key, value) pairs; different InputFormat implementations can produce different streams even from the same input source. For the default TextInputFormat (which WordCount uses), each value is one line of text and each key is a LongWritable holding the byte offset of that line within the file, which is why the mapper can declare the key as Object and simply ignore it.
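To see this concretely, here is a minimal sketch (the class name OffsetPeekMapper is my own, not part of Hadoop) that types the key explicitly as LongWritable and writes the pairs back out so you can inspect the offsets:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class OffsetPeekMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // With TextInputFormat and the sample input above, the first call
        // receives key=0 ("Deer"), the next key=5 ("Beer"), and so on:
        // each key is the starting byte offset of its line (assuming
        // one-byte line endings).
        context.write(key, value);
    }
}

Declaring the key as Object, as WordCount does, also compiles and runs, since the declared type only needs to be assignable from what the InputFormat actually produces.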
Here is an article that demonstrates the approach:
The main driver here is the RecordReader.
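To illustrate how swapping the InputFormat changes the pairs the mapper receives, here is a hedged sketch (the class names KeyValueDemo and PassThroughMapper are my own) using Hadoop's KeyValueTextInputFormat, whose RecordReader splits each line on the first tab, so the mapper sees (Text, Text) instead of (LongWritable, Text):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueDemo {

    // KeyValueTextInputFormat's RecordReader makes the key the text before
    // the first tab on each line and the value the text after it.
    public static class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "keyvalue-demo");
        job.setJarByClass(KeyValueDemo.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class); // instead of the default TextInputFormat
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only, just to inspect the (key, value) pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same input file thus yields completely different keys depending on which RecordReader the chosen InputFormat supplies.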