java, hadoop, mapreduce, cluster-computing, distributed-computing

Hadoop Mapper parameters explanation


I'm new to Hadoop and I'm confused by the Mapper parameters.

Take the well-known WordCount as an example:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private Text outputKey;
  private IntWritable outputVal;

  @Override
  public void setup(Context context) {
    // Reuse the same Writable instances across map() calls.
    outputKey = new Text();
    outputVal = new IntWritable(1);
  }

  @Override
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // Split the line into tokens and emit (word, 1) for each one.
    StringTokenizer stk = new StringTokenizer(value.toString());
    while (stk.hasMoreTokens()) {
      outputKey.set(stk.nextToken());
      context.write(outputKey, outputVal);
    }
  }
}

Look at the map function: its parameters are Object key, Text value and Context context. I'm confused about what the Object key actually contains (notice that key is never used inside map).

The input file looks like this:

Deer
Beer
Bear
Beer
Deer
Deer
Bear
...

I understand that value holds each line (Deer, Beer, and so on), and that the lines are processed one by one.

But what does key look like? And how do I decide which data type key should use?


Solution

  • Everything here depends on the InputFormat class you use. It parses the input data source and hands you (key, value) pairs. Different InputFormat implementations can produce different streams of pairs even from the same input source.

    With the default TextInputFormat, for example, the key is a LongWritable holding the byte offset at which the line starts in the file, and the value is the line itself as Text. That is why WordCount can declare the key loosely as Object and simply ignore it; a small mapper that makes those offsets visible is sketched below.

    Here is an article that demonstrates the approach:

    https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/

    The main driver here is the RecordReader; a bare-bones skeleton of that approach is also sketched below.
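
To see what the keys look like under the default TextInputFormat, you can simply emit them next to each line. This is a minimal sketch, not part of the original answer, and OffsetPeekMapper is a made-up name; with the sample input above it would emit 0 for the first Deer, 5 for Beer (four characters plus a newline), and so on.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper whose only purpose is to expose the keys
// that TextInputFormat hands to map().
class OffsetPeekMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key   = byte offset at which this line starts in the file
    // value = the content of the line itself
    context.write(new Text(value), new LongWritable(key.get()));
  }
}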
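
And here is a bare-bones sketch of the custom-RecordReader route the linked article takes. This is not the article's code; it is an assumed example (the Uppercase* names are made up) that delegates the real file reading to Hadoop's stock LineRecordReader and merely upper-cases each line, to show where a custom reader reshapes records before the mapper sees them.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: same (LongWritable, Text) pairs as TextInputFormat,
// but the value is transformed before the mapper receives it.
class UppercaseLineInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new UppercaseLineRecordReader();
  }
}

class UppercaseLineRecordReader extends RecordReader<LongWritable, Text> {
  // Delegate splitting and reading to the stock line reader.
  private final LineRecordReader lineReader = new LineRecordReader();
  private final Text currentValue = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    lineReader.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (!lineReader.nextKeyValue()) {
      return false; // no more lines in this split
    }
    // The hook: reshape the record however you need.
    currentValue.set(lineReader.getCurrentValue().toString().toUpperCase());
    return true;
  }

  @Override
  public LongWritable getCurrentKey() {
    return lineReader.getCurrentKey(); // still the line's byte offset
  }

  @Override
  public Text getCurrentValue() {
    return currentValue;
  }

  @Override
  public float getProgress() throws IOException {
    return lineReader.getProgress();
  }

  @Override
  public void close() throws IOException {
    lineReader.close();
  }
}

Wire it in from the driver with job.setInputFormatClass(UppercaseLineInputFormat.class); the mapper then declares Mapper<LongWritable, Text, ...> (or keeps Object for the key, as WordCount does).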