I am trying to learn MapReduce. I started from the WordCount example as shown in MapReduce WordCount, and when I executed the code in Eclipse, its output was the correct word count. The input file content was as follows:
Hello World Bye World
Its output was:
Bye 1
Hello 1
World 2
After that, I tested the code by replacing the space after each word in the input file with a comma.
Now I have reverted the input to the same content as before, but the word count in the output is double the expected result:
Bye 2
Hello 2
World 4
My code is as below:

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        public static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] str) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(str[0]));
        FileOutputFormat.setOutputPath(job, new Path(str[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
Can anybody also explain how the values are grouped for each word in the reduce method? It sums the values for a specific word, but where does it check that there are two counts for the same word?
Thanks
You have probably given an input folder as the input path, and that folder probably contains two files with the same content. That might be the reason for the double count.
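As for the grouping: the reducer itself never checks whether two counts belong to the same word. The framework sorts and groups the map output by key between the map and reduce phases, so reduce is called once per distinct word with all of its values. Here is a rough plain-Java sketch of that flow, without Hadoop. The class name ShuffleSketch and its methods are made up for illustration, a TreeMap stands in for the framework's sort-and-group step, and the combiner is left out. It also shows how two input files with the same content would double every count.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the map -> group-by-key -> reduce flow; not Hadoop code.
public class ShuffleSketch {

    // Map step: emit a (word, 1) pair for every token, like TokenizerMapper.
    static List<Map.Entry<String, Integer>> map(String content) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(content);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Grouping step: collect all values that share a key, sorted by key.
    // In Hadoop the framework does this; the reducer never does it itself.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values of one key, like IntSumReducer.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Run all three steps over every "file" found in the input folder.
    static Map<String, Integer> wordCount(List<String> fileContents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String content : fileContents) {
            pairs.addAll(map(content));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : group(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "Hello World Bye World";
        // One file in the input folder
        System.out.println(wordCount(Arrays.asList(content)));          // {Bye=1, Hello=1, World=2}
        // Two files with the same content in the input folder
        System.out.println(wordCount(Arrays.asList(content, content))); // {Bye=2, Hello=2, World=4}
    }
}
```

If the second output matches what you see, list the contents of the folder you pass as str[0] and remove any extra copy of the input file before running the job again.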