java, hadoop, mapreduce, word-count

reduce function in hadoop doesn't work


I am learning Hadoop. I wrote a simple program in Java. The program is supposed to count words (and create a file listing each word and the number of times it appears), but it only creates a file with all the words and the number "1" next to every word. It looks like this:

  • rmd 1
  • rmd 1
  • rmd 1
  • rmd 1
  • rmdaxsxgb 1

But I want:

  • rmd 4
  • rmdaxsxgb 1

As far as I understand, only the map function is doing any work. (I tried commenting out the reduce function and got the same result.)

My code (it is a typical example of a MapReduce program; it can easily be found on the internet or in books about Hadoop):

public class WordCount {

 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
 } 

 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, Context context) 
      throws IOException, InterruptedException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        context.write(key, new IntWritable(sum));
    }
 }


 public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

I run Hadoop on Amazon Web Services, and I don't understand why it doesn't work properly.


Solution

  • This could be because of mixing the two Hadoop APIs. There are two APIs for Hadoop: the older one in the mapred package and the newer one in mapreduce.

    In the new API, the reducer receives its values as an Iterable rather than an Iterator (the old API), as in your code. Because the signature does not match, your reduce method never overrides Reducer.reduce, so Hadoop runs the default implementation, which writes every (word, 1) pair through unchanged. That is exactly the output you are seeing.

    Try -

    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context)
                throws IOException, InterruptedException {
    
            int sum = 0;
            for (IntWritable value:values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
    
        }
    }
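
    The @Override annotation is what catches this kind of mistake: if the method signature does not match the parent class, the compiler reports an error instead of silently leaving the default implementation in place. It is worth adding to the map method as well. Below is a minimal sketch of the mapper from the question with the annotation added (only the annotation and the access modifier change; everything else mirrors your code):

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token on the line
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    With both methods correctly overriding their parents, the reducer sums the ones and the output becomes "rmd 4" instead of one line per occurrence.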