Tags: java, hadoop, mapreduce

Map -> Reduce -> Reduce (two reducers called sequentially) - how to configure the driver program


I need to write a MapReduce program that calls two reducers in succession, i.e., the output of the first reducer will be the input to the second reducer. How do I achieve this?

What I have found so far suggests that I will need to configure two MapReduce jobs in my driver code (code below).
This looks wasteful for two reasons:

  1. I don't really need a mapper in the second job.
  2. Having two jobs looks like overkill.

Is there a better way to achieve this?

Also, a question on the approach below: Job1's output will be multiple files in the OUTPUT_PATH directory, and that directory is passed in as Job2's input. Is that okay? Does the input not have to be a single file? Will Job2 process all files under the given directory?

Configuration conf = getConf();
  Job job = Job.getInstance(conf, "Job1"); /* the Job(conf, name) constructor is deprecated */
  job.setJarByClass(ChainJobs.class);

  job.setMapperClass(MyMapper1.class);
  job.setReducerClass(MyReducer1.class);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  TextInputFormat.addInputPath(job, new Path(args[0]));
  TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

  if (!job.waitForCompletion(true)) { /* blocks until Job 1 is finished; Job 2 depends on its output */
    return 1; /* bail out if Job 1 failed, otherwise Job 2 would run on missing or partial data */
  }


  /*
   * Job 2
   */
  Configuration conf2 = getConf();
  Job job2 = Job.getInstance(conf2, "Job 2");
  job2.setJarByClass(ChainJobs.class);

  job2.setMapperClass(MyMapper2.class);
  job2.setReducerClass(MyReducer2.class);

  job2.setOutputKeyClass(Text.class);
  job2.setOutputValueClass(Text.class);

  job2.setInputFormatClass(TextInputFormat.class);
  job2.setOutputFormatClass(TextOutputFormat.class);

  TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
  TextOutputFormat.setOutputPath(job2, new Path(args[1]));

  return job2.waitForCompletion(true) ? 0 : 1;

Solution

  • don't really need a mapper in the second job

    The framework does, though. Every MapReduce job runs a map phase; if you never call setMapperClass, Hadoop substitutes the identity Mapper, which simply forwards each input record to the shuffle unchanged.
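    As a sketch of what that means for the question's driver: Job 2 can drop MyMapper2 entirely and lean on the default identity Mapper. One detail to watch (an assumption about the question's unseen reducer types): with plain TextInputFormat the identity mapper would emit LongWritable byte offsets as keys, so KeyValueTextInputFormat is used here to re-parse Job 1's "key&lt;TAB&gt;value" lines back into Text/Text pairs, which means MyReducer2 must accept Text/Text input.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Job 2, mapper-free (fragment of the driver's run() method):
    Configuration conf2 = getConf();
    Job job2 = Job.getInstance(conf2, "Job 2");
    job2.setJarByClass(ChainJobs.class);

    // No setMapperClass call: Hadoop falls back to the identity Mapper,
    // which passes each record through unchanged.
    job2.setReducerClass(MyReducer2.class);

    // KeyValueTextInputFormat splits each line of Job 1's output on the
    // first tab, so the identity mapper hands MyReducer2 the same Text
    // keys Job 1 wrote (TextInputFormat would yield byte-offset keys).
    job2.setInputFormatClass(KeyValueTextInputFormat.class);
    job2.setOutputFormatClass(TextOutputFormat.class);

    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(Text.class);

    KeyValueTextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
    TextOutputFormat.setOutputPath(job2, new Path(args[1]));

    return job2.waitForCompletion(true) ? 0 : 1;
    ```

    The map phase still runs (and still pays the shuffle cost), but no custom mapper class is needed.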

    having two jobs looks like overkill... Is there a better way to achieve this?

    Then don't use MapReduce. Spark, for example, would likely be faster and need less code, since two aggregations can be chained in one program without writing intermediate files to HDFS.
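    For comparison, a sketch of the same map -> reduce -> reduce shape in Spark's Java API. The actual logic of MyMapper1/MyReducer1/MyReducer2 isn't shown in the question, so word count feeding a histogram of counts is used here purely as a stand-in:

    ```java
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ChainJobsSpark {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ChainJobs");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          // Stage 1: map + first reduce (word count as a placeholder
          // for whatever MyMapper1/MyReducer1 actually compute).
          JavaPairRDD<String, Integer> first = sc.textFile(args[0])
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum);

          // Stage 2: the second reduce consumes stage 1 directly -
          // no identity mapper, no intermediate output directory.
          JavaPairRDD<Integer, Integer> second = first
              .mapToPair(kv -> new Tuple2<>(kv._2, 1)) // placeholder re-keying
              .reduceByKey(Integer::sum);

          second.saveAsTextFile(args[1]);
        }
      }
    }
    ```

    The chaining problem disappears because each reduceByKey returns an in-memory handle the next stage can consume, instead of a directory of part files.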

    Will Job2 process all files under the given directory?

    Yes. Passing a directory to FileInputFormat is the normal pattern; every file under it becomes input, except hidden files whose names start with _ or . (so Job1's _SUCCESS marker is skipped automatically).