I need to write a Map reduce program that calls two reducers in succession. ie, output of first reducer will be the input to the second reducer. How do I achieve this?
What I have found so far suggests that I will need to configure two map reduce jobs in my driver code(code below).
This looks wasteful, for two reasons -
Is there a better way to achieve this?
Also, a question on the below approach : Job1's output would be multiple files in the OUTPUT_PATH directory. This directory is passed in as Job2's input, is this okay? Does it not have to be a file? Will Job2 process all files under the given directory?
Configuration conf = getConf();
FileSystem fs = FileSystem.get(conf);
Job job = new Job(conf, "Job1");
job.setJarByClass(ChainJobs.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReducer1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path(args[0]));
TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.waitForCompletion(true); /*this goes to next command after this job is completed. your second job is dependent on your first job.*/
/*
* Job 2
*/
Configuration conf2 = getConf();
Job job2 = new Job(conf2, "Job 2");
job2.setJarByClass(ChainJobs.class);
job2.setMapperClass(MyMapper2.class);
job2.setReducerClass(MyReducer2.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
TextOutputFormat.setOutputPath(job2, new Path(args[1]));
return job2.waitForCompletion(true) ? 0 : 1;
dont really need a maper in second job
The framework does, though
having two jobs looks like an overkill... Is there a better way to achieve this?
Then don't use MapReduce... Spark, for example would likely be faster and have less code
Will Job2 process all files under the given directory?
Yes