Tags: java, hadoop, mahout, oozie

Java code or Oozie


I'm new to Hadoop, so I'm not sure what to do in the following case. I have an algorithm that includes multiple runs of different jobs, and sometimes multiple runs of a single job (in a loop).

How should I achieve this: using Oozie, or using Java code? I was looking through the Mahout code and in ClusterIterator's iterateMR function found this:

    public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
        throws IOException, InterruptedException, ClassNotFoundException {
      ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
      Path clustersOut = null;
      int iteration = 1;
      while (iteration <= numIterations) {
        conf.set(PRIOR_PATH_KEY, priorPath.toString());

        String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
        Job job = new Job(conf, jobName);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(ClusterWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(ClusterWritable.class);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(CIMapper.class);
        job.setReducerClass(CIReducer.class);

        FileInputFormat.addInputPath(job, inPath);
        clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);
        priorPath = clustersOut;
        FileOutputFormat.setOutputPath(job, clustersOut);

        job.setJarByClass(ClusterIterator.class);
        if (!job.waitForCompletion(true)) {
          throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
        }
        ClusterClassifier.writePolicy(policy, clustersOut);
        FileSystem fs = FileSystem.get(outPath.toUri(), conf);
        iteration++;
        if (isConverged(clustersOut, conf, fs)) {
          break;
        }
      }
      Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
      FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
    }

So they have a loop in which they run MR jobs. Is this a good approach? I know that Oozie is used for DAGs and can be used with other components, such as Pig, but should I consider using it for something like this?

And if I want to run a clustering algorithm multiple times (using a specific driver), should I do that in a loop, or using Oozie?

Thanks


Solution

  • If you are looking to run MapReduce jobs only, then you can consider the following ways:

    • Chain MR jobs using the MapReduce JobControl API.

    http://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/mapreduce/lib/jobcontrol/JobControl.html
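    A minimal sketch of that approach (the per-job mapper/reducer/path configuration is elided, and the class and job names here are made up for illustration): each `Job` is wrapped in a `ControlledJob`, dependencies are declared between them, and a `JobControl` thread runs the whole graph in dependency order.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class ChainDriver {

      /** Wire two jobs so that step-2 runs only after step-1 succeeds. */
      public static JobControl buildChain(Configuration conf) throws Exception {
        Job job1 = Job.getInstance(conf, "step-1");
        // ... set mapper/reducer/input/output for job1 here ...
        ControlledJob cjob1 = new ControlledJob(conf);
        cjob1.setJob(job1);

        Job job2 = Job.getInstance(conf, "step-2");
        // ... configure job2 here ...
        ControlledJob cjob2 = new ControlledJob(conf);
        cjob2.setJob(job2);
        cjob2.addDependingJob(cjob1); // job2 waits until cjob1 succeeds

        JobControl control = new JobControl("chain");
        control.addJob(cjob1);
        control.addJob(cjob2);
        return control;
      }

      public static void main(String[] args) throws Exception {
        JobControl control = buildChain(new Configuration());
        // JobControl implements Runnable: run it on its own thread and poll
        Thread t = new Thread(control);
        t.setDaemon(true);
        t.start();
        while (!control.allFinished()) {
          Thread.sleep(500);
        }
        control.stop();
      }
    }
    ```
    
    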

    • Submit multiple MR jobs from a single driver class.

      // submit the first job and wait for it to finish
      Job job1 = new Job(getConf());
      job1.waitForCompletion(true);

      if (job1.isSuccessful()) {
        // change config, then start another job with a different Mapper
        Job job2 = new Job(getConf());

      }
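    For the iterative case (like the Mahout loop above), the same driver pattern extends naturally: resubmit a job each round, feed each round's output in as the next round's prior, and stop early on convergence. A self-contained sketch of just that control flow (`runIterationAndCheckConverged` is a hypothetical stand-in for submitting one job and checking its output):

    ```java
    import java.util.function.IntPredicate;

    public class IterativeDriver {

      /**
       * Run up to maxIterations rounds, stopping early once the
       * convergence check passes. Returns the number of rounds run.
       */
      public static int iterate(int maxIterations, IntPredicate runIterationAndCheckConverged) {
        int iteration = 1;
        while (iteration <= maxIterations) {
          // in a real driver this is where you'd build and submit the Job
          boolean converged = runIterationAndCheckConverged.test(iteration);
          iteration++;
          if (converged) {
            break;
          }
        }
        return iteration - 1;
      }

      public static void main(String[] args) {
        // pretend convergence happens at iteration 3
        int rounds = iterate(10, i -> i >= 3);
        System.out.println("ran " + rounds + " iterations"); // prints "ran 3 iterations"
      }
    }
    ```
    
    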

    If you have a complex DAG, or one involving multiple ecosystem tools like Hive or Pig, then Oozie suits well.
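
    Note, though, that Oozie workflow XML has no loop construct; the usual pattern for iteration is a decision node that either ends the workflow or re-enters it through a sub-workflow action pointing back at the same app path. A rough sketch (action names and the `${converged}` flag are hypothetical, and the map-reduce action's mapper/reducer configuration is elided):

    ```xml
    <workflow-app name="cluster-iteration" xmlns="uri:oozie:workflow:0.4">
      <start to="cluster-step"/>

      <action name="cluster-step">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <!-- mapper/reducer/input/output configuration elided -->
        </map-reduce>
        <ok to="check-convergence"/>
        <error to="fail"/>
      </action>

      <decision name="check-convergence">
        <switch>
          <!-- ${converged} would be produced by an earlier step -->
          <case to="end">${converged eq "true"}</case>
          <default to="next-iteration"/>
        </switch>
      </decision>

      <!-- recurse: the workflow invokes itself for the next iteration -->
      <action name="next-iteration">
        <sub-workflow>
          <app-path>${wf:appPath()}</app-path>
          <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
      </action>

      <kill name="fail">
        <message>Iteration failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>
    ```
    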