Tags: hadoop, completion-service

Wait for completion of several jobs in Hadoop


I need to submit several jobs that will use the same input folder but produce different results in different output folders. These jobs should run in parallel and don't depend on each other.

Is there any simple way to wait for the completion of all these jobs (like CompletionService in the java.util.concurrent package), or do I need to build it from scratch: remember the job IDs of all the jobs and check their statuses periodically?


Solution

  • If you are using the new Java MapReduce API, you can use the JobControl class to schedule multiple ControlledJob instances, with or without dependencies between them. You wrap each of your Job objects in a ControlledJob and call ControlledJob.addDependingJob(ControlledJob dependingJob) to register any dependencies; jobs with no dependencies simply run in parallel. For instance, if jobC cannot run until both jobA and jobB have completed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
    
    Configuration conf = new Configuration();
    ControlledJob jobA = new ControlledJob(Job.getInstance(conf), null);
    ControlledJob jobB = new ControlledJob(Job.getInstance(conf), null);
    
    // jobC may not start until jobA and jobB have completed successfully
    ControlledJob jobC = new ControlledJob(Job.getInstance(conf), null);
    jobC.addDependingJob(jobA);
    jobC.addDependingJob(jobB);
    
    JobControl jobControl = new JobControl("my-job-group");
    jobControl.addJob(jobA);
    jobControl.addJob(jobB);
    jobControl.addJob(jobC);
    
    // run() keeps polling until stop() is called, so start it on its own
    // thread and wait for allFinished() instead of calling run() directly
    new Thread(jobControl).start();
    while (!jobControl.allFinished()) {
        Thread.sleep(500);
    }
    jobControl.stop();
    

    The JobControl object will then ensure that a job does not run until the jobs it depends on have completed.
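
    For the case in the question, where the jobs are independent, you just skip addDependingJob and every job becomes eligible to run at once. A minimal sketch, assuming hypothetical mapper/reducer classes and placeholder paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    JobControl jobControl = new JobControl("parallel-jobs");
    // "/shared/input" and "/results/..." are placeholder paths
    for (String output : new String[] { "/results/jobA", "/results/jobB" }) {
        Job job = Job.getInstance(new Configuration());
        // job.setMapperClass(...); job.setReducerClass(...); etc. as usual
        FileInputFormat.addInputPath(job, new Path("/shared/input")); // same input
        FileOutputFormat.setOutputPath(job, new Path(output));        // own output
        jobControl.addJob(new ControlledJob(job, null)); // no dependencies
    }

    You then start and wait for the JobControl exactly as in the snippet above.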

    The jobs themselves are configured separately, just as you would configure a single job, which makes it simple to give them a shared input path and separate output paths.
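
    Alternatively, if you would rather not use JobControl, the roll-your-own approach from the question is also only a few lines with the new API: submit each Job without blocking, then poll until all are complete. A rough sketch (the job list and sleep interval are placeholders):

    import java.util.List;
    import org.apache.hadoop.mapreduce.Job;
    
    List<Job> jobs = ...; // your already-configured jobs
    for (Job job : jobs) {
        job.submit(); // returns immediately, unlike waitForCompletion()
    }
    boolean allDone = false;
    while (!allDone) {
        Thread.sleep(1000); // arbitrary polling interval
        allDone = true;
        for (Job job : jobs) {
            allDone &= job.isComplete();
        }
    }

    Afterwards you can call job.isSuccessful() on each job to check whether it actually produced its output.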