Tags: hadoop, mapreduce, amazon-emr, oozie, apache-crunch

Apache Crunch Job On AWS EMR using Oozie


Context:

  • I want to run an Apache Crunch job on AWS EMR.
  • The job is part of a pipeline of Oozie Java actions and Oozie subworkflows (this particular job lives in a subworkflow). We have a main workflow that contains multiple Java actions (each specifying the entry point, i.e. the Java class it should run) and one subworkflow, which itself contains multiple Java actions. One of those Java actions runs our job driver class.
  • The job reads some data from HDFS into a PCollection, does some processing on it, and writes the resulting PCollection back to HDFS.

Issue:

  • The PCollection is never written to HDFS: although we call PCollection.write, no output files appear in the target directory, yet the _SUCCESS marker file is generated there.

Things I've tried:

  • I created a dummy method that builds a PCollection from three strings, maps over it (converting all letters to uppercase), and writes the resulting PCollection to HDFS (a minimal sketch of this test follows below). I ran this code both from the subworkflow Java action (the one containing the job that fails to write) and from a top-level Java action (by top-level Java action I mean an action directly under the main workflow). From the subworkflow I get the same issue: the result is not written to HDFS. From the top-level Java action the dummy output is written and I can see the file on HDFS.
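For reference, a minimal sketch of this kind of test, assuming Crunch's MRPipeline and hypothetical HDFS paths (the original test built the collection from three in-memory strings; this sketch reads a small input file instead to stick to the basic Pipeline API):

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class DummyWriteTest {
        public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(DummyWriteTest.class, new Configuration());

            // Read a small text input from HDFS (path is hypothetical).
            PCollection<String> lines = pipeline.readTextFile("/tmp/dummy-input");

            // Map every line to uppercase.
            PCollection<String> upper = lines.parallelDo(new MapFn<String, String>() {
                @Override
                public String map(String line) {
                    return line.toUpperCase();
                }
            }, Writables.strings());

            // Write the result back to HDFS; in the failing case a _SUCCESS
            // marker appears here even though the data files are missing.
            pipeline.writeTextFile(upper, "/tmp/dummy-output");
            pipeline.done();
        }
    }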

Example of the Oozie pipeline (a skeletal workflow.xml sketch follows the list):

  • Main Workflow
      • Java_Action_1 (points to a Java class that is run)
      • Java_Action_2 (points to a Java class that is run)
      • Java_Action_3 (points to a Java class that is run)
      • Subworkflow_1 (has a fork and join step, as seen in the Oozie UI)
          • Java_Action_1_in_subworkflow (points to a Java class that is run) -> the job that is not writing to HDFS
          • Java_Action_2_in_subworkflow (points to a Java class that is run)
      • Java_Action_4 (points to a Java class that is run)
      • Java_Action_5 (points to a Java class that is run)
      • etc.
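For context, a skeletal workflow.xml for the subworkflow part might look like the following; the action names, main classes, and the ${jobTracker}/${nameNode} properties are all hypothetical placeholders:

    <workflow-app name="subworkflow_1" xmlns="uri:oozie:workflow:0.5">
        <start to="fork_actions"/>
        <fork name="fork_actions">
            <path start="java_action_1_in_subworkflow"/>
            <path start="java_action_2_in_subworkflow"/>
        </fork>
        <action name="java_action_1_in_subworkflow">
            <java>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <!-- Driver class of the Crunch job that fails to write -->
                <main-class>com.example.CrunchJobDriver</main-class>
            </java>
            <ok to="join_actions"/>
            <error to="fail"/>
        </action>
        <action name="java_action_2_in_subworkflow">
            <java>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <main-class>com.example.OtherJobDriver</main-class>
            </java>
            <ok to="join_actions"/>
            <error to="fail"/>
        </action>
        <join name="join_actions" to="end"/>
        <kill name="fail">
            <message>Subworkflow action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>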

Solution

  • The issue was with the fs.defaultFS Hadoop property. We were using viewfs, so the output paths handed to Apache Crunch were resolved with a viewfs:// prefix, and because of this the job was not able to write to HDFS. The fix was to set the default filesystem to hdfs:// for the writing phase. The input is read from an S3 bucket that is mounted as /folder_name via viewfs, so for the reading phase the file paths still had to be prefixed with viewfs:// (see the sketch below).
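A minimal sketch of what the fix looks like in a Crunch driver, assuming a hypothetical namenode address and paths: the default filesystem is pointed at HDFS for the write, while the read path keeps the viewfs:// prefix of the S3 mount:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class CrunchJobDriver {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Point the default filesystem at HDFS for the writing phase
            // (the cluster-wide default was viewfs). Namenode address is hypothetical.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            Pipeline pipeline = new MRPipeline(CrunchJobDriver.class, conf);

            // Reads go through the viewfs mount that fronts the S3 bucket,
            // so the input path keeps its viewfs:// prefix (mount name is hypothetical).
            PCollection<String> input =
                    pipeline.readTextFile("viewfs://cluster/folder_name/input");

            // ... processing steps go here ...

            // The relative output path now resolves against hdfs://, so the
            // data files actually show up next to the _SUCCESS marker.
            pipeline.writeTextFile(input, "/user/hadoop/output");
            pipeline.done();
        }
    }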