Tags: hadoop, mapreduce, amazon-emr, oozie, apache-crunch

Apache Crunch Job On AWS EMR using Oozie


Context:

  • I want to run an Apache Crunch job on AWS EMR.
  • The job is part of a pipeline of Oozie Java actions and Oozie subworkflows (this particular job lives in a subworkflow). We have a main workflow that contains multiple Java actions (each specifying the entry point, i.e. the Java class it should run) and one subworkflow, which itself contains multiple Java actions. One of those Java actions runs our job driver class.
  • The job reads some data from HDFS into a PCollection, does some processing on it, and writes the resulting PCollection back to HDFS.

Issue:

  • The PCollection is never written to HDFS: although we call PCollection.write, no output files appear in the target directory, yet the _SUCCESS marker file is generated there.

Things I've tried:

  • I created a dummy method that builds a PCollection from three strings, maps over it (converting all letters to uppercase), and writes the resulting PCollection to HDFS (a minimal sketch of this test follows below). I ran this code both from the subworkflow Java action (the one containing the job that fails to write) and from a top-level Java action (by top-level Java action I mean an action directly under the main workflow). From the subworkflow I get the same issue: the result is not written to HDFS. From the top-level Java action the dummy output is written and I can see the file on HDFS.
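For reference, a minimal sketch of this kind of test, assuming Crunch's MRPipeline and hypothetical HDFS paths (the original test built the collection from three in-memory strings; this sketch reads a small input file instead to stick to the basic Pipeline API):

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class DummyWriteTest {
        public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(DummyWriteTest.class, new Configuration());

            // Read a small text input from HDFS (path is hypothetical).
            PCollection<String> lines = pipeline.readTextFile("/tmp/dummy-input");

            // Map every line to uppercase.
            PCollection<String> upper = lines.parallelDo(new MapFn<String, String>() {
                @Override
                public String map(String line) {
                    return line.toUpperCase();
                }
            }, Writables.strings());

            // Write the result back to HDFS; in the failing case a _SUCCESS
            // marker appears here even though the data files are missing.
            pipeline.writeTextFile(upper, "/tmp/dummy-output");
            pipeline.done();
        }
    }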

Example of the Oozie pipeline (a skeletal workflow.xml sketch follows the list):

  • Main Workflow
      • Java_Action_1 (points to a Java class that is run)
      • Java_Action_2 (points to a Java class that is run)
      • Java_Action_3 (points to a Java class that is run)
      • Subworkflow_1 (has a fork and join step, as seen in the Oozie UI)
          • Java_Action_1_in_subworkflow (points to a Java class that is run) -> the job that is not writing to HDFS
          • Java_Action_2_in_subworkflow (points to a Java class that is run)
      • Java_Action_4 (points to a Java class that is run)
      • Java_Action_5 (points to a Java class that is run)
      • etc.
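For context, a skeletal workflow.xml for the subworkflow part might look like the following; the action names, main classes, and the ${jobTracker}/${nameNode} properties are all hypothetical placeholders:

    <workflow-app name="subworkflow_1" xmlns="uri:oozie:workflow:0.5">
        <start to="fork_actions"/>
        <fork name="fork_actions">
            <path start="java_action_1_in_subworkflow"/>
            <path start="java_action_2_in_subworkflow"/>
        </fork>
        <action name="java_action_1_in_subworkflow">
            <java>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <!-- Driver class of the Crunch job that fails to write -->
                <main-class>com.example.CrunchJobDriver</main-class>
            </java>
            <ok to="join_actions"/>
            <error to="fail"/>
        </action>
        <action name="java_action_2_in_subworkflow">
            <java>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <main-class>com.example.OtherJobDriver</main-class>
            </java>
            <ok to="join_actions"/>
            <error to="fail"/>
        </action>
        <join name="join_actions" to="end"/>
        <kill name="fail">
            <message>Subworkflow action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>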

Solution

  • The issue was with the fs.defaultFS Hadoop property. We were using viewfs, so the output paths handed to Apache Crunch were resolved with a viewfs:// prefix, and because of this the job was not able to write to HDFS. The fix was to set the default filesystem to hdfs:// for the writing phase. The input is read from an S3 bucket that is mounted as /folder_name via viewfs, so for the reading phase the file paths still had to be prefixed with viewfs:// (see the sketch below).
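A minimal sketch of what the fix looks like in a Crunch driver, assuming a hypothetical namenode address and paths: the default filesystem is pointed at HDFS for the write, while the read path keeps the viewfs:// prefix of the S3 mount:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class CrunchJobDriver {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Point the default filesystem at HDFS for the writing phase
            // (the cluster-wide default was viewfs). Namenode address is hypothetical.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            Pipeline pipeline = new MRPipeline(CrunchJobDriver.class, conf);

            // Reads go through the viewfs mount that fronts the S3 bucket,
            // so the input path keeps its viewfs:// prefix (mount name is hypothetical).
            PCollection<String> input =
                    pipeline.readTextFile("viewfs://cluster/folder_name/input");

            // ... processing steps go here ...

            // The relative output path now resolves against hdfs://, so the
            // data files actually show up next to the _SUCCESS marker.
            pipeline.writeTextFile(input, "/user/hadoop/output");
            pipeline.done();
        }
    }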