Tags: python, hadoop, hadoop-yarn, emr, amazon-emr

The Python job I run on the master of an EMR cluster fails; how do I troubleshoot it?


I ssh to the master and run my Hadoop job from the console for development purposes. My job fails in a mysterious way, with many Java stack traces that make no sense to me, see below:

    java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:120)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)

Solution

  • Look at the logs for an error in your Python code. Exit code 143 means the streaming subprocess was terminated with SIGTERM (128 + 15), which typically happens when your Python process crashes or the container is killed by YARN. On EMR/YARN you can find the logs in the web UI, or from a shell on the cluster master as shown below (your application ID will differ; it is printed when the job starts). There is a lot of output, so redirect it into a file and search it for Python stack traces to see what went wrong in your app. Java stack traces like the one above usually only indicate that at least one map or reduce process failed; the stderr of the failed process is not shown in the CLI/shell output.

    $ yarn logs -applicationId application_1503951120983_0031 > /tmp/log
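Once the log is dumped to a file, you can search it for Python tracebacks. A minimal sketch (the sample log content and path `/tmp/sample_yarn.log` are hypothetical, standing in for the real dump above):

    # Hypothetical stand-in for the aggregated YARN log dumped above
    cat > /tmp/sample_yarn.log <<'EOF'
    ...container stdout/stderr output...
    Traceback (most recent call last):
      File "mapper.py", line 7, in <module>
        value = int(fields[1])
    IndexError: list index out of range
    EOF

    # The aggregated log is often huge; show each traceback with some context
    grep -n -B 1 -A 4 'Traceback' /tmp/sample_yarn.log

The same `grep` against the real `/tmp/log` file jumps you straight to the failing line of your Python code.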
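On the job side, you can make failures easier to find in those logs by having your streaming script print its own traceback to stderr before exiting, since container stderr is what ends up in the aggregated YARN log. A minimal sketch of such a mapper (the tab-separated key/value format and the `run` helper are hypothetical, not from the original job):

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper sketch (hypothetical job logic).
    import sys
    import traceback

    def run(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
        for line in stdin:
            try:
                # Assumed record format: key<TAB>value
                key, value = line.rstrip("\n").split("\t", 1)
                stdout.write("%s\t%s\n" % (key, value.upper()))
            except Exception:
                # Dump the traceback to stderr so it shows up in the YARN
                # container logs, then re-raise: the non-zero exit makes
                # Hadoop fail the task instead of silently dropping records.
                traceback.print_exc(file=stderr)
                raise

    if __name__ == "__main__":
        run()

With this pattern, `grep Traceback` on the dumped log finds your actual Python error rather than only the Java-side `PipeMapRed` failure.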