Tags: python, amazon-ec2, apache-spark, emr, pyspark

How to run a PySpark job (with custom modules) on Amazon EMR?


I want to run a PySpark program that runs perfectly well on my (local) machine.

I have an Amazon Elastic Map Reduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).

Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day now, to no avail. The best command I have found so far is:

/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
    --py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py 
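
To give an idea of the structure, the two files look roughly like this (a simplified sketch only; the file names are the ones from the command above, the contents are not my actual code):

    # custom_module.py (hypothetical contents)
    def double(x):
        return 2 * x

    # pyspark_program.py (hypothetical contents): imports the custom module
    from pyspark import SparkContext
    from custom_module import double

    sc = SparkContext(appName="CustomModuleExample")
    print(sc.parallelize([1, 2, 3]).map(double).collect())
    sc.stop()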

However, Python fails because it cannot find custom_module.py, even though YARN does seem to copy it:

INFO yarn.Client: Uploading resource s3://bucket/custom_module.py -> hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py

INFO s3n.S3NativeFileSystem: Opening 's3://bucket/custom_module.py' for reading

This looks like an awfully basic question, but the web is surprisingly quiet on it, including the official documentation (the Spark documentation seems to imply that the command above should work).


Solution

  • This is a bug in Spark 1.3.0.

    The workaround is to define SPARK_HOME for YARN, even though this should be unnecessary:

    spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
                   --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
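
    Putting the workaround together with the command from the question gives something like the following (the paths and bucket names are the placeholders used above; adjust them to your setup):

    /home/hadoop/spark/bin/spark-submit --master yarn-cluster \
        --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
        --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark \
        --py-files s3://bucket/custom_module.py \
        s3://bucket/pyspark_program.py

    With SPARK_HOME set for both the application master and the executors, the staged custom_module.py is picked up and the import succeeds.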