How to run a PySpark job (with custom modules) on Amazon EMR?

I want to run a PySpark program that runs perfectly well on my (local) machine.

I have an Amazon Elastic MapReduce (EMR) cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).

Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for about half a day now, to no avail. The best command I have found so far is:

/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
    --py-files s3://bucket/ s3://bucket/ 

However, Python fails because it does not find the custom module. Spark does seem to try to copy it, though:

INFO yarn.Client: Uploading resource s3://bucket/ -> hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/

INFO s3n.S3NativeFileSystem: Opening 's3://bucket/' for reading

This looks like an awfully basic question, but the web is nearly silent on it, including the official documentation (the Spark documentation seems to imply that the command above should work).


  • This is a bug in Spark 1.3.0.

    The workaround is to define SPARK_HOME for YARN, even though this should be unnecessary:

    spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
                   --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
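
If you launch the job from a script, the workaround above can be applied in one place. Here is a minimal sketch in Python that only assembles the argument list, so it can be inspected before anything is run. The function name `build_spark_submit_args` and the S3 file names in the usage part are hypothetical placeholders, not names from the question. It also assumes that `--py-files` takes a comma-separated list, as the Spark documentation describes.

```python
import shlex


def build_spark_submit_args(spark_home, py_files, main_script, master="yarn-cluster"):
    """Return a spark-submit argument list with the SPARK_HOME workaround applied.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    args = [
        f"{spark_home}/bin/spark-submit",
        "--master", master,
        # Workaround from the answer: expose SPARK_HOME to the YARN
        # application master and to the executors.
        "--conf", f"spark.yarn.appMasterEnv.SPARK_HOME={spark_home}",
        "--conf", f"spark.executorEnv.SPARK_HOME={spark_home}",
    ]
    if py_files:
        # Assumption: --py-files accepts a comma-separated list of files.
        args += ["--py-files", ",".join(py_files)]
    # The main script goes last, after all options.
    args.append(main_script)
    return args


if __name__ == "__main__":
    # The two S3 paths below are made-up examples; substitute your own.
    command = build_spark_submit_args(
        "/home/hadoop/spark",
        ["s3://bucket/my_modules.zip"],
        "s3://bucket/my_job.py",
    )
    # Print a shell-safe version of the command for review.
    print(" ".join(shlex.quote(part) for part in command))
```

Once the printed command looks right, the same list can be passed to `subprocess.run(command, check=True)` on the master node. Keeping the command as a list avoids shell-quoting problems with the `--conf` values.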