apache-spark, amazon-emr, livy

Cannot send Python dependencies to Spark on EMR via Livy


I searched and read a lot of articles online and finally found a possible solution in:

I can't seem to get --py-files on Spark to work

I followed the most-voted answer in that link, but it is not working for me. That topic is also three years old, so I think it is worth asking a new question.

Following that answer, here is what I did:

  1. Build my Python dependencies into a .zip:
pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

Then upload it, along with my other Python files, to my S3 bucket.
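As a quick local sanity check (just a sketch, not part of the job), the archive can be listed to confirm the package directories sit at its top level, which is where a PYTHONPATH entry pointing at the zip would look:

# list_zip_top_level.py - sketch: list top-level entries of dependencies.zip
import zipfile

with zipfile.ZipFile("dependencies.zip") as zf:
    top_level = sorted({name.split("/")[0] for name in zf.namelist()})
    print(top_level)  # expect 'pandas', 'sqlalchemy', 'boto3', plus their dependencies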

  2. In my_job.py, add the following line to put dependencies.zip on the PYTHONPATH:
sc.addPyFile("dependencies.zip")

Note that this line is at the very beginning of my Python code.
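Roughly, this is how the call sits at the top of the job (a sketch; I am assuming sc is obtained from the already-running SparkContext):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()     # sc from the running Spark application
sc.addPyFile("dependencies.zip")    # intended to put the zip on the PYTHONPATH

import pandas                       # imports from the zip only happen after this point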

  3. Submit my Spark job. The linked answer uses spark-submit, but in my case I submit with curl through Apache Livy after SSHing into my AWS EMR master node:
curl -X POST --data '{"file": "s3://my-bucket/src/my_job.py", "pyFiles": ["s3://my-bucket/src/global_settings.py", "s3://my-bucket/src/sys_init.py", "s3://my-bucket/src/dependencies.zip"]}' -H "Content-Type: application/json" http://ec2-1-23-456-789.compute-1.amazonaws.com:8998/batches
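For completeness, the same batch submission can be written in Python with requests (just a sketch; the endpoint and S3 paths are the ones from the curl call above):

import json
import requests

livy_url = "http://ec2-1-23-456-789.compute-1.amazonaws.com:8998/batches"
payload = {
    "file": "s3://my-bucket/src/my_job.py",
    "pyFiles": [
        "s3://my-bucket/src/global_settings.py",
        "s3://my-bucket/src/sys_init.py",
        "s3://my-bucket/src/dependencies.zip",
    ],
}
resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.json())  # Livy replies with the new batch's id and state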

Now, in the Livy log, I get a Java exception:

java.io.FileNotFoundException: File file:/dependencies.zip does not exist

The full error trace is like this:

Traceback (most recent call last):
  File "/mnt/tmp/spark-bfbaa559-69b8-48cb-9287-c30f73ab6d4a/my_job.py", line 1, in <module>
    from global_settings import *
  File "/mnt/tmp/spark-bfbaa559-69b8-48cb-9287-c30f73ab6d4a/global_settings.py", line 28, in <module>
    sc.addFile('dependencies.zip')
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 898, in addFile
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o53.addFile.
: java.io.FileNotFoundException: File file:/dependencies.zip does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1544)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I have no idea why Spark cannot find dependencies.zip while it can find the other two Python files, global_settings.py and sys_init.py.

Also note that dependencies.zip is 48 MB. Does the file size matter? I don't think reading a 48 MB file from S3 should be a problem for AWS EMR; they are in the same region and under the same VPC.

And my requirements.txt:

pandas
sqlalchemy
boto3

Please help. Thanks.


Solution

  • The files, zips, and eggs listed under pyFiles in the curl call set the Spark config spark.submit.pyFiles. Spark takes care of downloading them and adding the files/zips to the PYTHONPATH.
    You don't need to add the files again with sc.addPyFile(<>). (That call looks for a file named dependencies.zip on the default filesystem, which in your case is file://, so Spark cannot find it.)
    Remove the addPyFile call and instead try importing something from the zip to confirm it is on the PYTHONPATH.
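A minimal sketch of my_job.py after that change: no addPyFile/addFile calls at all; we simply import from the packages shipped inside dependencies.zip (module names taken from the requirements.txt above) to confirm the zip is on the PYTHONPATH.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_job").getOrCreate()

# dependencies.zip was shipped via the "pyFiles" entry of the Livy batch request,
# so these imports should resolve without any addPyFile call.
import pandas
import sqlalchemy
import boto3

print(pandas.__version__, sqlalchemy.__version__, boto3.__version__)

spark.stop()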