Tags: python, python-3.x, pyspark, amazon-emr

Problem importing modules from a .zip file (created in Python using the zipfile package) with --py-files on EMR in Spark


I am trying to zip my application's modules from my test file so that I can spark-submit them on an EMR cluster, like this:

Folder structure of modules:

app
--- module1
------ test.py
------ test2.py
--- module2
------ file1.py
------ file2.py

Zip function I'm calling from my tests:

import zipfile
import os

def zip_deps():
    # make zip

    module1_path = '../module1'
    module2_path = '../module2'
    try:
        with zipfile.ZipFile('deps.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
            info = zipfile.ZipInfo(module1_path +'/')
            zipf.writestr(info, '')
            for root, dirs, files in os.walk(module1_path):
                for d in dirs:
                    info = zipfile.ZipInfo(os.path.join(root, d)+'/')
                    zipf.writestr(info, '')
                for file in files:
                    zipf.write(os.path.join(root, file),os.path.relpath(os.path.join(root, file)))

            info = zipfile.ZipInfo(module2_path +'/')
            zipf.writestr(info, '')
            for root, dirs, files in os.walk(module2_path):
                for d in dirs:
                    info = zipfile.ZipInfo(os.path.join(root, d)+'/')
                    zipf.writestr(info, '')
                for file in files:
                    zipf.write(os.path.join(root, file),os.path.relpath(os.path.join(root, file)))
    except Exception as exc:
        # the with block already closes the archive, so no explicit close() is needed
        print('Unexpected error occurred while creating file deps.zip:', exc)

The deps.zip is created correctly as far as I can see: it contains all the files I want, and each module folder appears at the base level of the zip. In fact, a zip created with zip -r deps.zip module1 module2 appears to have the same structure, and that one works when I spark-submit it with:

spark-submit --py-files deps.zip driver.py 

Error from EMR when I use the zip created in Python:

Traceback (most recent call last):
  File "driver.py", line 6, in <module>
    from module1.test import test_function
ModuleNotFoundError: No module named 'module1'

FWIW, I also tried zipping by shelling out from Python with the following commands, and I got the same error on EMR:

os.system("zip -r9 deps.zip ../module1")
os.system("zip -r9 deps.zip ../module2")

I don't know why a zip file created in python would be different than outside of python, but I've spent the last few days on this and hopefully someone can help!

Thanks!!


Solution

  • It turns out it was something fairly simple...

    zipfile was storing each file name together with its relative directory, such as:

    ../module1/test.py
    

    Spark is expecting the folders to be at the top level of the zip, without that relative path, like:

    module1/test.py
    

    I just had to change my write to be like this:

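    # file_paths is assumed to hold the files found under ../module1 and ../module2
    with zipfile.ZipFile('deps.zip', 'w') as zipf:
        for file in file_paths:
            # store the name relative to the parent directory, e.g. module1/test.py
            zipf.write(file, os.path.relpath(file, '..'))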
    

    If you extract the original zip file, you never see the ../ in front of the names, which is why the archive looked fine. Shrug