I am trying to zip my application's modules from within a test file so that I can spark-submit them on an EMR cluster, like this:
Folder structure of modules:
app
--- module1
------ test.py
------ test2.py
--- module2
------ file1.py
------ file2.py
Zip function I'm calling from my tests:
import zipfile
import os

def zip_deps():
    # make zip
    module1_path = '../module1'
    module2_path = '../module2'
    try:
        with zipfile.ZipFile('deps.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
            info = zipfile.ZipInfo(module1_path + '/')
            zipf.writestr(info, '')
            for root, dirs, files in os.walk(module1_path):
                for d in dirs:
                    info = zipfile.ZipInfo(os.path.join(root, d) + '/')
                    zipf.writestr(info, '')
                for file in files:
                    zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file)))
            info = zipfile.ZipInfo(module2_path + '/')
            zipf.writestr(info, '')
            for root, dirs, files in os.walk(module2_path):
                for d in dirs:
                    info = zipfile.ZipInfo(os.path.join(root, d) + '/')
                    zipf.writestr(info, '')
                for file in files:
                    zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file)))
    except:
        print('Unexpected error occurred while creating file deps.zip')
    zipf.close()
As far as I can tell, deps.zip is created correctly: it contains all the files I want, and each module folder sits at the top level of the zip.
In fact, a zip created on the command line with:
zip -r deps.zip module1 module2
has the same structure, and THAT one works when I spark-submit it with:
spark-submit --py-files deps.zip driver.py
Error from EMR when I submit the zip built in Python:
Traceback (most recent call last):
  File "driver.py", line 6, in <module>
    from module1.test import test_function
ModuleNotFoundError: No module named 'module1'
FWIW, I also tried zipping in a subprocess with the following commands, and I got the same error from Spark on EMR:
os.system("zip -r9 deps.zip ../module1")
os.system("zip -r9 deps.zip ../module2")
I don't know why a zip file created in Python would be different from one created outside of Python, but I've spent the last few days on this, so hopefully someone can help!
Thanks!!
It turns out it was something fairly simple...
zipfile was storing each entry name with its relative directory prefix, such as:
../module1/test.py
Spark is expecting the module folders to be at the top level of the zip, without that relative prefix, like:
module1/test.py
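I just had to change my write call to this:
# file_paths is the list of files to archive (e.g. '../module1/test.py'), collected beforehand
with zipfile.ZipFile('deps.zip', 'w') as zipf:
    for file in file_paths:
        # arcname is computed relative to the parent directory, which drops the leading '../'
        zipf.write(file, os.path.relpath(file, '..'))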
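For completeness, here is a minimal sketch of how the whole zip step could look with that fix folded in. The function name zip_modules and its parameters are my own for illustration, not from the original code, and it assumes the modules live directly under one parent directory:

```python
import os
import zipfile


def zip_modules(parent_dir, module_names, zip_path="deps.zip"):
    """Zip each module under parent_dir so entry names start at the module name.

    Hypothetical helper: parent_dir would be '..' for the layout in the question.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for name in module_names:
            module_dir = os.path.join(parent_dir, name)
            for root, _dirs, files in os.walk(module_dir):
                for filename in files:
                    full_path = os.path.join(root, filename)
                    # Store 'module1/test.py' rather than '../module1/test.py'.
                    arcname = os.path.relpath(full_path, parent_dir)
                    zipf.write(full_path, arcname)
    return zip_path
```

From the test directory this would be called as zip_modules('..', ['module1', 'module2']), and the resulting deps.zip passed to spark-submit --py-files as before.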
If you extract the original zip file, you'd never see the names with the ../ in front. Shrug.
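Extracting hides the problem because Python's zipfile (per its docs) removes '..' components from member names on extraction. Listing the stored names directly would have shown it. A rough sketch of such a check, where names_with_parent_refs is just a name I made up:

```python
import zipfile


def names_with_parent_refs(zip_path):
    """Return the stored entry names that contain a '..' path component."""
    with zipfile.ZipFile(zip_path) as zipf:
        # namelist() returns names exactly as stored in the archive,
        # before any cleanup that extraction would apply.
        return [name for name in zipf.namelist() if ".." in name.split("/")]
```

Running names_with_parent_refs('deps.zip') on the original archive should list entries like ../module1/test.py, while an archive with top-level module folders should give an empty list.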