I am submitting my Spark job using the spark-submit CLI with --py-files (a wheel file) as an argument. I want to list all the packages included in the wheel file, on either the driver or the executor side. How can I find that? I tried spark.sparkContext._jsc.sc().listJars(), but it returns only the Java jars, not the Python packages.
You can list the .whl files submitted through --py-files by looking in the SparkFiles root directory. Here's a minimal example. Assuming this is your spark-submit command:
spark-submit \
--master "local[4]" \
--py-files "/Users/lol/code/pyfilestest/whl-0.0.4-py2.py3-none-any.whl,\
/Users/lol/code/pyfilestest/dir2path-0.1.0-py3-none-any.whl" \
list_wheels.py
(In this example I have added dir2path-0.1.0-py3-none-any.whl and whl-0.0.4-py2.py3-none-any.whl.)
The script list_wheels.py, run via spark-submit, prints the list of .whl files submitted through --py-files, so you can confirm which packages have been uploaded.
from pyspark import SparkContext, SparkFiles
import os
import glob
if __name__ == '__main__':
    sc = SparkContext()
    # List all wheel files in SparkFiles root directory
    sparkfiles_dir = SparkFiles.getRootDirectory()
    wheel_files_with_path = glob.glob(os.path.join(sparkfiles_dir, '*.whl'))
    # Get only the file names
    wheel_files = [os.path.basename(file) for file in wheel_files_with_path]
    print("wheel_files:", wheel_files)
End result: a lot of verbose logging, and at the end:
wheel_files: ['dir2path-0.1.0-py3-none-any.whl', 'whl-0.0.4-py2.py3-none-any.whl']
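The script above only inspects the driver. Since the question also mentions the executor side, here is a rough sketch of the same check run inside tasks. The helper names list_wheel_names and wheels_seen_by_executors are made up for illustration, and whether the wheels actually show up under the SparkFiles root on executors is an assumption that may depend on your cluster manager and deploy mode (it was only described above for local mode).

```python
import glob
import os


def list_wheel_names(directory):
    """Return the sorted base names of all .whl files directly inside `directory`."""
    pattern = os.path.join(directory, "*.whl")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def wheels_seen_by_executors(sc, num_tasks=4):
    """Run `num_tasks` small tasks and return distinct (hostname, wheel names) pairs.

    `sc` is an existing SparkContext. The tasks are not guaranteed to land on
    every executor, so treat the result as a spot check (hypothetical helper).
    """
    def probe(_partition):
        # Imported inside the task so the lookup happens on the worker.
        import socket
        from pyspark import SparkFiles
        root = SparkFiles.getRootDirectory()
        # Tuples are hashable, so duplicates can be dropped on the driver.
        yield socket.gethostname(), tuple(list_wheel_names(root))

    results = sc.parallelize(range(num_tasks), num_tasks).mapPartitions(probe).collect()
    return sorted(set(results))
```

Inside list_wheels.py you could call print(wheels_seen_by_executors(sc)) after creating the SparkContext. If the tuples come back empty on a real cluster, the wheels are probably being shipped to a different location than the SparkFiles root.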
NB: If you also want to see what is inside the wheel files, you can change your script to:
import os
import glob
from zipfile import ZipFile
from pyspark import SparkContext, SparkFiles
if __name__ == '__main__':
    sc = SparkContext()
    # List all wheel files in SparkFiles root directory
    sparkfiles_dir = SparkFiles.getRootDirectory()
    wheel_files_with_path = glob.glob(os.path.join(sparkfiles_dir, '*.whl'))
    for wheel_file in wheel_files_with_path:
        print(f"Listing contents of {os.path.basename(wheel_file)}:")
        # Open the wheel file as a ZIP archive and list its contents
        with ZipFile(wheel_file, 'r') as zip_ref:
            for filename in zip_ref.namelist():
                print(f" - {filename}")
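If what you really want is the package names rather than every file in the archive, you can read the metadata that each wheel carries. The sketch below uses only the standard library; describe_wheel is a hypothetical helper name. It assumes the wheel follows the standard layout with a `<name>-<version>.dist-info/METADATA` file. The top_level.txt file is written by setuptools and may be missing for wheels built with other tools, in which case the package list comes back empty.

```python
from email.parser import Parser
from zipfile import ZipFile


def describe_wheel(wheel_path):
    """Return the distribution name, version, and top-level import names of a wheel."""
    with ZipFile(wheel_path) as archive:
        names = archive.namelist()
        # METADATA uses an email-header-like format, so email.parser can read it.
        metadata_name = next(
            n for n in names
            if n.count("/") == 1 and n.endswith(".dist-info/METADATA")
        )
        metadata = Parser().parsestr(archive.read(metadata_name).decode("utf-8"))
        # top_level.txt is optional; fall back to an empty list when it is absent.
        top_level = [
            n for n in names
            if n.count("/") == 1 and n.endswith(".dist-info/top_level.txt")
        ]
        packages = archive.read(top_level[0]).decode("utf-8").split() if top_level else []
    return {"name": metadata["Name"], "version": metadata["Version"], "packages": packages}
```

In the script above, calling print(describe_wheel(wheel_file)) inside the loop over wheel_files_with_path would print one summary per submitted wheel.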