apache-spark, pyspark, python-packaging

List Python Packages included in a wheel in a PySpark Job


I am submitting my Spark job with the spark-submit CLI, passing a wheel file via --py-files. I want to list all the packages included in the wheel file, on either the driver or the executor side. How can I find that? I tried spark.sparkContext._jsc.sc().listJars(), but it only returns the Java jars, not the Python packages.


Solution

  • You can list the .whl files submitted through --py-files by accessing the SparkFiles root directory. Here's a minimal example. Assuming this is your spark-submit command:

    spark-submit \
      --master "local[4]" \
      --py-files "/Users/lol/code/pyfilestest/whl-0.0.4-py2.py3-none-any.whl,\
    /Users/lol/code/pyfilestest/dir2path-0.1.0-py3-none-any.whl" \
      list_wheels.py
    

    (in this example I have added dir2path-0.1.0-py3-none-any.whl and whl-0.0.4-py2.py3-none-any.whl)

    This script, list_wheels.py, run via spark-submit, prints the list of .whl files submitted through --py-files, letting you confirm which packages have been uploaded.

    from pyspark import SparkContext, SparkFiles
    import os
    import glob
    
    if __name__ == '__main__':
        sc = SparkContext()
    
        # List all wheel files in SparkFiles root directory
        sparkfiles_dir = SparkFiles.getRootDirectory()
        wheel_files_with_path = glob.glob(os.path.join(sparkfiles_dir, '*.whl'))
        # Keep only the file names
        wheel_files = [os.path.basename(file) for file in wheel_files_with_path]
        print("wheel_files:", wheel_files)
    

    End result: a lot of verbose logging, and at the end:

    wheel_files: ['dir2path-0.1.0-py3-none-any.whl', 'whl-0.0.4-py2.py3-none-any.whl']
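
    The script above checks on the driver side. Since the question also asks about the executor side, here is a minimal sketch, an addition to the original answer: it runs the same glob inside a task via mapPartitions, assuming SparkFiles.getRootDirectory() on an executor resolves to the directory where the submitted files were fetched (this can differ by deploy mode and cluster manager):

    from pyspark import SparkContext, SparkFiles
    import glob
    import os

    def wheels_on_executor(_):
        # Runs inside a task on an executor; SparkFiles there points at
        # the executor-local directory holding the fetched files.
        root = SparkFiles.getRootDirectory()
        yield [os.path.basename(p) for p in glob.glob(os.path.join(root, '*.whl'))]

    if __name__ == '__main__':
        sc = SparkContext()
        # A single one-element partition is enough to trigger one task
        result = sc.parallelize([0], 1).mapPartitions(wheels_on_executor).collect()
        print("executor wheel_files:", result)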
    

    NB: If you also want to see what is inside the wheel files, you can change your script to:

    import os
    import glob
    from zipfile import ZipFile
    from pyspark import SparkContext, SparkFiles
    
    if __name__ == '__main__':
        sc = SparkContext()
        
        # List all wheel files in SparkFiles root directory
        sparkfiles_dir = SparkFiles.getRootDirectory()
        wheel_files_with_path = glob.glob(os.path.join(sparkfiles_dir, '*.whl'))
    
        for wheel_file in wheel_files_with_path:
            print(f"Listing contents of {os.path.basename(wheel_file)}:")
            
            # Open the wheel file as a ZIP archive and list its contents
            with ZipFile(wheel_file, 'r') as zip_ref:
                for filename in zip_ref.namelist():
                    print(f"  - {filename}")
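
    If you only need each package's name and version rather than every file, note that every wheel contains a .dist-info/METADATA member (per the wheel spec, PEP 427), whose Name: and Version: headers identify the distribution. Here is a short sketch of that approach, an addition to the answer above using only the standard library:

    import glob
    import os
    from zipfile import ZipFile
    from pyspark import SparkContext, SparkFiles

    def wheel_name_and_version(wheel_path):
        # Each wheel ships a <pkg>-<version>.dist-info/METADATA member;
        # its Name: and Version: headers identify the distribution.
        with ZipFile(wheel_path, 'r') as zip_ref:
            for member in zip_ref.namelist():
                if member.endswith('.dist-info/METADATA'):
                    headers = zip_ref.read(member).decode('utf-8').splitlines()
                    name = next((h.split(':', 1)[1].strip() for h in headers if h.startswith('Name:')), None)
                    version = next((h.split(':', 1)[1].strip() for h in headers if h.startswith('Version:')), None)
                    return name, version
        return None, None

    if __name__ == '__main__':
        sc = SparkContext()
        sparkfiles_dir = SparkFiles.getRootDirectory()
        for wheel_file in glob.glob(os.path.join(sparkfiles_dir, '*.whl')):
            print(wheel_file, '->', wheel_name_and_version(wheel_file))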