Tags: python, matplotlib, pyspark, python-import, libraries

How to import matplotlib python library in pyspark using sc.addPyFile()?


I am using Spark with Python, both interactively, by launching the pyspark command from the terminal, and by running an entire script with the command spark-submit pythonFile.py.

I am using it to analyze a local CSV file, so no distributed computation is performed.

I would like to use the matplotlib library to plot columns of a dataframe. When importing matplotlib I get the error ImportError: No module named matplotlib. I then came across this question and tried the command sc.addPyFile(), but I could not find any file related to matplotlib on my OS (OSX) that I could pass to it.
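
For reference, this is roughly the situation (a minimal sketch; the CSV path is just a placeholder):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    data = sc.textFile("myFile.csv")   # local file, no distributed computation
    import matplotlib                  # fails with: ImportError: No module named matplotlib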

For this reason I created a virtual environment and installed matplotlib in it. Navigating through the virtual environment I saw there was no single file such as matplotlib.py, so I tried to pass the entire folder with sc.addPyFile("venv/lib/python3.7/site-packages/matplotlib"), but again with no success.

At this point I do not know which file I should include, or how, and I have run out of ideas.

Is there a simple way to import the matplotlib library inside Spark (installing it with virtualenv or referencing the OS installation)? And if so, which *.py files should I pass to sc.addPyFile()?

Again, I am not interested in distributed computation: the Python code will run only locally on my machine.


Solution

  • I will post what I did. First of all, I am working with virtualenv, so I created a new one with virtualenv path.

    Then I activated it with source path/bin/activate.

    I installed the packages I needed with pip3 install packageName.
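
    Put together, the setup looks like this (assuming the environment is named path, as above; the package names mirror the ones zipped later):

    virtualenv path
    source path/bin/activate
    pip3 install matplotlib numpy seaborn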

    After that I created a Python script that builds a zip archive of a library installed by virtualenv under the path ./path/lib/python3.7/site-packages/.

    The code of this script is the following (here it zips only numpy):

    import os
    import zipfile

    # function to archive a single package
    def ziplib(general_path, libName):
        libpath = os.path.join(general_path, libName)   # path of the package directory itself
        zippath = libName + '.zip'                      # archive created in the current (writable) directory
        zf = zipfile.PyZipFile(zippath, mode='w')
        try:
            zf.debug = 3            # verbose output, useful for debugging
            zf.writepy(libpath)     # add the package's Python modules to the archive
            return zippath          # path of the generated zip archive
        finally:
            zf.close()


    general_path = './path/lib/python3.7/site-packages/'
    matplotlib_name = 'matplotlib'
    seaborn_name = 'seaborn'
    numpy_name = 'numpy'
    zip_path = ziplib(general_path, numpy_name)     # generate the zip archive containing the library
    print(zip_path)
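
    As a quick sanity check (an addition of mine, not part of the original workflow), you can list what actually ended up in the archive:

    import zipfile

    with zipfile.ZipFile('numpy.zip') as zf:
        print(zf.namelist()[:10])   # first few entries stored in the archive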
    

    After that, the archives must be referenced in the pyspark file myPyspark.py, by calling the addPyFile() method of the SparkContext class. Once that is done you can import the libraries in your code as usual. In my case I did the following:

    from pyspark import SparkContext
    sc = SparkContext.getOrCreate()
    sc.addPyFile("matplotlib.zip")   # generated with testZip.py
    sc.addPyFile("numpy.zip")        # generated with testZip.py
    import matplotlib
    import numpy
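
    To close the loop on the original goal (plotting a column of the dataframe), here is a minimal sketch of how the imported library can then be used; data.csv, the column name x, and the Agg backend are placeholder choices of mine, not part of the original answer:

    import matplotlib
    matplotlib.use('Agg')   # select a non-interactive backend before importing pyplot
    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    pdf = df.select("x").toPandas()   # the data is local anyway, so collecting it is fine (requires pandas)
    plt.plot(pdf["x"])
    plt.savefig("x.png")              # write the figure to a file instead of opening a window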
    

    When you launch the script you also have to reference the zip archives in the command with --py-files (note that the flag takes a single comma-separated list of archives). For example:

    sudo spark-submit --py-files matplotlib.zip,numpy.zip myPyspark.py
    

    I used two archives because it was clear to me how to import one of them, but not two.
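
    The same pattern extends to more packages: loop ziplib over the names defined above (a sketch of my own; the archive names follow from libName + '.zip'):

    for name in [matplotlib_name, seaborn_name, numpy_name]:
        print(ziplib(general_path, name))   # creates matplotlib.zip, seaborn.zip, numpy.zip

    and then submit with every archive in the single comma-separated list:

    sudo spark-submit --py-files matplotlib.zip,seaborn.zip,numpy.zip myPyspark.py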