
Importing PySpark packages


I have downloaded the graphframes package (from here) and saved it on my local disk. Now I would like to use it, so I run the following command:

IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4  --name gorelikboris_notebook_1  --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5

All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:

/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar. However, these files don't exist, and the /tmp/spark-1eXXX/userFiles-9XXX/ directory itself is empty.
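The underlying problem can be seen by checking whether the sys.path entries actually exist on disk. The sketch below uses a hypothetical stand-in path (the real ones are the /tmp/spark-.../userFiles-... entries above); in a live pyspark session you would iterate over sys.path directly:

```python
import os

# Hypothetical stand-in for the jar path the question reports seeing on
# sys.path; in a live session, iterate over sys.path instead.
candidates = [
    "/tmp/spark-demo/userFiles-demo/graphframes-0.1.0-spark1.5.jar",
]

# A sys.path entry is only importable if the file or directory exists;
# Spark can add jar paths that were never materialized on the driver.
for p in candidates:
    print(p, "->", "ok" if os.path.exists(p) else "missing")
```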

What am I missing?


Solution

  • This might be an issue with Spark packages and Python in general. Someone else asked about it earlier on the Spark user discussion alias as well.

    My workaround is to unpack the jar, find the Python code embedded in it, and move that code into a subdirectory called graphframes.

    For instance, I run pyspark from my home directory:

    ~$ ls -lart
    drwxr-xr-x 2 user user   4096 Feb 24 19:55 graphframes
    
    ~$ ls graphframes/
    __init__.pyc  examples.pyc  graphframe.pyc  tests.pyc
    
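    The unpack-and-move step can be scripted, since a jar is just a zip archive. The sketch below is a self-contained illustration, not the real jar: it first builds a stand-in jar with the layout assumed here (Python files under a top-level graphframes/ entry, which may differ in your jar), then extracts only that embedded package, which is what the workaround amounts to. With the real file you would skip the build step and point jar_path at ~/temp/graphframes-0.1.0-spark1.5.jar.

    ```python
    import os
    import tempfile
    import zipfile

    workdir = tempfile.mkdtemp()

    # Build a stand-in jar mimicking the assumed layout. With the real
    # download, skip this and set jar_path to the file you saved.
    jar_path = os.path.join(workdir, "graphframes-0.1.0-spark1.5.jar")
    with zipfile.ZipFile(jar_path, "w") as jar:
        for name in ("__init__.pyc", "examples.pyc", "graphframe.pyc", "tests.pyc"):
            jar.writestr("graphframes/" + name, b"")
        jar.writestr("com/graphframes/SomeClass.class", b"")  # JVM side, ignored

    # The workaround itself: extract only the embedded Python package into
    # the directory you launch pyspark from.
    dest = os.path.join(workdir, "home")
    with zipfile.ZipFile(jar_path) as jar:
        members = [m for m in jar.namelist() if m.startswith("graphframes/")]
        jar.extractall(dest, members=members)

    print(sorted(os.listdir(os.path.join(dest, "graphframes"))))
    ```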

    With that in place, you don't need the --py-files or --jars parameters; something like

    IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5

    combined with the Python code sitting in the graphframes directory should work.