Tags: apache-spark, pyspark, apache-zeppelin, graphframes

How to add graphframes to Apache Zeppelin


I am trying to use the graphframes library on Apache Zeppelin with the Spark (pyspark) interpreter. However, whenever I try to import the module with from graphframes import *, I get the error: ModuleNotFoundError: No module named 'graphframes'.

I have tried adding the --packages 'graphframes:graphframes:0.7.0-spark2.4-s_2.11' directive to the zeppelin-env.sh file, calling z.load('graphframes:graphframes:0.7.0-spark2.4-s_2.11'), and adding graphframes as a dependency in the interpreter settings. None of these attempts worked.

I have also tried adding a Spark repository to Zeppelin and then adding the Maven coordinates for graphframes to the interpreter's dependencies section. This did not work either.

I am using Spark 2.4 with Scala 2.11 on Zeppelin 0.8.1, hosted on an EMR cluster.

I am able to use graphframes from the terminal using pyspark and the --packages directive mentioned above, so this seems to be a Zeppelin-related issue.

I am stumped as to what to try next. Any ideas on how I can get graphframes to work on Zeppelin?


Solution

  • I think the problem is your PYTHONPATH in Zeppelin. You can see the PYTHONPATH with:

    import sys
    print(sys.path)
    

    It works with the pyspark console because there the package is installed in a location that is already part of the PYTHONPATH. You can check that with:

    import graphframes
    print(graphframes.__file__)
    

    So all you have to do is add the package to your PYTHONPATH. First, add the following line to /etc/spark/conf/spark-defaults.conf (other ways, such as passing the --packages parameter in SPARK_SUBMIT_OPTIONS, should work as well):

    spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11

    After that, add the following line to /etc/spark/conf/spark-env.sh to extend your PYTHONPATH (check the package location on your system first):

    export PYTHONPATH=$PYTHONPATH:/var/lib/zeppelin/.ivy2/jars/graphframes_graphframes-0.7.0-spark2.4-s_2.11.jar

    Restart the Spark interpreter in Zeppelin to make sure that all changes are applied.
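
If you want to confirm the diagnosis before editing spark-env.sh, the same idea can be tried from a %pyspark paragraph. The following is only a rough sketch: the helper names module_origin and add_jars_to_sys_path are made up for illustration, the directory is the one used in the answer above and may differ on your cluster, and it assumes the graphframes jar bundles the Python package (Python can import from a jar because it is a zip archive). It only changes sys.path for the driver's Python process, so the jar itself still has to reach Spark through spark.jars.packages or --packages.

```python
import glob
import importlib
import importlib.util
import os
import sys


def module_origin(name):
    """Return the file a top-level module would load from, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def add_jars_to_sys_path(jar_dir, pattern="graphframes*.jar"):
    """Append jars in jar_dir that match pattern to sys.path; return the matching paths."""
    matches = sorted(glob.glob(os.path.join(jar_dir, pattern)))
    for jar in matches:
        if jar not in sys.path:
            sys.path.append(jar)
    return matches


# Directory taken from the answer above (an assumption; adjust to where your jar lives).
JAR_DIR = "/var/lib/zeppelin/.ivy2/jars"

print("before:", module_origin("graphframes"))
print("jars added:", add_jars_to_sys_path(JAR_DIR))
importlib.invalidate_caches()
print("after:", module_origin("graphframes"))
```

If "after" prints a path inside the jar, from graphframes import * should work in that interpreter session, which suggests that the PYTHONPATH change in spark-env.sh is the right permanent fix.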