Tags: apache-spark, pyspark, jupyter-notebook, azure-hdinsight

Using Spark packages with Jupyter Notebook on HDInsight


I'm trying to use GraphFrames with PySpark via a Jupyter notebook. My Spark cluster is on HDInsight, so I don't have access to edit kernel.json.

The solutions suggested [here][1] and [here][2] didn't work. This is what I tried to run:

import os
packages = "graphframes:graphframes:0.3.0-spark2.0" # -s_2.11
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
from graphframes import *

This resulted in an error saying that no module named graphframes exists. Is there a way to initiate a new SparkContext after changing this environment variable?

I've also tried passing the PYSPARK_SUBMIT_ARGS variable to IPython via the %set_env magic command and then importing graphframes:

%set_env PYSPARK_SUBMIT_ARGS='--packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 pyspark-shell'

from graphframes import *

But this resulted in the same error.

I saw some suggestions to pass the jar to IPython, but I'm not sure how to download the needed jar to my HDInsight cluster.

Do you have any suggestions?


Solution

  • It turns out I had two separate issues:

    1) I was using the wrong syntax to configure the notebook. You should use:

    # For HDInsight 3.3 and HDInsight 3.4
    %%configure 
    { "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
    
    # For HDInsight 3.5
    %%configure 
    { "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" }}
    

    Here are the relevant docs from Microsoft.

    2) According to this useful answer, there seems to be a bug in Spark that causes it to miss the package's jar. This worked for me (a combined sketch of both fixes follows below):

    import os
    sc.addPyFile(os.path.expanduser('./graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar'))
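
For completeness, here is how the two fixes might be combined for the graphframes package from the question. The HDInsight 3.5-style %%configure cell would presumably look like this (same pattern as above, using the coordinate from the question; I haven't verified this exact version against every cluster setup):

    %%configure
    { "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}

A later cell could then apply the addPyFile workaround and run a quick sanity check. The jar filename is the one from the snippet above, and the tiny vertex/edge DataFrames are made up purely for illustration:

    import os

    # Work around the missing-jar issue. The path/filename is an assumption:
    # adjust it to wherever the package jar actually lands on your driver node.
    sc.addPyFile(os.path.expanduser('./graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar'))

    from graphframes import GraphFrame

    # Tiny made-up graph just to confirm that GraphFrame objects can be built
    v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])
    g = GraphFrame(v, e)
    g.edges.show()

If the import succeeds and the edges show up, the package is on both the JVM classpath and the Python path.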