apache-spark, azure-hdinsight, spark-graphx, graphframes

How to use GraphFrames inside Spark on an HDInsight cluster


I have set up a Spark cluster on HDInsight and am trying to use GraphFrames by following this tutorial.

I have already used the custom scripts during cluster creation to enable GraphX on the Spark cluster, as described here.

When I run this in the notebook,

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

import org.graphframes._

I get the following error:

<console>:45: error: object graphframes is not a member of package org
       import org.graphframes._
                  ^

I tried to install GraphFrames from the Spark terminal in Jupyter using the following command:

$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.5

but I still cannot get it working. I am new to Spark and HDInsight, so can someone please point out what else I need to install on this cluster to get this working?


Solution

  • Today, this works in spark-shell but not in a Jupyter notebook. When you run $SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.5, the package is loaded (at least on the Spark 1.6 cluster version), but only in the context of that spark-shell session. Jupyter currently has no way to load packages, which is why the import fails in the notebook. This feature is going to be added soon to the Jupyter notebooks on the clusters. In the meantime, you can use spark-shell, spark-submit, etc.
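
To confirm that the package really is available inside the spark-shell session started with the --packages command above, a minimal sketch like the following can be pasted into the shell. It assumes a Spark 1.6 shell, where sqlContext is predefined; the vertex and edge data are made up purely for illustration.

```scala
// Paste into a spark-shell session launched with:
//   --packages graphframes:graphframes:0.1.0-spark1.5
// Assumption: Spark 1.6, where the shell predefines `sqlContext`.
import org.graphframes.GraphFrame

// Hypothetical sample vertices; GraphFrames requires an "id" column.
val vertices = sqlContext.createDataFrame(Seq(
  ("a", "Alice"),
  ("b", "Bob"),
  ("c", "Carol")
)).toDF("id", "name")

// Hypothetical sample edges; GraphFrames requires "src" and "dst" columns.
val edges = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows"),
  ("b", "c", "follows"),
  ("c", "a", "follows")
)).toDF("src", "dst", "relationship")

// If the import and this constructor succeed, the package is on the classpath.
val g = GraphFrame(vertices, edges)

// Print the number of incoming edges per vertex.
g.inDegrees.show()
```

If the import fails here with the same "object graphframes is not a member of package org" error, the shell was most likely started without the --packages flag.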