Tags: pyspark, google-cloud-dataproc, graphframes

PySpark exception with GraphFrames


I am building a simple network graph with PySpark and GraphFrames (running on Google Dataproc):

from graphframes import GraphFrame

vertices = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])

edges = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

Then I try to run label propagation:

result = g.labelPropagation(maxIter=5)

But I get the following error:

Py4JJavaError: An error occurred while calling o164.run.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 829, cluster-network-graph-w-12.c.myproject-bi.internal, executor 2): java.lang.ClassNotFoundException: org.graphframes.GraphFrame$$anonfun$5

It looks like the GraphFrames package isn't available, but only when I run label propagation. How can I fix it?


Solution

  • I solved it by creating the SparkSession with the following configuration:

    import pyspark
    from pyspark.sql import SparkSession

    # spark.jars.packages ships the GraphFrames JVM package to the driver and every executor;
    # spark.jars additionally pulls in the BigQuery connector JAR from GCS.
    conf = pyspark.SparkConf().setAll([('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
                                       ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11')])

    spark = SparkSession.builder \
      .appName('testing bq') \
      .config(conf=conf) \
      .getOrCreate()
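
With that session in place, the ClassNotFoundException goes away because the GraphFrames JAR is distributed to the executors. A minimal sketch of the rest of the job, assuming vertices and edges are the DataFrames defined in the question:

    from graphframes import GraphFrame

    # Build the graph against the session that has the GraphFrames package on its classpath
    g = GraphFrame(vertices, edges)

    # labelPropagation returns the vertices DataFrame with an extra 'label' column
    # identifying the community each vertex was assigned to
    result = g.labelPropagation(maxIter=5)
    result.select("id", "name", "label").show()

Note that the package coordinates encode the Spark and Scala versions (0.7.0-spark2.3-s_2.11 is built for Spark 2.3 / Scala 2.11), so pick the GraphFrames build that matches what your Dataproc cluster is running.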