Search code examples
pythonapache-sparkpysparkapache-spark-sqlgraphframes

Convert GraphFrames ShortestPath Map into DataFrame rows in PySpark


I am trying to find the most efficient way to take the Map output from the GraphFrames function shortestPaths and flatten each vertex's distances map into individual rows in a new DataFrame. I've been able to do it very clumsily by pulling the distances column into a dictionary and then convert from there into a pandas dataframe and then converting back to a Spark dataframe, but I know there must be a better way.

from graphframes import *

v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

results = g.shortestPaths(landmarks=["a", "b","c"])
results.select("id","distances").show()

+---+--------------------+
| id|           distances|
+---+--------------------+
|  a|Map(a -> 0, b -> ...|
|  b| Map(b -> 0, c -> 1)|
|  c| Map(c -> 0, b -> 1)|
+---+--------------------+

What I want is to take the output above and flatten the distances while keeping the ids into something like this:

+---+---+---------+      
| id| v | distance|
+---+---+---------+
|  a| a | 0       |
|  a| b | 1       |
|  a| c | 2       |
|  b| b | 0       |
|  b| c | 1       |
|  c| c | 0       |
|  c| b | 1       |
+---+---+---------+ 

Thanks.


Solution

  • You can explode:

    >>> from pyspark.sql.functions import explode
    >>> results.select("id", explode("distances"))