Search code examples
graphapache-spark-sqlspark-graphxconnected-componentsgraphframes

efficiently calculating connected components in pyspark


I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.

City | SRC | DEST

Houston Kyle -> Benny

Houston Benny -> Charles

Houston Charles -> Denny

Omaha Carol -> Brian

etc.

I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so?

edit: I thought I could do something like

select connected_components(*) from dataframe groupby city

where connected_components generates a list of items.


Solution

  • Suppose your data is like this

    import org.apache.spark._
    import org.graphframes._
    
    val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
                ("Houston","Charles","Denny"),("Omaha","carol","Brian"),
                ("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
    var df = spark.createDataFrame(l).toDF("city","src","dst")
    

    Create a list of cities for which you want to run connected components cities = List("Houston","Omaha")

    Now run a filter on city column for every city in cities list, then create an edge and vertex dataframes from the resulting dataframe. Create a graphframe from these edge and vertices dataframes and run connected components algorithm

    val cities = List("Houston","Omaha")
    
    for(city <- cities){
        val edges = df.filter(df("city") === city).drop("city")
        val vert = edges.select("src").union(edges.select("dst")).
                         distinct.select(col("src").alias("id"))
        val g = GraphFrame(vert,edges)
        val res = g.connectedComponents.run()
        res.select("id", "component").orderBy("component").show()
    }
    

    Output

    |     id|   component|
    +-------+------------+
    |   Kyle|249108103168|
    |charles|249108103168|
    |  Benny|249108103168|
    |Charles|721554505728|
    |  Denny|721554505728|
    +-------+------------+
    
    +------+------------+                                                           
    |    id|   component|
    +------+------------+
    | Marry|858993459200|
    |  Sara|858993459200|
    | Brian|944892805120|
    | carol|944892805120|
    |Daniel|944892805120|
    +------+------------+