
Spark GraphFrames: how to use the component ID from connectedComponents


I'm trying to find all the connected components (in this example, 4 is connected to 100, 2 is connected to 200, etc.). I built the graph with val g2 = GraphFrame(v2, e2) and ran val result2 = g2.connectedComponents.run(), which returns the nodes with a component ID. My problem is: how do I use this ID to see all the connected nodes, and how do I find out which nodes a given ID belongs to? Many thanks. I'm quite new to this.

val v2 = sqlContext.createDataFrame(List(
  ("a", 1),
  ("b", 2),
  ("c", 3),
  ("d", 4),
  ("e", 100),
  ("f", 200),
  ("g", 300),
  ("h", 400)
)).toDF("nodes", "id")


val e2 = sqlContext.createDataFrame(List(
  (4, 100, "friend"),
  (2, 200, "follow"),
  (3, 300, "follow"),
  (4, 400, "follow"),
  (1, 100, "follow"),
  (1, 400, "friend")
)).toDF("src", "dst", "relationship")

In this example I expect to see the connections below:

+----+----+
|   4| 400|
|   4| 100|
|   1| 400|
|   1| 100|
+----+----+

This is what the result shows now, as (id, component) pairs:
(1,1), (2,2), (3,1), (4,1), (100,1), (200,2), (300,3), (400,1). How do I see all the connections?


Solution

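  • You declared "a", "b", "c", ... as your graph's node ids, but then used 1, 2, 3, ... as node ids when defining the edges. GraphFrames reads vertex ids from the column named "id" in the vertices DataFrame, and the "src" and "dst" values in the edges DataFrame must match those ids.

    Make the numbers 1, 2, 3, ... the node ids by naming that column "id" when you create the vertices DataFrame: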

    val v2 = sqlContext.createDataFrame(List(
      ("a", 1),
      ("b", 2),
      ("c", 3),
      ("d", 4),
      ("e", 100),
      ("f", 200),
      ("g", 300),
      ("h", 400)
    )).toDF("nodes", "id")
    

    That should give you the desired results.
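
    Once the ids line up, the component column can be used like any other DataFrame column. The sketch below is one possible way to do that, not the only one. It assumes a spark-shell session started with the GraphFrames package on the classpath, and it uses SparkSession instead of the sqlContext from the question. The checkpoint path and the names vertices, edges, members, sameAs4 and edgesWithComponent are illustrative choices, not part of the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, sort_array}
import org.graphframes.GraphFrame

// Assumption: local session; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().appName("cc-example").master("local[*]").getOrCreate()
import spark.implicits._

// The default connectedComponents algorithm checkpoints intermediate results,
// so a checkpoint directory must be set. The path here is a placeholder.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

// Same data as the question; the numeric column is the one named "id".
val vertices = Seq(
  ("a", 1), ("b", 2), ("c", 3), ("d", 4),
  ("e", 100), ("f", 200), ("g", 300), ("h", 400)
).toDF("nodes", "id")

val edges = Seq(
  (4, 100, "friend"), (2, 200, "follow"), (3, 300, "follow"),
  (4, 400, "follow"), (1, 100, "follow"), (1, 400, "friend")
).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Result has the vertex columns plus a "component" column.
val components = g.connectedComponents.run()

// 1) List every node of each component, one row per component.
val members = components
  .groupBy("component")
  .agg(sort_array(collect_list("id")).as("members"))
members.show(false)

// 2) Find all nodes that share a component with one given node (here id 4).
val componentOf4 = components.filter(col("id") === 4).select("component")
val sameAs4 = components.join(componentOf4, "component").select("id", "nodes")
sameAs4.show()

// 3) Tag each edge with the component of its source vertex,
//    which gives the (src, dst) connections grouped by component.
val edgesWithComponent = edges
  .join(components.select(col("id").as("src"), col("component")), "src")
  .select("component", "src", "dst")
edgesWithComponent.orderBy("component", "src", "dst").show()
```

    With the edges above, nodes 1, 4, 100 and 400 end up in one component, 2 and 200 in a second, and 3 and 300 in a third. The actual component id values are chosen by GraphFrames, so it is safer to look a component up through a known node, as in step 2, than to hard-code an id.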