apache-spark graphframes

Does the GraphFrames API support creation of bipartite graphs?


Does the GraphFrames API support creation of bipartite graphs in the current version?

GraphFrames version: 0.1.0

Spark version: 1.6.1


Solution

  • As pointed out in the comments to this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are more than flexible enough to let you create them. For a GraphX solution, see this previous answer. That solution uses a trait shared between the different vertex types. While that works with RDDs, it is not going to work for DataFrames. A row in a DataFrame has a fixed schema: it can't sometimes contain a price column and sometimes not. It can have a price column that's sometimes null, but the column has to exist in every row.

    Instead, the solution for GraphFrames seems to be to define a DataFrame that is essentially a linear sub-type of both kinds of object in your bipartite graph: it has to contain all of the fields of both. This is actually pretty easy, because a full_outer join on the id gives you exactly that, provided the ids of the two sides don't overlap (rows sharing an id would be merged into a single vertex). Start with two DataFrames, something like this:

    val players = Seq(
      (1,"dave", 34),
      (2,"griffin", 44)
    ).toDF("id", "name", "age")
    
    val teams = Seq(
      (101,"lions","7-1"),
      (102,"tigers","5-3"),
      (103,"bears","0-9")
    ).toDF("id","team","record")
    

    You could then create a super-set DataFrame like this:

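    import org.apache.spark.sql.functions.coalesce

    // Rename the id columns so both survive the join, then merge them back into one "id".
    val teamPlayer = players.withColumnRenamed("id", "l_id").join(
      teams.withColumnRenamed("id", "r_id"),
      $"r_id" === $"l_id", "full_outer"
    ).withColumn("l_id", coalesce($"l_id", $"r_id")) // take whichever side's id is non-null
     .drop($"r_id")
     .withColumnRenamed("l_id", "id")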
    
    teamPlayer.show
    
    +---+-------+----+------+------+
    | id|   name| age|  team|record|
    +---+-------+----+------+------+
    |101|   null|null| lions|   7-1|
    |102|   null|null|tigers|   5-3|
    |103|   null|null| bears|   0-9|
    |  1|   dave|  34|  null|  null|
    |  2|griffin|  44|  null|  null|
    +---+-------+----+------+------+
    

    You could possibly do it a little cleaner with structs:

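    import org.apache.spark.sql.functions.{coalesce, struct}

    // Pack each side's fields into one struct column, so a vertex has either "player" or "team" set.
    val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
      teams.select($"id" as "r_id", struct($"team", $"record") as "team"),
      $"l_id" === $"r_id",
      "full_outer"
    ).withColumn("l_id", coalesce($"l_id", $"r_id")) // take whichever side's id is non-null
     .drop($"r_id")
     .withColumnRenamed("l_id", "id")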
    
    tpStructs.show
    
    +---+------------+------------+
    | id|      player|        team|
    +---+------------+------------+
    |101|        null| [lions,7-1]|
    |102|        null|[tigers,5-3]|
    |103|        null| [bears,0-9]|
    |  1|   [dave,34]|        null|
    |  2|[griffin,44]|        null|
    +---+------------+------------+
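
    The combined DataFrame can then be handed to GraphFrames together with an edge DataFrame. What follows is only a minimal sketch. It assumes the spark-shell session above (so teamPlayer is defined and the $ syntax is available) with the graphframes package on the classpath. The playsFor edges are made up for illustration and are not part of the original answer. GraphFrames does not enforce bipartiteness itself, so the last step checks it by hand:

```scala
import org.graphframes.GraphFrame

// Hypothetical edges (illustration only): player id -> team id.
// GraphFrames expects the edge columns to be named "src" and "dst".
val playsFor = Seq(
  (1, 101),
  (2, 103)
).toDF("src", "dst")

// teamPlayer (built earlier) already has the required "id" column.
val g = GraphFrame(teamPlayer, playsFor)

// Each side of the bipartite graph is recovered with a null check.
val playerVertices = g.vertices.filter($"name".isNotNull)
val teamVertices = g.vertices.filter($"team".isNotNull)

// Count edges that do not run player -> team; any such edge breaks bipartiteness.
val badEdges = g.find("(a)-[e]->(b)")
  .filter($"a.name".isNull || $"b.team".isNull)
  .count
```

    With the struct variant, the same checks would test $"player" and $"team" for null instead of the individual fields.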
    

    I'll also point out that more or less the same solution would work in GraphX with RDDs. You could always build the vertex RDD by full-outer-joining two pair RDDs whose case classes don't share any trait:

    case class Player(name: String, age: Int)
    val playerRdd = sc.parallelize(Seq(
      (1L, Player("dave", 34)),
      (2L, Player("griffin", 44))
    ))
    
    case class Team(team: String, record: String)
    val teamRdd = sc.parallelize(Seq(
      (101L, Team("lions", "7-1")),
      (102L, Team("tigers", "5-3")),
      (103L, Team("bears", "0-9"))
    ))
    
    playerRdd.fullOuterJoin(teamRdd).collect foreach println
    (101,(None,Some(Team(lions,7-1))))
    (1,(Some(Player(dave,34)),None))
    (102,(None,Some(Team(tigers,5-3))))
    (2,(Some(Player(griffin,44)),None))
    (103,(None,Some(Team(bears,0-9))))
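
    If the joined RDD is used as the vertex RDD of a graph, every vertex attribute is an (Option[Player], Option[Team]) pair, and the two sides can be told apart by pattern matching. A minimal plain-Scala sketch of that idea (no Spark needed, pasteable into the REPL or spark-shell); the names Attr and describe are invented here for illustration. The last case matters because a player and a team that happened to share an id would be merged into one vertex by the join:

```scala
case class Player(name: String, age: Int)
case class Team(team: String, record: String)

// Shape of the vertex attribute produced by fullOuterJoin.
type Attr = (Option[Player], Option[Team])

// Hypothetical helper: render a vertex according to which side it belongs to.
def describe(id: Long, attr: Attr): String = attr match {
  case (Some(p), None) => s"player $id: ${p.name}, age ${p.age}"
  case (None, Some(t)) => s"team $id: ${t.team} (${t.record})"
  case _               => s"vertex $id: id collision or empty attribute"
}

// A few vertices in the same shape as the collected join result.
val sampleVertices: Seq[(Long, Attr)] = Seq(
  (1L, (Some(Player("dave", 34)), None)),
  (101L, (None, Some(Team("lions", "7-1"))))
)

sampleVertices.foreach { case (id, attr) => println(describe(id, attr)) }
```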
    

    With all respect to the previous answer, this seems like a more flexible way to handle it, since the combined objects don't have to share a trait.