Search code examples
scalaapache-sparkspark-graphx

How to create links between vertices in RDD[(Long, Vertex)] based on a property?


I have a users: RDD[(Long, Vertex)] collection of users. I want to create links between my Vertex objects. The rule is: if two Vertex have the same value in a selected property - call it prop1, then a link exists.

My problem is how to check for every pair in the same collection. If I do:

val rels = users.map(
  x => users.map(y => if(x._2.prop1 == y._2.prop1){(x._1, y._1)}))

I got back an RDD[RDD[Any]] and not a RDD[(Long, Long)] as expected for the Graph to work


Solution

  • First of all you cannot start an action of a transformation from an another action or transformation not to mention create nested RDDs. So it is simply impossible you get RDD[RDD[Any]].

    What you need here is most likely a simple join roughly equivalent to something like this where T is a type of the property1:

    val pairs: RDD[(T, Long)] = users.map{ case (id, v) => (v.prop1, id) }
    val links: RDD[(Long, Long)] = pairs
      .join(pairs)  // join by a common property, equivalent to INNER JOIN in SQL
      .values  // drop properties
      .filter{ case (v1, v2) => v1 != v2 }  // filter self-links