Tags: python, apache-spark, pyspark, spark-graphx

How to create pair RDD with elements that share keys in source RDD?


I have a key-value RDD in pyspark and would like to return an RDD of pairs that have the same key in the source RDD.

# input RDD of (id, user) pairs
rdd1 = sc.parallelize([(1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"),
                       (3, "user2"), (3, "user4"), (3, "user1")])

# desired output
[("user1", "user2"), ("user1", "user3"), ("user1", "user4"), ("user2", "user4")]

So far I have been unable to come up with the right combination of functions to do this. The goal is to build an edge list of users based on a shared common key.


Solution

  • As far as I understand your description, something like this should work:

    output = (rdd1
       .groupByKey()    # collect all users sharing a key
       .mapValues(set)  # drop duplicate users within a key
       .flatMap(lambda kvs: [(x, y) for x in kvs[1] for y in kvs[1] if x < y])  # unordered pairs per key
       .distinct())     # drop pairs repeated across keys
    

    Unfortunately, groupByKey is a rather expensive operation, since it shuffles all values for each key across the cluster.
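    To sanity-check the logic without a Spark cluster, here is a plain-Python sketch that mirrors the same steps (group by key, form unordered pairs within each group, deduplicate across groups). The shared_key_pairs helper is hypothetical, written only for illustration, and is not part of the PySpark API:

    ```python
    from itertools import combinations

    def shared_key_pairs(pairs):
        # group values by key, mirroring groupByKey + mapValues(set)
        groups = {}
        for key, user in pairs:
            groups.setdefault(key, set()).add(user)
        edges = set()
        for users in groups.values():
            # sorted() + combinations() reproduces the `x < y` filter,
            # so each unordered pair appears once per group
            edges.update(combinations(sorted(users), 2))
        # the set plays the role of distinct() across groups
        return sorted(edges)

    data = [(1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"),
            (3, "user2"), (3, "user4"), (3, "user1")]
    print(shared_key_pairs(data))
    # → [('user1', 'user2'), ('user1', 'user3'), ('user1', 'user4'), ('user2', 'user4')]
    ```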