I have an RDD of tuples with the format (List,Integer) in PySpark.
Example:
(["Hello","How","are","you"],12)
I want to convert this to an RDD with one (word, integer) tuple per list element:
("Hello",12),
("How",12),
("are",12),
("you",12)
You can use flatMap, which maps each record to a list of output records and then flattens those lists into a single RDD:
rdd.collect()
# [(['Hello', 'How', 'are', 'you'], 12)]
rdd2 = rdd.flatMap(lambda r: [(i, r[1]) for i in r[0]])
rdd2.collect()
# [('Hello', 12), ('How', 12), ('are', 12), ('you', 12)]
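To see what flatMap is doing here without starting Spark, the same logic can be sketched in plain Python. This is only an illustration: flat_map below is a hypothetical helper written for this example, not part of the PySpark API, and it works on an ordinary list rather than an RDD.

```python
from itertools import chain

def flat_map(records, f):
    """Hypothetical stand-in for RDD.flatMap on a plain list:
    apply f to each record, then flatten the results by one level."""
    return list(chain.from_iterable(f(r) for r in records))

# Same shape as the RDD in the question: (list of words, integer)
records = [(["Hello", "How", "are", "you"], 12)]

# Same lambda as in the answer: pair every word with the record's integer
pairs = flat_map(records, lambda r: [(i, r[1]) for i in r[0]])
print(pairs)
# [('Hello', 12), ('How', 12), ('are', 12), ('you', 12)]
```

Using map instead of flatMap with the same lambda would keep one output element per input record, so you would get a single nested list of tuples rather than four separate tuples.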