Tags: python, apache-spark, pyspark, rdd

Flatten list within RDD of tuples with type (List,Integer)


I have an RDD of tuples with the format (List,Integer) in PySpark.

Example:

(["Hello","How","are","you"],12)

I want to convert this into an RDD with one (String, Integer) tuple per list element, each paired with the original integer:

("Hello",12),
("How",12),
("are",12),
("you",12)

Solution

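  • You can use flatMap, which applies a function to every element of the RDD and flattens the returned lists into a single RDD: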

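    # Assumes an existing SparkContext named `sc` (as in the pyspark shell).
    rdd = sc.parallelize([(["Hello", "How", "are", "you"], 12)])
    rdd.collect()
    # [(['Hello', 'How', 'are', 'you'], 12)]
    
    # Emit one (word, number) tuple per word; flatMap merges the per-record lists.
    rdd2 = rdd.flatMap(lambda r: [(i, r[1]) for i in r[0]])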
    
    rdd2.collect()
    # [('Hello', 12), ('How', 12), ('are', 12), ('you', 12)]
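
To see why flatMap rather than map is needed here, the following is a minimal plain-Python sketch of the same "map, then flatten one level" behaviour. It does not use Spark at all: `flat_map` is a hypothetical helper written only for illustration, and `records` stands in for the contents of the RDD.

```python
from itertools import chain


def flat_map(func, records):
    """Plain-Python stand-in for RDD.flatMap (hypothetical helper, not a Spark API):
    apply func to each record, then flatten the results by one level."""
    return list(chain.from_iterable(func(r) for r in records))


# Same sample record as in the question.
records = [(["Hello", "How", "are", "you"], 12)]

# Pair every word in the list with the integer from the same tuple.
pairs = flat_map(lambda r: [(word, r[1]) for word in r[0]], records)

# For comparison, a plain map keeps one (nested) list per input record.
nested = [[(word, r[1]) for word in r[0]] for r in records]

print(pairs)
print(nested)
```

With a plain map, the result would contain one list of tuples per input record; the flattening step is what turns those lists into individual (word, number) elements.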