Tags: python, apache-spark, pyspark, apache-spark-sql, rdd

How do I add values from a list into each item of an RDD?


Say I have a regular Python list [1, 2] and an RDD with 2 items, like [('hi', 'bye'), ('hi', 'bye')], and I want each item to become:

('hi', 'bye', 1)
('hi', 'bye', 2)

Essentially, I want to append each item from the list to the corresponding item in the RDD. I feel like this should be simple, but I can't think of the logic :/


Solution

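  • You can use the zip method of RDD, which pairs elements by position. Both RDDs must have the same number of partitions and the same number of elements in each partition: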

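    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()  # reuse the running context if there is one

    rdd1 = sc.parallelize([('hi', 'bye'), ('hi', 'bye')])
    rdd2 = sc.parallelize([1, 2])

    # zip yields pairs like (('hi', 'bye'), 1); the map flattens each into a 3-tuple
    rdd3 = rdd1.zip(rdd2).map(lambda x: (x[0][0], x[0][1], x[1]))

    rdd3.collect()
    # [('hi', 'bye', 1), ('hi', 'bye', 2)]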
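
If the tuples in the RDD are not always two items long, the flattening step can be written without hard-coded indices. Below is a minimal sketch of that idea; the helper name flatten_pair is made up for illustration, and Python's built-in zip stands in for rdd1.zip(rdd2) so the logic can be checked without a Spark session.

```python
def flatten_pair(pair):
    """Turn ((a, b, ...), value) into (a, b, ..., value)."""
    row, value = pair
    return tuple(row) + (value,)


# Local stand-in for rdd1.zip(rdd2): the built-in zip also pairs by position.
rows = [('hi', 'bye'), ('hi', 'bye')]
values = [1, 2]

result = [flatten_pair(p) for p in zip(rows, values)]
print(result)
# [('hi', 'bye', 1), ('hi', 'bye', 2)]
```

With Spark, the same helper should be usable in place of the lambda, i.e. rdd1.zip(rdd2).map(flatten_pair).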