python · pyspark · rdd

How to perform a VLOOKUP-style join on Spark RDDs


I have two RDDs:

rdd1 =[('1', 3428), ('2', 2991), ('3', 2990), ('4', 2883), ('5', 2672), ('5', 2653)]
rdd2 = [['1', 'Toy Story (1995)'], ['2', 'Jumanji (1995)'], ['3', 'Grumpier Old Men (1995)']]

I want to replace the first element of each pair in the first RDD with the matching second element from the second RDD.

My final result should look like this:

[('Toy Story (1995)', 3428), ('Jumanji (1995)', 2991), ('Grumpier Old Men (1995)', 2990)]

Please suggest a way to do this.


Solution

  • Use join followed by map. join matches the two RDDs by key and yields (key, (count, title)) pairs; the map then swaps each value pair to (title, count). Keys that appear in only one RDD (here '4' and '5') are dropped, since join is an inner join:

    rdd1.join(rdd2).map(lambda x: (x[1][1], x[1][0])).collect()
    #[('Toy Story (1995)', 3428),
    # ('Jumanji (1995)', 2991),
    # ('Grumpier Old Men (1995)', 2990)]
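For readers without a Spark session handy, here is a plain-Python sketch of what the join-and-map pipeline computes. The dictionary comprehension plays the role of the lookup side of the join, and the `if k in titles` filter mirrors the inner-join semantics that drop unmatched keys:

```python
# The same data as in the question.
rdd1 = [('1', 3428), ('2', 2991), ('3', 2990), ('4', 2883), ('5', 2672), ('5', 2653)]
rdd2 = [['1', 'Toy Story (1995)'], ['2', 'Jumanji (1995)'], ['3', 'Grumpier Old Men (1995)']]

# Build a lookup table from rdd2: key -> title.
titles = {k: v for k, v in rdd2}

# Inner join: keep only keys present in both sides, and emit (title, count),
# just like rdd1.join(rdd2).map(lambda x: (x[1][1], x[1][0])).
joined = [(titles[k], count) for k, count in rdd1 if k in titles]

print(joined)
# [('Toy Story (1995)', 3428), ('Jumanji (1995)', 2991), ('Grumpier Old Men (1995)', 2990)]
```

Note that unlike this list-based sketch, Spark's join makes no ordering guarantee, so you may see the same pairs in a different order after collect().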