Tags: python, apache-spark, dictionary, pyspark, rdd

Checking items in a list against a PySpark RDD


I have the following PySpark RDD of IDs and their counts:

rdd = [('12', 560), ('34', 900), ('56', 800), ('78', 100), ('910', 220), ('125', 410), ('111', 41), etc.]

And I have a regular list:

id_list = ['12', '125', '78']

I want a new list of (id, count) pairs containing only the ids from id_list, with their counts taken from the RDD.

So expected output:

new_list = [('12', 560), ('125', 410), ('78', 100)]

If rdd were a Python dictionary, I could loop over id_list, check whether each id is in the dictionary, and build a new list of (id, count) pairs. But I'm lost on how to do this with an RDD. Please advise.

I could convert the RDD into a dictionary, but that would defeat the purpose of using Spark.
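
For reference, here is roughly what I have in mind for the dictionary case (a plain-Python sketch; counts is a hypothetical dict holding the same data):

counts = {'12': 560, '34': 900, '56': 800, '78': 100, '910': 220, '125': 410, '111': 41}
id_list = ['12', '125', '78']

# Look each id up in the dict, skipping any ids that are missing
new_list = [(i, counts[i]) for i in id_list if i in counts]
# [('12', 560), ('125', 410), ('78', 100)]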


Solution

  • You can filter the RDD with a lambda function that checks whether each element's key is in id_list:

    rdd2 = rdd.filter(lambda x: x[0] in id_list)
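
    A minimal end-to-end sketch, assuming a local SparkContext (variable names are illustrative):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # (id, count) pairs from the question
    rdd = sc.parallelize([('12', 560), ('34', 900), ('56', 800),
                          ('78', 100), ('910', 220), ('125', 410), ('111', 41)])
    id_list = ['12', '125', '78']

    # Keep only the pairs whose key appears in id_list
    rdd2 = rdd.filter(lambda x: x[0] in id_list)

    print(rdd2.collect())
    # [('12', 560), ('78', 100), ('125', 410)] -- order follows the RDD, not id_list

    If id_list is large, build a set first (id_set = set(id_list)) so the membership test inside the lambda is O(1), or ship it to the executors with sc.broadcast and check x[0] in broadcast_ids.value.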