I have the following pyspark RDD with Ids and their counts:
rdd = [('12', 560), ('34', 900), ('56', 800), ('78', 100), ('910', 220), ('125', 410), ('111', 41), etc.]
And I have a regular list:
id_list = ['12', '125', '78']
I want a new list of (key, value) pairs: the ids from id_list paired with their counts from the rdd.
So expected output:
new_list = [('12', 560), ('125', 410), ('78', 100)]
If rdd were a Python dictionary, I could loop over id_list, check whether each id is in the dictionary, and build a new list of (key, count) pairs. But I'm lost on how to do this with an RDD. Please advise.
I could potentially try to convert the RDD into a dictionary but that would defeat the purpose of using spark.
You can filter the RDD with a lambda function that checks whether the key is in id_list:
rdd2 = rdd.filter(lambda x: x[0] in id_list)
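The lambda passed to `filter` is an ordinary Python predicate, so its logic can be checked on a plain list without a Spark session. A minimal sketch of the same filtering, using the sample data from the question (the `pairs` list and the set conversion are illustrative, not part of the original code):

```python
# Sample data from the question, as a plain Python list.
pairs = [('12', 560), ('34', 900), ('56', 800), ('78', 100),
         ('910', 220), ('125', 410), ('111', 41)]
id_list = ['12', '125', '78']

# Converting id_list to a set gives O(1) membership tests; with an RDD
# the predicate runs once per element, so this matters for long lists.
id_set = set(id_list)

# Same predicate as rdd.filter(lambda x: x[0] in id_list).
new_list = [x for x in pairs if x[0] in id_set]
print(new_list)  # [('12', 560), ('78', 100), ('125', 410)]
```

Two things to note: `filter` preserves the RDD's element order, so the results come back in rdd order, not id_list order; and if id_list is large, it is common to ship it to the executors once as a broadcast variable (e.g. `b = sc.broadcast(set(id_list))`, then `rdd.filter(lambda x: x[0] in b.value)`) rather than capturing it in the closure.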