Search code examples
pythonapache-sparkpysparkrdd

Remove RDD values with condition


I have an RDD like this:

[ (Person 1, [Cat, Dog, Cow]), (Person 2, [Cat]), (Person 3,[Cow, Chicken])]

And I have a list of frequent animals:

freq_animals=[Cat, Dog]

I want to delete in my RDD the values for each person that are not in the list of frequent animals i.e. Output would be:

[ (Person 1, [Cat, Dog]), (Person 2, [Cat]), (Person 3,[])]

Any idea how I could change my RDD? Thank you!


Solution

  • You can do mapValues using a list comprehension:

    rdd = sc.parallelize([("Person 1", ["Cat", "Dog", "Cow"]), ("Person 2", ["Cat"]), ("Person 3", ["Cow", "Chicken"])])
    
    freq_animals = ["Cat", "Dog"]
    
    rdd2 = rdd.mapValues(lambda v: [i for i in v if i in freq_animals])
    
    print(rdd2.collect())
    # [('Person 1', ['Cat', 'Dog']), ('Person 2', ['Cat']), ('Person 3', [])]