Tags: python, apache-spark, filter, pyspark, rdd

Filter RDD of key/value pairs based on value equality in PySpark


Given

[('Project', 10),
 ("Alice's", 11),
 ('in', 401),
 ('Wonderland,', 3),
 ('Lewis', 10),
 ('Carroll', 4),
 ('', 2238),
 ('is', 10),
 ('use', 24),
 ('of', 596),
 ('anyone', 4),
 ('anywhere', 3),
 ...]

where the value in each pair is the word's frequency.

I would like to return only the words that appear 10 times. Expected output:

 [('Project', 10),
  ('Lewis', 10),
  ('is', 10)]

I tried using

rdd.filter(lambda words: (words,10)).collect()

But it still shows the entire list. How should I go about this?


Solution

  • Your lambda function is wrong; it should be:

    rdd.filter(lambda words: words[1] == 10).collect()
    

    For example,

    my_rdd = sc.parallelize([('Project', 10), ("Alice's", 11), ('in', 401), ('Wonderland,', 3), ('Lewis', 10), ('is', 10)])
    
    >>> my_rdd.filter(lambda w: w[1] == 10).collect()
    [('Project', 10), ('Lewis', 10), ('is', 10)]
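
    As for why your original attempt returned the entire list: a lambda that returns the tuple `(words, 10)` is always truthy (any non-empty tuple is), so `filter` keeps every element instead of testing the count. The same behavior can be reproduced with plain Python's built-in `filter`, no Spark required:

    ```python
    data = [('Project', 10), ("Alice's", 11), ('in', 401),
            ('Wonderland,', 3), ('Lewis', 10), ('is', 10)]

    # The original predicate returns the tuple (words, 10), which is
    # always truthy, so filter() keeps every element unchanged.
    kept_all = list(filter(lambda words: (words, 10), data))
    print(kept_all == data)   # True: nothing was filtered out

    # The corrected predicate compares the value (index 1) against 10.
    tens = list(filter(lambda words: words[1] == 10, data))
    print(tens)               # [('Project', 10), ('Lewis', 10), ('is', 10)]
    ```

    RDD.filter follows the same rule as Python's filter: an element is kept whenever the predicate's return value is truthy, so the predicate must be an actual boolean test on the value.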