Given
[('Project', 10),
("Alice's", 11),
('in', 401),
('Wonderland,', 3),
('Lewis', 10),
('Carroll', 4),
('', 2238),
('is', 10),
('use', 24),
('of', 596),
('anyone', 4),
('anywhere', 3),
in which the value of the paired RDD is the word frequency.
I would only like to return the words that appear 10 times. Expected output
[('Project', 10),
('Lewis', 10),
('is', 10)]
I tried using
rdd.filter(lambda words: (words,10)).collect()
But it still shows the entire list. How should I go about this?
Your lambda function is wrong; It should be
rdd.filter(lambda words: words[1] == 10).collect()
For example,
my_rdd = sc.parallelize([('Project', 10), ("Alice's", 11), ('in', 401), ('Wonderland,', 3), ('Lewis', 10)], ('is', 10)]
>>> my_rdd.filter(lambda w: w[1] == 10).collect()
[('Project', 10), ('Lewis', 10), ('is', 10)]