I have an RDD containing the values below:
rdd_2 = sc.parallelize([
    ('f3.txt', 'of', 0.0),
    ('f3.txt', 'no', 0.00023241396735284342),
    ('f3.txt', 'may', 0.00042318717429693387),
    ('f3.txt', 'love', 0.00036660747046705975),
    ('f3.txt', 'romantic', 0.00022935755451437367)])
I wish to flag the rows of this RDD by the words ('romantic', 'love')
using a lambda function, so that the third element becomes 1 for those words and 0 otherwise. My resulting output should be:
([('f3.txt', 'of', 0),
  ('f3.txt', 'no', 0),
  ('f3.txt', 'may', 0),
  ('f3.txt', 'love', 1),
  ('f3.txt', 'romantic', 1)])
I have tried the following code, but I get an error:
querylist = ['romantic', 'love']
q = rdd_2.map(lambda x : x[2]=1 if x[1] not in querylist else x[2]=0)
SyntaxError: invalid syntax
What should I do?
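You cannot assign values like that in a lambda function: the body of a lambda must be a single expression, and an assignment such as x[2]=1 is a statement. Tuples are also immutable, so x[2] could not be reassigned even in a regular function. Instead, return a new tuple containing the modified values.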
Try this:
querylist = ['romantic', 'love']
q = rdd_2.map(lambda x : (x[0], x[1], 1 if x[1] in querylist else 0))
Or equivalently:
q = rdd_2.map(lambda x : (x[0], x[1], int(x[1] in querylist)))
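If you want to check the mapping logic without starting Spark, here is a minimal sketch that applies the same kind of lambda to a plain Python list with the built-in map. The names rows, flag and result are only illustrative, and the commented-out Spark line assumes sc is a live SparkContext.

```python
# Plain-Python check of the mapping logic; no Spark installation is needed.
rows = [
    ('f3.txt', 'of', 0.0),
    ('f3.txt', 'no', 0.00023241396735284342),
    ('f3.txt', 'may', 0.00042318717429693387),
    ('f3.txt', 'love', 0.00036660747046705975),
    ('f3.txt', 'romantic', 0.00022935755451437367),
]
querylist = ['romantic', 'love']

# Build a NEW tuple for each row: keep the file name and the word,
# and replace the score with 1 if the word is in querylist, else 0.
flag = lambda x: (x[0], x[1], 1 if x[1] in querylist else 0)

result = list(map(flag, rows))
print(result)

# With Spark (assumption: `sc` is a live SparkContext), the equivalent would be:
# q = sc.parallelize(rows).map(flag).collect()
```

The rows for 'love' and 'romantic' come out with 1 and the rest with 0, which matches the output asked for in the question. For a long query list, a set (for example {'romantic', 'love'}) makes the membership test cheaper than a list.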