I am trying to filter some data using the Python core API (RDDs) of Apache Spark, but I keep running into the error below and cannot work out how to fix it for my data:
TypeError: tuple indices must be integers or slices, not str
Now, this is a sample of my data structure:
This is the code I am using to filter my data, but it keeps giving me that error. I am simply trying to return the business_id, city and stars from my dataset.
(my_rdd
.filter(lambda x: x['city']=='Toronto')
.map(lambda x: (x['business_id'], x['city'], x['stars']))
).take(5)
Any guidance on how to filter my data would be helpful.
Thanks.
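Each element of your RDD is a tuple, so x['city'] tries to index a tuple with a string, which raises the TypeError. Since your data is nested in tuples, you need to specify the tuple indices in your filter and map: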
result = (my_rdd
.filter(lambda x: x[1][1]['city']=='Toronto')
.map(lambda x: (x[1][1]['business_id'], x[1][1]['city'], x[1][1]['stars']))
)
print(result.collect())
[('7v91woy8IpLrqXsRvxj_vw', 'Toronto', 3.0)]
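To see why the indices matter, here is a minimal sketch in plain Python (no Spark needed). The record layout is an assumption inferred from the x[1][1] indexing above: a (key, (other_value, attributes_dict)) tuple. The names record, extract, "some_key" and "other_value" are hypothetical, and your real layout may differ, so check it first with my_rdd.take(1).

```python
# Hypothetical record: (key, (other_value, attributes_dict)).
# This shape is assumed from the x[1][1] indexing; verify it against your data.
record = (
    "some_key",
    (
        "other_value",
        {"business_id": "7v91woy8IpLrqXsRvxj_vw", "city": "Toronto", "stars": 3.0},
    ),
)


def extract(rec):
    """Return (business_id, city, stars) from the dict nested at rec[1][1]."""
    attrs = rec[1][1]
    return (attrs["business_id"], attrs["city"], attrs["stars"])


# Indexing the outer tuple with a string reproduces the error from the question.
try:
    record["city"]
except TypeError as exc:
    error_message = str(exc)

# The same filter/map logic as above, applied to a plain list instead of an RDD.
rows = [extract(r) for r in [record] if r[1][1]["city"] == "Toronto"]

print(error_message)
print(rows)
```

With an RDD of the same shape, the helper can be passed straight to map, e.g. my_rdd.filter(lambda r: r[1][1]['city'] == 'Toronto').map(extract), which avoids repeating x[1][1] three times.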