Tags: python, apache-spark, hadoop, pyspark, rdd

TypeError: tuple indices must be integers or slices, not str using Python Core API?


I am trying to filter some data with the Spark Core API in Python (PySpark), but I keep running into this error and can't work out how it relates to my data:

TypeError: tuple indices must be integers or slices, not str

Now, this is a sample of my data structure:

This is the code I am using to filter my data, but it keeps raising that error. I simply want to return the business_id, city and stars from my dataset.

(my_rdd
    .filter(lambda x: x['city']=='Toronto')
    .map(lambda x: (x['business_id'], x['city'], x['stars']))
).take(5)

Any guidance on how to filter my data would be helpful.

Thanks.


Solution

  • Since your records are nested tuples, you need to index into the tuple before using string keys in your filter and map:

    result = (my_rdd
        .filter(lambda x: x[1][1]['city']=='Toronto')
        .map(lambda x: (x[1][1]['business_id'], x[1][1]['city'], x[1][1]['stars']))
    )
    
    print(result.collect())
    [('7v91woy8IpLrqXsRvxj_vw', 'Toronto', 3.0)]
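
If you want to check the indexing logic without a Spark session, the same access pattern can be verified in plain Python. This is a minimal sketch that assumes the record shape implied by the solution, `(key, (index, dict))`, with the inner dict at `x[1][1]`; the second record below is made up for illustration:

```python
# Assumed record shape (from the solution's x[1][1] indexing):
# each element is (key, (index, {...})). The "Montreal" record is invented.
records = [
    ("k1", (0, {"business_id": "7v91woy8IpLrqXsRvxj_vw", "city": "Toronto", "stars": 3.0})),
    ("k2", (1, {"business_id": "abc123", "city": "Montreal", "stars": 4.5})),
]

# x["city"] would raise the original error, because x is a tuple, not a dict:
# TypeError: tuple indices must be integers or slices, not str
result = [
    (x[1][1]["business_id"], x[1][1]["city"], x[1][1]["stars"])
    for x in records
    if x[1][1]["city"] == "Toronto"
]

print(result)  # [('7v91woy8IpLrqXsRvxj_vw', 'Toronto', 3.0)]
```

The RDD version behaves the same way: `filter` and `map` each receive one whole record as `x`, so the tuple layers have to be unwrapped with integer indices before the dict keys become reachable.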