python apache-spark pyspark tuples rdd

Remove empty strings from a tuple RDD


I have an RDD of the form (name, [token1, token2, ...]), where name is the key and the list of tokens is the value. For example: ('Robert', ['hello', 'movie', '', 'cinema']). I would like to remove the empty strings from the values using map.

My attempt was:

new_tuple = tuple.map(lambda x: (x[0], [s for s in x[1] if len(s)>0]))

to obtain ('Robert', ['hello', 'movie', 'cinema']).

But I feel there should be a less redundant way of doing it. Is there one?

After that, I want to remove any items that end up without values (tokens) after the operation above. Would the following work?

final_tuple = new_tuple.filter(lambda x: len(x[1])>0)

Solution

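  • Try this one: a = ('Robert', ['hello', 'movie', '', 'cinema'])

    then a = (a[0], list(filter(None, a[1])))

    filter(None, ...) keeps only the truthy items, so it is a concise way to remove None, False, 0 and '' from a sequence. On a list of strings, that means only the empty strings are dropped.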
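
Applied to the RDD from the question, the same idea might look like the sketch below. The helper names drop_empty and clean are made up for illustration and are not part of PySpark; mapValues and filter are standard RDD methods. It assumes every value is a list of strings.

```python
def drop_empty(tokens):
    """Return the tokens with empty strings removed."""
    # filter(None, ...) keeps only truthy items, so '' is dropped
    # (None, 0 or False would be dropped too if they appeared in the list).
    return list(filter(None, tokens))


def clean(rdd):
    """Strip empty tokens, then drop the keys left with no tokens at all."""
    return (
        rdd.mapValues(drop_empty)              # the key is passed through untouched
           .filter(lambda kv: len(kv[1]) > 0)  # same test as in the question
    )
```

Assuming an existing SparkContext named sc, clean(sc.parallelize([('Robert', ['hello', 'movie', '', 'cinema']), ('Ann', [''])])).collect() should return only the Robert entry, with the empty string removed. Using mapValues avoids rebuilding the (x[0], ...) tuple by hand, which may be the redundancy the question is after, and the filter step proposed in the question works as written.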