Tags: python, apache-spark, pyspark, rdd

Is there a trim() function for RDDs?


I know you can use trim on DataFrames to remove leading and trailing whitespace. Is there a similar function for RDDs? If not, how would you do this?


Edit: Added some code:

nonNullRDD = marchRDD.filter(lambda row: row.title).filter(lambda row: row.authors)
titleRDD = nonNullRDD.map(lambda field: (field.title, field.authors))
splitRDD = titleRDD.flatMap(lambda field: [(field[0], z) for z in field[1].split(";")])
authorRDD = splitRDD.map(lambda field: [field[1], 1])
test = authorRDD.flatMap(lambda word: word.strip())

Solution

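  • RDDs don't have string functions of their own. They hold ordinary Python objects, so you call Python string methods on each element inside map.

    I believe you're looking for Python's str.strip():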

    trimmed_words = words.map(lambda word: word.strip())
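
Applied to the pipeline in the question, the strip has to target the author string itself. In authorRDD each element is a list like [author, 1], so calling word.strip() on it would raise an AttributeError, and flatMap over a plain string would break it into single characters. A minimal sketch of one way to do it, stripping at the point where the authors are split: the helper name split_and_strip_authors and the sample record are hypothetical, and stripping the title and skipping empty names are assumptions that go beyond what the question asks. A local list stands in for the RDD so the example runs without Spark.

```python
# Hypothetical helper: clean one (title, authors) record in plain Python.
def split_and_strip_authors(record):
    title, authors = record
    # Split on ";" and strip each name; skip empty pieces left by a trailing ";".
    return [(title.strip(), name.strip())
            for name in authors.split(";") if name.strip()]

# A local list stands in for the RDD so the sketch runs without Spark.
sample_records = [("  Spark Basics ", " Ann Lee ; Bob Ray;")]
pairs = [pair for record in sample_records
         for pair in split_and_strip_authors(record)]
print(pairs)

# With Spark, the same function would be passed to flatMap, e.g.:
#   splitRDD = titleRDD.flatMap(split_and_strip_authors)
#   authorRDD = splitRDD.map(lambda pair: (pair[1], 1))
```

Because the function returns a list of pairs per record, flatMap is the right fit here. Use map when each input element produces exactly one output element, as in the one-liner above.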