
PySpark: how to remove punctuation marks and make letters lowercase in an RDD?


I would like to remove punctuation marks and make the letters lowercase in an RDD. Below is my data set:

 l = sc.parallelize(["How are you", "Hello\ then% you", "I think he's fine+ COMING"])

I tried the function below, but I got an error message:

punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

def lower_clean_str(x):
    lowercased_str = x.lower()
    clean_str = lowercased_str.translate(punc) 
    return clean_str

one_RDD = l.flatMap(lambda x: lower_clean_str(x).split())
one_RDD.collect()

But this gives me an error. What might be the problem, and how can I fix it? Thank you.


Solution

  • You are using the Python translate function incorrectly. In Python 2, str.translate accepted a 256-character mapping table (or None plus a string of characters to delete), while in Python 3 it expects a translation table built with str.maketrans; passing a plain string of punctuation characters, as your code does, raises an error. As I am not sure whether you are on Python 2.7 or Python 3, I am suggesting an alternate approach.

    The following code will work irrespective of the Python version.

    def lower_clean_str(x):
        punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
        lowercased_str = x.lower()
        # Strip each punctuation character one at a time.
        for ch in punc:
            lowercased_str = lowercased_str.replace(ch, '')
        return lowercased_str

    l = sc.parallelize(["How are you", "Hello\ then% you", "I think he's fine+ COMING"])
    one_RDD = l.map(lower_clean_str)
    one_RDD.collect()
    

    Output:

    ['how are you', 'hello then you', 'i think hes fine coming']
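    For completeness, if the cluster runs Python 3, the original translate call can be fixed directly by building a translation table with str.maketrans, whose third argument lists the characters to delete. This is a sketch of that alternative (string.punctuation happens to contain exactly the same characters as the punc string above):

    ```python
    import string

    # Python 3 only: str.maketrans('', '', chars) maps every character
    # in `chars` to None, so translate() deletes them in one pass.
    punc_table = str.maketrans('', '', string.punctuation)

    def lower_clean_str(x):
        return x.lower().translate(punc_table)

    print(lower_clean_str("I think he's fine+ COMING"))  # i think hes fine coming
    ```

    If you still want individual word tokens, as in the question, keep the flatMap: one_RDD = l.flatMap(lambda x: lower_clean_str(x).split()).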