I want to make my PySpark code remove the punctuation from a DataFrame column. My code looks like this:
def split(x):
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    x = x.replace(punc, ' ')
    return x
Result:
+----------+
|        id|
+----------+
|187.080/B1|
+----------+
It's supposed to remove all the punctuation, but I'm not sure what I should edit to make it work.
First of all, you need to register your function as a UDF to use it that way. Also, the replace call is not working because it tries to match the entire punc string, which never appears in your value. You can use regular expressions or iterate over the punc string, replacing each character (I think the second method is faster):
def split(value):
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # Replace each punctuation character individually
    for ch in punc:
        value = value.replace(ch, ' ')
    # Strip the spaces (note: this also removes any pre-existing spaces)
    value = value.replace(' ', '')
    return value
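For completeness, here is a minimal sketch of the registration step; it assumes split is the function above, df is your DataFrame, and id is the column name from the question:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap the plain Python function as a UDF that returns a string
split_udf = udf(split, StringType())

# Apply the UDF to the 'id' column, overwriting it with the cleaned value
df = df.withColumn('id', split_udf(df['id']))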
As a performance note, always check whether a similar function is already implemented in the pyspark module (pyspark.sql.functions), because built-in functions are usually much faster than UDFs.
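For instance, a built-in alternative (a sketch, not part of the original answer; the character class below is an assumption about which characters you want to keep) could use regexp_replace:

from pyspark.sql.functions import regexp_replace

# Drop every character that is not a letter or a digit, entirely inside the
# JVM, avoiding the serialization overhead of a Python UDF
df = df.withColumn('id', regexp_replace(df['id'], r'[^a-zA-Z0-9]', ''))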