I want to remove punctuation from text in several scripts (English, Arabic, and so on) stored in a pandas DataFrame. The usual approach works fine for English, but when the script changes it removes all punctuation plus anything else that is not a letter, which I don't want. Is there a way to use the same str.replace method with my own list of punctuation characters?
I'm currently using this, which removes all punctuation:
dataframe['columnname'].str.replace(r'[^\w\s]', '', regex=True)
But when I try to pass my own list of characters instead, it does not work. Is there a way to supply my own list, something like this?
dataframe['columnname'].str.replace(',,?, !, .,:, ;', '')
so that only the characters , ? ! . : ; ' are removed and everything outside this list stays.
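Put the characters in a regex character class ([...]). Special regex characters such as . or ? are escaped here; inside a character class the escape is optional but harmless: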
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
dataframe['columnname'] = dataframe['columnname'].str.replace(r"[,\?!\.:;']", '', regex=True)
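As a quick check of the character-class approach, here is a minimal sketch on a made-up Series (the sample strings are assumptions, not data from the question). It shows that only the listed characters are removed, so the Arabic comma and question mark in the second row are left alone:

```python
import pandas as pd

# Hypothetical sample data: English, Arabic, and a punctuation-only mix.
s = pd.Series([
    "Hello, world! How's it going?",
    "\u0645\u0631\u062d\u0628\u0627\u060c \u0643\u064a\u0641 \u062d\u0627\u0644\u0643\u061f",  # Arabic text with U+060C and U+061F
    "a.b:c;d",
])

# Remove only , ? ! . : ; ' and keep every other character.
cleaned = s.str.replace(r"[,\?!\.:;']", "", regex=True)

print(cleaned.tolist())
```

To strip the Arabic marks as well, they would have to be added to the class explicitly.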
Or use re.escape to build the pattern:
import re
pat = '[' + re.escape(",?!.:;'") + ']'
print(pat)
[,\?!\.:;']
dataframe['columnname'] = dataframe['columnname'].str.replace(pat, '', regex=True)
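The re.escape approach extends naturally to a custom list that mixes scripts. The sketch below is only an illustration: the helper name build_punct_pattern and the choice of Arabic marks (comma U+060C, semicolon U+061B, question mark U+061F) are assumptions, not part of the original answer. It tests the pattern with re.sub so it can be tried without a DataFrame:

```python
import re

def build_punct_pattern(chars):
    """Return a regex character class matching any single character in chars.

    Hypothetical helper for illustration only.
    """
    if not chars:
        raise ValueError("chars must contain at least one character")
    # re.escape also handles ], ^, - and \, which are special inside [...]
    return "[" + re.escape(chars) + "]"

# Assumed custom list: ASCII marks plus Arabic comma, semicolon, question mark.
to_remove = ",?!.:;'" + "\u060c\u061b\u061f"
pat = build_punct_pattern(to_remove)

sample = "\u0645\u0631\u062d\u0628\u0627\u060c Fine, thanks!"
print(re.sub(pat, "", sample))
```

The same pat can then be passed to dataframe['columnname'].str.replace(pat, '', regex=True).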