Tags: python, pandas, dataframe, punctuation

How to create your own list of punctuation to remove in Python


I want to remove punctuation from text in different scripts (English, Arabic, and so on). The usual pandas approach works fine for the English part, but when the script changes it removes all punctuation and anything else that is not a letter, which I don't want. Is there a way to use the same str.replace method with my own list of punctuation characters?

I'm currently using this, which removes all punctuation:

dataframe['columnname'].str.replace(r'[^\w\s]', '', regex=True)

But when I try to pass my own list of characters instead, it does not work. Is there a way to create my own list, something like this?

dataframe['columnname'].str.replace(',,?, !, .,:, ;', '')

so that only the characters , ? ! . : ; ' are removed and everything else stays.


Solution

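  • Put the characters in a regex character class [...]. Escaping special regex characters such as . or ? keeps them literal. Pass regex=True, because pandas 2.0 and later treat the pattern as a literal string by default:

    dataframe['columnname'] = dataframe['columnname'].str.replace(r"[,\?!\.:;']", '', regex=True)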
    

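    Or use re.escape to build the pattern from a plain string of characters:

    import re

    pat = '[' + re.escape(",?!.:;'") + ']'
    print(pat)
    # [,\?!\.:;']

    # regex=True is required in pandas 2.0+, where the default is a literal match
    dataframe['columnname'] = dataframe['columnname'].str.replace(pat, '', regex=True)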
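Since the question involves mixed scripts, it may help to check the pattern on plain strings before applying it to the DataFrame. The following is a minimal sketch using only the standard library. The helper name `strip_punctuation`, the sample strings, and the three Arabic punctuation marks added to the list are assumptions for illustration and do not come from the original post.

```python
import re

# Characters to strip. The three Arabic marks (comma U+060C, semicolon U+061B,
# question mark U+061F) are an assumption added for illustration; edit the
# list to match your own data.
PUNCTUATION = ",?!.:;'" + "\u060C\u061B\u061F"

# re.escape makes every listed character literal, so the list can be changed
# without worrying about regex metacharacters.
pat = "[" + re.escape(PUNCTUATION) + "]"


def strip_punctuation(text):
    """Remove only the characters listed in PUNCTUATION (hypothetical helper)."""
    return re.sub(pat, "", text)


samples = [
    "Hello, world! How are you?",
    "it's 50% done; ok.",
    "مرحبا، كيف حالك؟",  # Arabic sample containing U+060C and U+061F
]
for s in samples:
    print(strip_punctuation(s))
# The first two lines print:
# Hello world How are you
# its 50% done ok
```

Characters outside the list, such as `%`, are left alone. The same `pat` string should work unchanged with `dataframe['columnname'].str.replace(pat, '', regex=True)`.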