pandas, parsing, str-replace, bengali

pandas - Remove a particular character as well as the previous and subsequent characters


I have transliterated Bengali text into Latin (English) letters. After parsing, some stray characters are left over, and I want to remove them. My data frame looks like this:

col1        
utto্tor        
dokkho্shin     
muuns্si    

I want to remove each stray character together with the characters immediately before and after it. For example, in the first row I want to remove the ্ character as well as the o and t adjacent to it.

My desired output looks like this:

col1            col2
utto্tor        uttor
dokkho্shin     dokkhhin
muuns্si        muuni

P.S. I got these characters by using the Avro parser, as shown below:

reversed_text = avro.reverse("উত্তর")
print(reversed_text)

output: utto্tor

col0        col1
উত্তর       utto্tor
দক্ষিণ      dokkho্shin
মুন্সী         muuns্si

Solution

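  • You can use str.replace with a regular expression that matches each non-ASCII character together with the one character before and the one character after it, and replaces the whole match with an empty string: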

    df['col2'] = df['col1'].str.replace(r'.[^\x00-\x7F].', '', regex=True)
    

    output:

             col1      col2
    0     utto্tor     uttor
    1  dokkho্shin  dokkhhin
    2     muuns্si     muuni
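
To see what the pattern does outside of pandas, here is a minimal sketch using only the standard `re` module. It assumes the stray character is the Bengali virama (hasanta, U+09CD), which is what the sample data appears to contain. The names `NON_ASCII`, `VIRAMA_ONLY`, and `edge` are made up for this illustration. The second pattern is a narrower variant, not part of the answer above: it targets only the virama and makes both neighbours optional, so a stray character at the very start or end of a string is still removed.

```python
import re

# Assumption: the stray character is the Bengali virama (hasanta), U+09CD.
VIRAMA = "\u09cd"

# Pattern from the answer: any non-ASCII character plus exactly one
# neighbouring character on each side.
NON_ASCII = re.compile(r".[^\x00-\x7F].")

# Hypothetical narrower variant: only the virama, with optional neighbours.
VIRAMA_ONLY = re.compile(rf".?{VIRAMA}.?")

words = [f"utto{VIRAMA}tor", f"dokkho{VIRAMA}shin", f"muuns{VIRAMA}si"]

cleaned = [NON_ASCII.sub("", w) for w in words]
print(cleaned)  # ['uttor', 'dokkhhin', 'muuni']

cleaned_narrow = [VIRAMA_ONLY.sub("", w) for w in words]
print(cleaned_narrow)  # same result on this sample

# Edge case: a virama with no character before it.
edge = f"{VIRAMA}abc"
print(NON_ASCII.sub("", edge) == edge)  # True: the answer's pattern finds no match
print(VIRAMA_ONLY.sub("", edge))        # 'bc'
```

Either pattern string can be passed to `df['col1'].str.replace(..., '', regex=True)` in the same way as in the answer. Which one fits depends on whether other non-ASCII characters can occur in the column and whether they should be stripped too.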