How to remove a certain letter from Arabic text in a DataFrame using regular expressions?

I have Arabic text in a DataFrame and I want to remove the letter و from every word that starts with it. I tried this:

def clean(text_string):
    space_pattern = '\bو'
    
    parsed_text = re.sub(space_pattern, '', text_string)
    return parsed_text

and then:

df['tidy_tweet'] = np.vectorize(clean)(df['tidy_tweet'])

but when I run it, nothing changes. It's as if I didn't do anything at all!

Example:

Input: هيه الهزه الحقيقيه وتخافون الهزه وماتخافون الهزه اعملها نظامكم الهمجي

Desired output: هيه الهزه الحقيقيه تخافون الهزه ماتخافون الهزه اعملها نظامكم الهمجي

Solution

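Your pattern '\bو' is not a raw string, so Python interprets \b as a backspace character rather than a regex word boundary, and nothing ever matches. Write the pattern as a raw string (r"...") so that \b keeps its word-boundary meaning. You may use the following regex with word boundaries, and use \1 to keep only the remainder of each word: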

r"\bو(.*?)\b"

import re

text = """هيه الهزه الحقيقيه وتخافون الهزه وماتخافون الهزه اعملها نظامكم الهمجي"""
ref = """هيه الهزه الحقيقيه تخافون الهزه ماتخافون الهزه اعملها نظامكم الهمجي"""
print(text)
print(ref)

new_text = re.sub(r"\bو(.*?)\b", r'\1', text)

print(new_text == ref)

Output:

هيه الهزه الحقيقيه وتخافون الهزه وماتخافون الهزه اعملها نظامكم الهمجي
هيه الهزه الحقيقيه تخافون الهزه ماتخافون الهزه اعملها نظامكم الهمجي
True
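
As a simpler variant, the capture group is not strictly needed: matching just the leading و at a word boundary and replacing it with an empty string should give the same result on this example. The sketch below also shows why the original attempt did nothing. The names LEADING_WAW and strip_leading_waw are hypothetical, chosen here only for illustration.

```python
import re

text = "هيه الهزه الحقيقيه وتخافون الهزه وماتخافون الهزه اعملها نظامكم الهمجي"
expected = "هيه الهزه الحقيقيه تخافون الهزه ماتخافون الهزه اعملها نظامكم الهمجي"

# Without the r prefix, Python converts '\b' into the backspace character
# (U+0008), so the regex searches for "backspace + و" and never matches.
broken_pattern = '\bو'
assert broken_pattern[0] == '\x08'
unchanged = re.sub(broken_pattern, '', text)

# With a raw string, \b reaches the regex engine as a word boundary.
# LEADING_WAW is a hypothetical name used for this sketch.
LEADING_WAW = re.compile(r'\bو')

def strip_leading_waw(text_string):
    """Remove و when it is the first letter of a word."""
    return LEADING_WAW.sub('', text_string)

cleaned = strip_leading_waw(text)

print(unchanged == text)    # the broken pattern left the text untouched
print(cleaned == expected)  # the raw-string pattern produced the desired output
```

Assuming the column is named tidy_tweet as in the question, the same pattern can be applied without np.vectorize via df['tidy_tweet'].str.replace(r'\bو', '', regex=True). Note that either approach also strips و from words where it belongs to the word itself (for example وردة becomes ردة) and deletes a standalone و, so it may need refining depending on the data.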