
Remove special characters and numbers from an Arabic text file with Python


I want to keep only Arabic characters and no numbers. I got this regex from GitHub.

    import os
    import re

    generalPath="C:/Users/Desktop/Code/dataset/"
    outputPath= "C:/Users/Desktop/Code/output/"
    files = os.listdir(generalPath)

    for onefile in files:
    # relative or absolute file path, e.g.:
        localPath=generalPath+onefile
        localOutputPath=outputPath+onefile
        print(localPath)
        print(localOutputPath)
        with open(localPath, 'rb') as infile, open(localOutputPath, 'w') as outfile:
            data = infile.read().decode('utf-8')
            new_data = t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD]+', ' ', data)
            outfile.write(new_data)

With this code I get the following error:

    Traceback (most recent call last):
      File ".\cleanText.py", line 23, in <module>
        outfile.write(new_data)
      File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

My Arabic text is diacritized and I want to keep it that way.


Solution

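  • The traceback points at outfile.write: the output file is opened without an explicit encoding, so Python falls back to the Windows default, CP1252, which cannot represent Arabic characters. Specify UTF-8 when opening the files, as shown below. Also, since it's a text file, you can read it with 'r' instead of 'rb', which makes the manual .decode('utf-8') unnecessary.

    with open(localPath, 'r', encoding='utf8') as infile, open(localOutputPath, 'w', encoding='utf8') as outfile:
        data = infile.read()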
    

    As for your regex, note that the 0-9 inside the negated class [^...] means digits are kept, not removed. If you just want to remove numbers, you can use:

    data = re.sub(r'[0-9]+', '', data)
    

    You don't need to specify the whole Arabic alphabet as characters to keep. But it looks like you have strings like "(1/6)." To get rid of all parentheses and slashes as well, use:

    data = re.sub(r'[0-9\(\)/]+', '', data)
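
Putting the two points of the answer together, a minimal sketch could look like the following. The names `UNWANTED`, `clean_text` and `clean_folder` are hypothetical and not from the original post. It also assumes the input files are valid UTF-8, and it swaps `[0-9]` for `\d`, which for `str` patterns in Python 3 matches Arabic-Indic digits (٠-٩) as well as ASCII ones.

```python
import re
from pathlib import Path

# Hypothetical pattern: \d matches any Unicode decimal digit for str patterns,
# so it covers ASCII 0-9 and Arabic-Indic digits (U+0660-U+0669).
# Parentheses and slashes are removed too, as in the answer above.
UNWANTED = re.compile(r"[\d()/]+")


def clean_text(text: str) -> str:
    """Remove digits, parentheses and slashes; leave letters and diacritics intact."""
    return UNWANTED.sub("", text)


def clean_folder(input_dir: str, output_dir: str) -> None:
    """Clean every file in input_dir and write the result to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(input_dir).iterdir():
        if not path.is_file():
            continue
        # Explicit UTF-8 on both read and write, so the platform default
        # encoding (CP1252 in the traceback) is never used.
        text = path.read_text(encoding="utf-8")
        (out / path.name).write_text(clean_text(text), encoding="utf-8")
```

With the paths from the question, this would be called as `clean_folder("C:/Users/Desktop/Code/dataset/", "C:/Users/Desktop/Code/output/")`. Because the pattern only deletes what is unwanted rather than whitelisting Arabic ranges, diacritics are left in place.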