Tags: python, regex, ascii, python-re, non-ascii-characters

Why Does re.sub() Not Work in Python 3.6?


I'm working on a project where I have to read data from an Excel spreadsheet. I'm using Python.

I noticed that when I use re.sub(), the characters in the original string are not replaced. When I use str.replace() instead, the characters do get replaced.

I'm wondering if I'm doing something wrong. Could anyone please check this on your end?

Technical Details:

This is what I originally had:

string = re.sub(u'([\u2000-\u206f])', " ", string)
string = re.sub(u'(\u00a0)', " ", string)

string = string.replace("‰", " ") #\u0089
string = string.replace("¤", " ") #\u00a4

Following "chepner"'s advice, I changed the logic to the following:

replacementDict = {}
replacementDict.update(dict.fromkeys(map(chr, range(0x2000, 0x206f)), " "))
replacementDict['\u00a0'] = " "
replacementDict['\u0089'] = " "
replacementDict['\u00a4'] = " "

string = string.translate(replacementDict)

But I'm still not able to remove the illegal characters from the string.

You can download the script and a sample test here:

Steps to reproduce the issue:

  • Run the script as-is (I removed the need to pass parameters to it). You will notice that the lines that did not match are the ones with illegal characters.

Solution

  • I would replace all this with a single call to str.translate, since you are only making single-character-to-single-character replacements.

    You'll just need to define a single dict (which you can reuse for every call to str.translate) that maps each character to its replacement. Characters that stay the same do not need to be added to the mapping.

    replacements = {}
    # Keys are integer code points, which is what str.translate looks up.
    replacements.update(dict.fromkeys(range(0x2000, 0x2070), " "))
    replacements[0x1680] = ' '  # U+1680 OGHAM SPACE MARK
    # etc
    
    string = string.translate(replacements)
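
One detail may explain why the dict in the question had no effect: str.translate looks up each character by its integer code point, so a table keyed by one-character strings (as produced by map(chr, ...) or by writing '\u00a0' as a key) is silently ignored. The sketch below illustrates the difference; the sample text and variable names are made up for the demonstration.

```python
# Hypothetical sample: "a", NO-BREAK SPACE, "b", EM SPACE, "c".
text = "a\u00a0b\u2003c"

# Keys are one-character strings. translate() looks up integers,
# so none of these keys is ever found and the text is returned unchanged.
by_char = {"\u00a0": " ", "\u2003": " "}
unchanged = text.translate(by_char)

# Keys are integer code points. These are found and replaced.
by_ordinal = {0x00A0: " ", 0x2003: " "}
cleaned = text.translate(by_ordinal)

print(repr(unchanged))
print(repr(cleaned))
```

Note also that "‰" is U+2030 (it is byte 0x89 only in Windows-1252), so a key of '\u0089' or 0x0089 would not match it in any case.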
    

    You can also use str.maketrans to construct an appropriate translation table from a char-to-char mapping.
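
As a minimal sketch of the str.maketrans route: when given a single dict, it accepts one-character strings as keys and converts them to the integer code points that str.translate expects. The particular characters chosen and the clean helper are assumptions for illustration, not part of the original script.

```python
# Build the table from a char-to-char dict; maketrans converts the
# string keys to integer code points.
table = str.maketrans({
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u00a4": " ",  # CURRENCY SIGN
    "\u2030": " ",  # PER MILLE SIGN
})

# The result is an ordinary dict, so the U+2000..U+206F block can be
# added with integer keys as in the answer above.
table.update(dict.fromkeys(range(0x2000, 0x2070), " "))


def clean(text):
    """Replace every character listed in the table with a space."""
    return text.translate(table)


print(repr(clean("10\u2030\u00a0x\u2009y")))
```

Building the table once at module level and reusing it keeps each call to clean down to a single pass over the string.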