I'm working on a project where I have to read data from an Excel spreadsheet. I'm using Python.
I noticed that when I use "re.sub()" the characters in the original string are not replaced, whereas "string.replace()" does replace them.
I'm wondering if I'm doing something wrong. Could anyone please check this on your end?
Technical Details:
This is what I originally had:
string = re.sub(u'([\u2000-\u206f])', " ", string)
string = re.sub(u'(\u00a0)', " ", string)
string = string.replace("‰", " ") # this character is U+2030 (byte 0x89 in Windows-1252), not \u0089
string = string.replace("¤", " ") #\u00a4
Following "chepner"'s advice, I changed the logic to the following:
replacementDict = {}
replacementDict.update(dict.fromkeys(map(chr, range(0x2000, 0x206f)), " "))
replacementDict['\u00a0'] = " "
replacementDict['\u0089'] = " "
replacementDict['\u00a4'] = " "
string = string.translate(replacementDict)
But I'm still not able to remove the illegal characters from the string.
You can download the script and a sample test here:
Steps to reproduce the issue:
I would replace all this with a single call to str.translate, since you are only making single-character-to-single-character replacements.
You'll just need to define a single dict (that you can reuse for every call to str.translate) that maps each character's code point to its replacement. Characters that stay the same do not need to be added to the mapping.
replacements = {}
replacements.update(dict.fromkeys(range(0x2000, 0x2070), " "))
replacements[0x1680] = ' '
# etc
string = string.translate(replacements)
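One detail that may explain why a dict built with map(chr, ...) has no effect: str.translate looks characters up by their Unicode ordinal (an int), so entries keyed by one-character strings are never found and the text comes back unchanged. A minimal sketch of the difference follows; the sample text and variable names are made up for illustration.

```python
# Hypothetical sample: a no-break space (U+00A0) and an en quad (U+2000).
sample = "a\u00a0b\u2000c"

# Keys are one-character strings: str.translate never finds them,
# because it indexes the table with ord(character).
str_keyed = {"\u00a0": " ", "\u2000": " "}
unchanged = sample.translate(str_keyed)

# Keys are code points (ints): the replacements are applied.
int_keyed = {0x00A0: " ", 0x2000: " "}
cleaned = sample.translate(int_keyed)

print(unchanged == sample)  # True
print(cleaned)              # a b c
```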
You can also use str.maketrans to construct an appropriate translation table from a char-to-char mapping.