I'd like to remove some characters from a string (either a byte string or a Unicode string) using a regular expression like this:
pattern = re.compile(ur'\u00AE|\u2122', re.UNICODE)
If the characters are specified as Unicode literals, the resulting regexp does not work properly on byte strings.
q = 'Canon\xc2\xae EOS 7D'
pattern.sub('', q) # 'Canon\xc2 EOS 7D'
If I convert the argument of the substitution to a Unicode string, however, it works as expected:
pattern.sub('', unicode(q)) # u'Canon EOS 7D'
Can someone please explain to me why this is the case?
thanks,
Peter
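Because a standard (byte) string is not a Unicode string. Python does not know what encoding it is in (or whether it is text at all), so it cannot tell whether a particular Unicode character corresponds to some sequence of bytes in it. In Python 2 the pattern is instead compared against the string one byte at a time: in UTF-8 the ® sign is the two bytes \xc2\xae, and only the \xae byte matches \u00AE, which is why the stray \xc2 is left behind. The solution is to give Python a Unicode string by decoding the bytes with their actual encoding, for example unicode(q, 'utf-8') or q.decode('utf-8'). A bare unicode(q) uses the default encoding, normally ASCII, so it only works on this input if the default has been changed to UTF-8; otherwise it raises UnicodeDecodeError.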
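Here is a minimal sketch of that approach, written so that it should run on both Python 2.7 and Python 3. The helper name strip_marks and the UTF-8 default are assumptions for illustration, not something from your code; pass whatever encoding your bytes are really in.

```python
import re

# Matches the registered sign (U+00AE) and the trademark sign (U+2122).
PATTERN = re.compile(u'[\u00AE\u2122]', re.UNICODE)


def strip_marks(text, encoding='utf-8'):
    """Hypothetical helper: remove the two marks from a byte or Unicode string."""
    # Decode byte strings first, so the pattern is compared against
    # characters rather than individual bytes. The encoding is an assumption.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return PATTERN.sub(u'', text)


q = b'Canon\xc2\xae EOS 7D'
print(repr(strip_marks(q)))
```

The function always returns a Unicode string, so encode the result again if the rest of your code needs bytes.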