Search code examples
pythonregexunicodebackslash

Python: Replacing certain Unicode entities with entities from dictionary


I have read already much about the problem of backslash escaping in python strings (and backslash recognition in Python in different encodings) and using backslashes in regular expressions, but still cannot solve my problem. I would highly appreciate any help (links, code examples, etc).

I am trying to replace hexadecimal codes in strings with certain elements from dictionary using re. The codes are of type '\uhhhh' where hhhh is hex number.

I select strings from sqlite3 table; by default they are read as unicode and not as "raw" unicode strings.

import re
pattern_xml = re.compile(r"""
(.*?)                       
([\\]u[0-9a-fA-F]{4})
(.*?)                           
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)
uni_code=['201C','201D']
decoded=['"','"']
def repl_xml(m):
    item=m.group(2)
    try: decodeditem=decoded[uni_code.index(item.lstrip('\u').upper())]
    except: decodeditem=item
    return m.group(1) + "".join(decodeditem) + m.group(3)

#input        
text = u'Try \u201cquotated text should be here\u201d try'
#text after replacement
decoded_text=pattern_xml.subn(repl_xml,text)[0]
#desired outcome
desired_text=u'Try "quotated text should be here" try'

So, I want _decoded_text_ to be equal to _desired_text_.

I did not succeed in replacing single backslash with double backslash or forcing python to treat text as raw unicode string (so that backslashes are treated literally and not like escape characters). I also tried using re.escape(text) and setting re.UNICODE, but in my case that doesn't help.
I'm using Python 2.7.2.

Which solutions can be found to this problem?

Edit:
I have actually found out a possible solution to this problem on StandardEncodings and PythonUnicodeIntegration by applying the following encoding to input:

text.encode('unicode_escape')

Is there anything else to do?


Solution

  • The sample text doesn't contain any backslashes. The \u201c is just a way of representing a unicode character:

    >>> text = u'Try \u201cquotated text should be here\u201d try'
    >>> '\\' in text
    False
    >>> print text
    Try “quotated text should be here” try
    

    A regex is not really required here. Just translate the target unicode characters as desired:

    >>> table = {0x201c: u'"', 0x201d: u'"'}
    >>> text.translate(table)
    u'Try "quotated text should be here" try'