Search code examples
unicodeutf-8python-unicodeunicode-normalizationunicode-literals

Python, Unicode: Working with badly formatted strings


I'm working on a Django app that gets data from a Redshift database. The people who wrote to this Redshift database made some bizarre choices. I'm trying to figure out how handle the fact that it has strings like this:

Let’s play! \xe2\x9a\xbd\xef\xb8\x8f

With psycopg2 I'm getting these into Python as

u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f'

I'm supposed to render this as

Let’s play! ⚽️

They're stored in the Redshift table as varchar(65535). I'm using pyscopg2 to pull them into the app backend (Django 1.11, Python 2.7). They are being rendered with JavaScript on the front end.

Any suggestions would be appreciated -- I can't figure out any solution out at all.


Solution

  • Because there is a mix of encodings (what looks like Windows-1252 and hex-escaped UTF-8) you could try to post-process the entries if this mix is consistent.

    Below is a regular expression to find sequences of hex escapes and replace them with a UTF-8 decoded equivalent:

    import re
    
    esc = re.compile(r'(?:\\x[0-9a-f]{2})+')
    
    def fixer(m):
        return m.group().encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
    
    s = u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f'
    f = esc.sub(fixer,s)
    print repr(f)
    print f
    

    Output:

    u'Let\u2019s play! \u26bd\ufe0f'
    Let’s play! ⚽️