I'm working on a Django app that gets data from a Redshift database. The people who wrote to this Redshift database made some bizarre choices. I'm trying to figure out how handle the fact that it has strings like this:
Let’s play! \xe2\x9a\xbd\xef\xb8\x8f
With psycopg2 I'm getting these into Python as
u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f'
I'm supposed to render this as
Let’s play! ⚽️
They're stored in the Redshift table as varchar(65535)
. I'm using pyscopg2 to pull them into the app backend (Django 1.11, Python 2.7). They are being rendered with JavaScript on the front end.
Any suggestions would be appreciated -- I can't figure out any solution out at all.
Because there is a mix of encodings (what looks like Windows-1252 and hex-escaped UTF-8) you could try to post-process the entries if this mix is consistent.
Below is a regular expression to find sequences of hex escapes and replace them with a UTF-8 decoded equivalent:
import re
esc = re.compile(r'(?:\\x[0-9a-f]{2})+')
def fixer(m):
return m.group().encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
s = u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f'
f = esc.sub(fixer,s)
print repr(f)
print f
Output:
u'Let\u2019s play! \u26bd\ufe0f'
Let’s play! ⚽️