I am trying to print a unicode string without the hex escape sequences in it. I'm grabbing this data from Facebook, which declares an encoding of UTF-8 in its HTML headers. When I print the type, it says it's unicode, but when I try to decode it with unicode-escape, I get an encoding error. Why is it trying to encode when I use the decode method?
Code
a='really long string of unicode html text that i wont reprint'
print type(a)
>>> <type 'unicode'>
print a.decode('unicode-escape')
>>> Traceback (most recent call last):
File "scfbp.py", line 203, in myFunctionPage
print a.decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 1945: ordinal not in range(128)
It's not the decode that's failing. The error comes from trying to display the result on the console. When you use print, it encodes the string using the default encoding, which is ASCII. Don't use print and it should work.
>>> a=u'really long string containing \\u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
u'really long string containing \u20ac and some other text'
>>> print a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
I'd recommend using IDLE or some other interpreter that can output Unicode; then you won't get this problem.
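If switching interpreters isn't convenient, one common workaround is to encode the decoded text yourself before writing it out, so that print never has to fall back to ASCII. The following is a minimal sketch rather than a definitive fix: the helper names decode_escapes and encode_for_output are made up for illustration, and UTF-8 is only an assumed fallback for when the console reports no encoding. It encodes to ASCII explicitly before decoding, so the same lines should behave the same on Python 2 and Python 3.

```python
import sys

def decode_escapes(text):
    # Hypothetical helper: turn literal \uXXXX sequences into real characters.
    # Encoding to ASCII explicitly first avoids relying on any implicit
    # conversion; it raises UnicodeEncodeError if text already contains
    # non-ASCII characters.
    return text.encode('ascii').decode('unicode-escape')

def encode_for_output(text, encoding=None):
    # Hypothetical helper: use the console encoding if one is reported,
    # otherwise assume UTF-8, and replace anything that cannot be encoded.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')

a = u'really long string containing \\u20ac and some other text'
decoded = decode_escapes(a)
utf8_bytes = encode_for_output(decoded, 'utf-8')
```

On Python 2, print encode_for_output(decoded) then writes bytes that are already encoded, so the implicit ASCII step never runs.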
Update: Note that this is not the same as the situation with one fewer backslash, where it fails during the decode itself, but with the same error message:
>>> a=u'really long string containing \u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
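The reason, as far as I understand Python 2's behavior, is that calling decode on an object that is already unicode first converts it to a byte string with the default ASCII codec, and only then decodes it. That hidden first step can be reproduced explicitly. The sketch below is only an illustration, and the variable names are arbitrary.

```python
# Already-decoded text: a real euro sign, not a backslash-u escape sequence.
a = u'really long string containing \u20ac and some other text'

try:
    # This mirrors the step Python 2 performs implicitly inside unicode.decode().
    a.encode('ascii')
    failure = None
except UnicodeEncodeError as err:
    failure = err

if failure is not None:
    # Report the codec and the offending position using ASCII-only output.
    print('%s codec failed at position %d' % (failure.encoding, failure.start))
```

So in this case the error is raised before unicode-escape ever sees the data, which is why the message names the ascii codec and an encode failure.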