I have a Polish artist name:
Żółte słonie
In my dataset (a JSON file), it is encoded as:
\u017b\u00f3\u0142te S\u0142onie
I read the JSON, do some pre-processing, and write the output to a text file. When writing, I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u017b' in position 0: character maps to <undefined>
I looked up the Unicode code points for the Polish characters online, and the encoding looks fine to me. Since I have never worked with anything other than Latin text before, I wanted to confirm this with the SO community. If the encoding is right, why is Python not handling it?
Thanks, TM
I ran a simple test with Python 2.7, and it seems that json changes the type of the object from str to unicode. So you have to encode() such a string before writing it to a text file.
#!/usr/bin/env python
# -*- coding: utf8 -*-
import json
s = 'Żółte słonie'
print(type(s))
print(repr(s))
sd = json.dumps(s)
print(repr(sd))
s2 = json.loads(sd)
print(type(s2))
print(repr(s2))
f = open('out.txt', 'w')
try:
    f.write(s2)
except UnicodeEncodeError:
    print('UnicodeEncodeError, encoding data...')
    f.write(s2.encode('UTF8'))
    print('data encoded and saved')
f.close()
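As an alternative to calling encode() by hand, here is a minimal sketch that opens the output file with an explicit encoding. It assumes the console or file default on the asker's machine is a 'charmap' codec such as cp1252 (the question does not say which one), and the path out_utf8.txt is a hypothetical name used only for illustration. io.open is available on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# The JSON text from the question; json.loads turns the \uXXXX escapes
# into real characters (a unicode object on Python 2, a str on Python 3).
raw = '"\\u017b\\u00f3\\u0142te S\\u0142onie"'
name = json.loads(raw)

# Assumption: the failing codec is something like cp1252, which has no
# slot for the Polish letter U+017B, so encoding raises UnicodeEncodeError.
try:
    name.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Hypothetical output path, for illustration only.
out_path = os.path.join(tempfile.gettempdir(), 'out_utf8.txt')

# With an explicit encoding the file object encodes the text itself,
# so no manual encode() call is needed.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(name)

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

print(cp1252_ok)           # False
print(round_trip == name)  # True
```

The JSON data itself is fine: the \u escapes are just a way of writing the characters. The error comes from the codec used when writing, which is why choosing UTF-8 explicitly avoids it.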