Unicode encoding for Polish characters in Python


I have a Polish artist name:

Żółte słonie

In my dataset (json file), it has been encoded as:

\u017b\u00f3\u0142te S\u0142onie

I am reading the json and doing some pre-processing and writing the output to a text file. I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u017b' in position 0: character maps to <undefined>

I looked up the Unicode encoding for Polish characters online, and the encoding looks fine to me. Since I have never worked with anything other than Latin characters before, I wanted to confirm this with the SO community. If the encoding is right, why is Python not handling it?

Thanks, TM


Solution

  • I ran a simple test with Python 2.7, and it seems that json changes the type of the object from str to unicode, so you have to encode() the string before writing it to a text file.

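    #!/usr/bin/env python
    # -*- coding: utf8 -*-
    
    import json
    
    # In Python 2.7 this literal is a byte str (UTF-8, per the coding line above).
    s = 'Żółte słonie'
    print(type(s))
    print(repr(s))
    # dumps() escapes the non-ASCII characters as \uXXXX sequences.
    sd = json.dumps(s)
    print(repr(sd))
    # loads() returns a unicode object, not a str.
    s2 = json.loads(sd)
    print(type(s2))
    print(repr(s2))
    
    f = open('out.txt', 'w')
    try:
        # Writing unicode directly triggers an implicit encode, which fails here.
        f.write(s2)
    except UnicodeEncodeError:
        print('UnicodeEncodeError, encoding data...')
        # Encode to UTF-8 bytes explicitly before writing.
        f.write(s2.encode('UTF8'))
        print('data encoded and saved')
    f.close()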
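As for why Python is "not handling it": the escapes in the JSON are fine, and the error comes from the codec used when writing the file. The 'charmap' in the message suggests a Windows code page was picked as the default. The sketch below assumes it was cp1252 (a guess, the question does not say) and shows an alternative to calling encode() by hand: opening the output with an explicit encoding via io.open, which exists in both Python 2.7 and 3. The file name artists_out.txt and the variable names are made up for the illustration.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# The JSON text as it appears in the dataset: plain ASCII \uXXXX escapes.
raw = '"\\u017b\\u00f3\\u0142te S\\u0142onie"'

# json.loads turns the escapes into real characters (a text string).
name = json.loads(raw)

# Assumption: the failing default codec was a Windows code page such as cp1252.
# It has no mapping for U+017B, so encoding with it raises UnicodeEncodeError.
try:
    name.encode('cp1252')
    failed_on_cp1252 = False
except UnicodeEncodeError:
    failed_on_cp1252 = True

# Hypothetical output path, used only for this illustration.
out_path = os.path.join(tempfile.gettempdir(), 'artists_out.txt')

# Naming the encoding means the platform default is never consulted.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(name + u'\n')

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, 'r', encoding='utf-8') as f:
    round_trip = f.read().strip()

print(failed_on_cp1252)
print(round_trip == name)
```

With this approach the file object does the encoding, so the rest of the pre-processing can keep working with text strings and never needs a try/except around write().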