python unicode encoding utf-8 locale

String with Turkish character to unicode


On Ubuntu:

> s = 'kasım' # ı -> 'i' without dot, lowercase letter, turkish.
> print s
> 'kas\xc4\xb1m'
> unicode(s, 'utf-8') 

works just fine.

On Windows:

> s = 'kasım' # ı -> 'i' without dot, lowercase letter, turkish.
> print s
> 'kas\x8dm'
> unicode(s, 'utf-8') 

throws a UnicodeDecodeError:

  • 'utf-8' codec can't decode byte 0xfd in position 3: invalid start byte

Before that, the locale is set with code like this:

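 import locale
 import sys

 # Pick the Turkish locale name that the current platform understands
 if sys.platform.startswith('win'):
     locale_to_set = 'turkish'
 elif sys.platform.startswith('linux'):
     locale_to_set = 'tr_TR.utf-8'

 locale.setlocale(locale.LC_ALL, locale_to_set)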

What did I do wrong or miss? Any ideas would be appreciated.

Note:
I am getting the word 'Kasım' (which means November) from datetime.datetime.utcnow().strftime(....), and the user can change the language according to preference.


Solution

  • It's a bad idea to depend on the input encoding of your system, because it can differ from system to system, as you discovered. For this reason, it is better to avoid non-ASCII characters in your source code and write them as Unicode escape sequences instead. For example:

    name = u'kas\u0131m'
    

    If your string comes from elsewhere in the system, such as a localized strftime call, decode it to Unicode using the encoding of the current locale:

    import locale

    # getlocale() returns (language code, encoding); only the encoding is needed
    ignore, encoding = locale.getlocale()
    name = unicode(s, encoding)
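As a rough illustration of why the decode fails on Windows, the sketch below shows the same word as bytes in three encodings. It uses byte literals and `.decode()` so that it runs on both Python 2 and Python 3. The helper name `decode_localized` is hypothetical, and the fallback to `locale.getpreferredencoding()` is an assumption for the case where `getlocale()` reports no encoding. The byte 0xfd from the error message is "ı" in Windows-1254 (cp1254), while the 0x8d shown at the prompt would be consistent with a console using cp857.

```python
import locale

EXPECTED = u'kas\u0131m'  # "kasım", written with a Unicode escape


def decode_localized(raw, encoding=None):
    """Decode bytes produced under the active locale (hypothetical helper)."""
    if encoding is None:
        # getlocale() may return (None, None) when no locale has been set,
        # so fall back to the platform's preferred encoding (an assumption).
        encoding = locale.getlocale()[1] or locale.getpreferredencoding()
    return raw.decode(encoding)


# The same word as raw bytes in three different encodings
utf8_bytes = b'kas\xc4\xb1m'  # UTF-8, as seen on Ubuntu
ansi_bytes = b'kas\xfdm'      # cp1254, matches the 0xfd in the error message
oem_bytes = b'kas\x8dm'       # cp857, matches the bytes shown on Windows

assert decode_localized(utf8_bytes, 'utf-8') == EXPECTED
assert decode_localized(ansi_bytes, 'cp1254') == EXPECTED
assert decode_localized(oem_bytes, 'cp857') == EXPECTED

# Decoding the cp1254 bytes as UTF-8 fails, as in the question
try:
    ansi_bytes.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

In real use, the encoding argument would be left out after `locale.setlocale(...)` has been called, so that the strftime output is decoded with whatever encoding the active locale reports.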