Tags: python, unicode, encoding, utf-8

Converting unicode string to utf-8


Firstly, I am aware that there are tons of questions regarding en/de-coding of strings in Python 2.x, but I can't seem to find a solution to this problem.

I have a unicode string that contains the letter č, which is represented as \u00c4\u008d.

If I write the following in the Python console:

>>> a = u"\u00c4\u008d"
>>> print a

I get two strange characters printed instead of č, probably because the string really holds the UTF-8 bytes of č (0xC4 0x8D) stored as two separate code points. I therefore tried .decode("utf-8"), but that raises the standard UnicodeEncodeError.

Do you know how I can make Python print that string as č in the console?


Solution

  • After fighting with Python for over an hour, I decided to look for a solution in another language. This is how my goal can be achieved in C#:

    var s = "\u00c4\u008d";
    var newS = Encoding.UTF8.GetString(Encoding.Default.GetBytes(s));
    File.WriteAllText(@"D:\tmp\test.txt", newS, Encoding.UTF8);
    

    Finally! The file now contains č.

    Inspired by this C# approach, I came up with the following (seemingly) equivalent solution in Python:

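    >>> s = u"\u00c4\u008d"
    >>> # Each code point is below 256, so ord() yields the original byte values.
    >>> arr = bytearray(map(ord, s))
    >>> # Decode those bytes as UTF-8 to recover the intended character.
    >>> print arr.decode("utf-8")
    č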
    

    I'm not sure how robust this solution is, but it works in my case.
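
For what it's worth, the same byte-for-byte reinterpretation can probably be written more directly with the latin-1 codec, which maps code points U+0000 to U+00FF onto the identical byte values. The sketch below assumes every character in the string is below U+0100 (true for the string in the question). The helper name `fix_mojibake` is made up for illustration and is not part of any library. It also shows where the UnicodeEncodeError likely comes from: in Python 2, calling .decode() on a unicode object first encodes it implicitly as ASCII, and that step fails on non-ASCII characters.

```python
from __future__ import print_function


def fix_mojibake(text):
    """Reinterpret a unicode string whose code points are really UTF-8 bytes.

    Hypothetical helper for illustration; assumes all code points are < 256.
    """
    # latin-1 turns each code point 0..255 into the byte with the same value,
    # which matches what bytearray(map(ord, text)) does.
    raw = text.encode("latin-1")
    # Decode those bytes as the UTF-8 they were originally meant to be.
    return raw.decode("utf-8")


s = u"\u00c4\u008d"
fixed = fix_mojibake(s)
print(fixed == u"\u010d")  # True: U+010D is c-with-caron

# Sketch of the failure from the question: the implicit ASCII encode
# that Python 2 performs before unicode.decode() cannot handle U+00C4.
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
```

If the string can contain characters above U+00FF, the latin-1 step raises UnicodeEncodeError itself, so this only suits input that was mis-decoded in exactly this way.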