How to convert a list of bytes (Unicode) to a Python string?


I have a list of bytes (8-bit values; in C/C++ terms they make up a wchar_t string) that together form a Unicode string, byte by byte. How can I convert those values into a Python string? I tried a few things, but none of them could join each pair of bytes into one character and build the whole string from it. Thank you.


Solution

  • Converting a sequence of bytes to a Unicode string is done by calling the decode() method on that str (in Python 2.x) or bytes (Python 3.x) object.

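    If you actually have a list of bytes, first join it into a single object: ''.join(bytelist) in Python 2.x, or b''.join(bytelist) in Python 3.x (use bytes(bytelist) instead if the list holds integers rather than one-byte bytes objects).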

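    You need to specify the encoding that was used to encode the original Unicode string, for example 'utf-8'. For a wchar_t-style buffer with two bytes per code unit, the encoding is most likely UTF-16 ('utf-16-le' or 'utf-16-be', depending on byte order).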

    However, the term "Python string" is ambiguous and version-dependent. The Python str type is a byte string in Python 2.x and a Unicode string in Python 3.x. So, in Python 2, ''.join(bytelist) alone already gives you a str object (still encoded bytes), while decode() gives you a unicode object.

    Demo for Python 2:

    In [1]: 'тест'
    Out[1]: '\xd1\x82\xd0\xb5\xd1\x81\xd1\x82'
    
    In [2]: bytelist = ['\xd1', '\x82', '\xd0', '\xb5', '\xd1', '\x81', '\xd1', '\x82']
    
    In [3]: ''.join(bytelist).decode('utf-8')
    Out[3]: u'\u0442\u0435\u0441\u0442'
    
    In [4]: print ''.join(bytelist).decode('utf-8') # encodes to the terminal encoding
    тест
    
    In [5]: ''.join(bytelist) == 'тест'
    Out[5]: True
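
For comparison, here is a minimal Python 3 sketch of the same conversion. The variable names and sample data are made up for illustration; the last case assumes the wchar_t-style buffer from the question is little-endian UTF-16, which may not match your data.

```python
# Python 3 sketch: a "list of bytes" may hold ints (0-255) or one-byte bytes objects.
int_list = [0xD1, 0x82, 0xD0, 0xB5, 0xD1, 0x81, 0xD1, 0x82]
byte_list = [b'\xd1', b'\x82', b'\xd0', b'\xb5', b'\xd1', b'\x81', b'\xd1', b'\x82']

# bytes() accepts an iterable of ints; b''.join() accepts bytes objects.
text_from_ints = bytes(int_list).decode('utf-8')
text_from_bytes = b''.join(byte_list).decode('utf-8')

# Assumption: two bytes per code unit, low byte first (UTF-16-LE).
utf16_list = [0x42, 0x04, 0x35, 0x04, 0x41, 0x04, 0x42, 0x04]
text_from_utf16 = bytes(utf16_list).decode('utf-16-le')

print(text_from_ints, text_from_bytes, text_from_utf16)
```

If the chosen encoding does not match the data, decode() either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth checking a known sample first.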