python unicode utf-8 utf-16

UTF-16 as a sequence of code units in Python


I have the string 'abç' which in UTF-8 is b'ab\xc3\xa7'.

I want it in UTF-16, but not this way:

b'ab\xc3\xa7'.decode('utf-8').encode('utf-16-be')

which gives me:

b'\x00a\x00b\x00\xe7'

The answer I want is the UTF-16 code units, that is, a list of int:

[97, 98, 231]

Is there any straightforward way to do that?

And of course, the reverse. Given a list of ints which are UTF-16 code units, how do I convert that to UTF-8?


Solution

  • A simple solution that works for any string made up only of BMP characters is to take each character's code point:

    def sort_of_get_utf16_code_units(s):
        return list(map(ord, s))
    
    
    print(sort_of_get_utf16_code_units('abç'))
    

    Output:

    [97, 98, 231]
    

    However, that doesn't work for characters outside the Basic Multilingual Plane (BMP):

    print(sort_of_get_utf16_code_units('😊'))
    

    The output is the Unicode code point, not the UTF-16 code units:

    [128522]
    

    The UTF-16 code units your question asks for would be a surrogate pair:

    [55357, 56842]
    

    To get that, encode to big-endian UTF-16 and read the bytes two at a time:

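    def get_utf16_code_units(s):
        # Big-endian without a BOM, so every code unit is exactly two bytes.
        utf16_bytes = s.encode('utf-16-be')
        # byteorder is passed explicitly; it only has a default from Python 3.11.
        return [int.from_bytes(utf16_bytes[i:i+2], byteorder='big') for i in range(0, len(utf16_bytes), 2)]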
    
    
    print(get_utf16_code_units('😊'))
    

    Output:

    [55357, 56842]
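
As a cross-check on those two numbers, here is a minimal sketch of the surrogate-pair arithmetic that UTF-16 defines for code points above U+FFFF. The helper name `split_into_surrogates` is made up for illustration; in practice `str.encode('utf-16-be')` already does this work for you.

```python
def split_into_surrogates(code_point):
    """Return the UTF-16 code units for a single Unicode code point."""
    if code_point < 0x10000:
        # BMP code points are stored as a single 16-bit code unit.
        return [code_point]
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return [high, low]


print(split_into_surrogates(ord('😊')))  # [55357, 56842]
```

The result matches what the encode-based function returns, so the two approaches can be used to sanity-check each other.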
    

    Doing the reverse is similar:

    def utf16_code_units_to_string(code_units):
        utf16_bytes = b''.join([unit.to_bytes(2, byteorder='big') for unit in code_units])
        return utf16_bytes.decode('utf-16-be')
    
    
    print(utf16_code_units_to_string([55357, 56842]))
    

    Output:

    😊
    

    From Python 3.11 on, byteorder defaults to 'big' for both int.to_bytes and int.from_bytes; earlier versions require the argument, so it is safest to pass it explicitly.
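
The question asks for UTF-8 in the reverse direction, while the function above returns a `str`. Below is a minimal sketch of the full round trip, assuming UTF-8 `bytes` going in and coming out. The function names `utf8_to_utf16_code_units` and `utf16_code_units_to_utf8` are illustrative, and `struct` is used here only as an alternative to slicing two bytes at a time, not because the approach above needs it.

```python
import struct


def utf8_to_utf16_code_units(utf8_bytes):
    """Decode UTF-8 bytes and return the UTF-16 code units as a list of int."""
    utf16_bytes = utf8_bytes.decode('utf-8').encode('utf-16-be')
    # '>' = big-endian, 'H' = unsigned 16-bit integer, one per code unit.
    return list(struct.unpack(f'>{len(utf16_bytes) // 2}H', utf16_bytes))


def utf16_code_units_to_utf8(code_units):
    """Pack UTF-16 code units into bytes, decode them, and re-encode as UTF-8."""
    utf16_bytes = struct.pack(f'>{len(code_units)}H', *code_units)
    return utf16_bytes.decode('utf-16-be').encode('utf-8')


units = utf8_to_utf16_code_units(b'ab\xc3\xa7')
print(units)                            # [97, 98, 231]
print(utf16_code_units_to_utf8(units))  # b'ab\xc3\xa7'
```

If the input list contains an unpaired surrogate, the `decode('utf-16-be')` step should raise `UnicodeDecodeError`, so callers who cannot trust their input may want to catch that.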