UTF-16 as a sequence of code units in Python

I have the string 'abç' which in UTF-8 is b'ab\xc3\xa7'.

I want it in UTF-16, but not this way:

b'ab\xc3\xa7'.decode('utf-8').encode('utf-16-be')

which gives me:

b'\x00a\x00b\x00\xe7'

The answer I want is the UTF-16 code units, that is, a list of int:

[97, 98, 231]

Is there any straightforward way to do that?

And of course, the reverse. Given a list of ints which are UTF-16 code units, how do I convert that to UTF-8?

Solution

The simple solution that may work in many cases would be something like:

def sort_of_get_utf16_code_units(s):
    return list(map(ord, s))


print(sort_of_get_utf16_code_units('abç'))

Output:

[97, 98, 231]

However, that doesn't work for characters outside the Basic Multilingual Plane (BMP):

print(sort_of_get_utf16_code_units('😊'))

Output is the Unicode code point:

[128522]

Where you might have expected the code units (as your question states):

[55357, 56842]
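
Those two numbers are a surrogate pair. As a rough sketch of the arithmetic UTF-16 uses for code points above U+FFFF (the helper name surrogate_pair is made up for illustration and is not part of the standard library):

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


print(surrogate_pair(ord('😊')))  # (55357, 56842)
```

This is only meant to show where the values come from; in practice it is simpler to let the codec do the work.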

To get the code units, encode the string as UTF-16 and read the bytes two at a time:

def get_utf16_code_units(s):
    utf16_bytes = s.encode('utf-16-be')
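    # Each code unit is 2 bytes; byteorder is passed explicitly because it is required before Python 3.11
    return [int.from_bytes(utf16_bytes[i:i+2], byteorder='big') for i in range(0, len(utf16_bytes), 2)]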


print(get_utf16_code_units('😊'))

Output:

[55357, 56842]
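
If you would rather not slice the bytes by hand, one possible alternative is the standard struct module, which can unpack the whole buffer as big-endian unsigned 16-bit integers in one call. A minimal sketch (the function name get_utf16_code_units_struct is just an illustrative choice):

```python
import struct


def get_utf16_code_units_struct(s):
    utf16_bytes = s.encode('utf-16-be')
    count = len(utf16_bytes) // 2       # number of 16-bit code units
    # '>' = big-endian, 'H' = unsigned 16-bit integer, repeated count times
    return list(struct.unpack(f'>{count}H', utf16_bytes))


print(get_utf16_code_units_struct('abç😊'))  # [97, 98, 231, 55357, 56842]
```

Both approaches should give the same list; which one reads better is mostly a matter of taste.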

Doing the reverse, turning a list of code units back into a string, is similar:

def utf16_code_units_to_string(code_units):
    utf16_bytes = b''.join([unit.to_bytes(2, byteorder='big') for unit in code_units])
    return utf16_bytes.decode('utf-16-be')


print(utf16_code_units_to_string([55357, 56842]))

Output:

😊

The byteorder argument defaults to 'big' only from Python 3.11 onward; on earlier versions it is required, so it is safer to always pass it explicitly.
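
Since the question asks for UTF-8 specifically, the decoded string still needs one more encode step. A small sketch, assuming the input is well-formed UTF-16 (the function name utf16_code_units_to_utf8 is hypothetical):

```python
def utf16_code_units_to_utf8(code_units):
    # Pack each code unit as 2 big-endian bytes, decode as UTF-16, re-encode as UTF-8
    utf16_bytes = b''.join(unit.to_bytes(2, byteorder='big') for unit in code_units)
    return utf16_bytes.decode('utf-16-be').encode('utf-8')


print(utf16_code_units_to_utf8([97, 98, 231]))    # b'ab\xc3\xa7'
print(utf16_code_units_to_utf8([55357, 56842]))   # b'\xf0\x9f\x98\x8a'
```

Note that a lone surrogate such as [55357] is not valid UTF-16, so the decode step raises UnicodeDecodeError with the default strict error handling.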