What is the replacement for the deprecated array of Unicode characters?

An array of Unicode characters can be used as a mutable string:

import array

ins = "Aéí"
ms = array.array('u', ins)
ms[0] = "ä"
outs = ms.tounicode()
# äéí

But type 'u' has been deprecated since Python 3.3. What is the modern replacement?

I could do:

ms = list(ins)
# mutate
outs = ''.join(ms)

But I find a list of characters very memory-inefficient compared to the array.

Alternatively:

ms = array.array('L', (ord(ch) for ch in ins))
ms[0] = ord("ä")
outs = "".join(chr(ch) for ch in ms)

But it is far less readable than the deprecated original.

Update 2024: Python 3.13 (expected release in autumn 2024) will have array type 'w' that can be used as a 'u' replacement.
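For illustration, here is a minimal sketch of how the 'w' type code could be used while still running on older interpreters. The helper name replace_char is hypothetical, and the fallback branch simply uses the list approach shown above; it is not meant as the definitive way to do this.

```python
import array
import sys


def replace_char(text, index, new_char):
    """Return text with the character at index replaced (hypothetical helper)."""
    if sys.version_info >= (3, 13):
        # 'w' was added in Python 3.13 and holds one full code point per item.
        ms = array.array('w', text)
        ms[index] = new_char
        return ms.tounicode()
    # Older interpreters: fall back to a plain list of characters.
    chars = list(text)
    chars[index] = new_char
    return ''.join(chars)


print(replace_char("Aéí", 0, "ä"))
```

On either branch the call above should print äéí.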

Solution

This is similar to your last example, but initializes the array in a more readable and efficient way. You must choose an array type code whose items are four bytes wide, because UTF-32 uses four bytes per code point. 'I' is the type code for unsigned int, which is four bytes on most platforms. For portability you may want to choose the type code programmatically.

#!coding:utf8
import array
import sys

# Verifying the item size.
assert array.array('I').itemsize == 4

# Choose the encoding based on native endianness:
encoding = 'utf-32le' if sys.byteorder == 'little' else 'utf-32be'

ins = "Aéí"
ms = array.array('I',ins.encode(encoding))
ms[0] = ord('ä')
print(ms.tobytes().decode(encoding))

Output:

äéí
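As a sketch of choosing the type code programmatically instead of asserting on 'I', the following probes a few unsigned type codes and picks the first one that is four bytes wide. The names four_byte_typecode, to_array and to_str are hypothetical helpers, and the canonical codec names 'utf-32-le' and 'utf-32-be' are used here.

```python
import array
import sys


def four_byte_typecode():
    """Return an unsigned integer type code whose items are 4 bytes wide."""
    for code in ('I', 'L', 'H'):
        if array.array(code).itemsize == 4:
            return code
    raise RuntimeError("no 4-byte unsigned array type code on this platform")


TYPECODE = four_byte_typecode()
# The array stores native-endian integers, so match the UTF-32 byte order.
ENCODING = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'


def to_array(text):
    """Encode text into a mutable array of code points."""
    return array.array(TYPECODE, text.encode(ENCODING))


def to_str(ms):
    """Decode an array of code points back into a str."""
    return ms.tobytes().decode(ENCODING)


ms = to_array("Aéí")
ms[0] = ord('ä')
print(to_str(ms))
```

This keeps the fast encode/decode round trip while failing loudly on a platform where no suitable type code exists.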

Timings for a 1000-character string show that the encode/decode round trip is far faster than converting character by character:

In [7]: s = ''.join(chr(x) for x in range(1000))

In [8]: %timeit ms = array.array('I',s.encode('utf-32le'))
1.77 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [9]: %timeit ms = array.array('I',(ord(x) for x in s))
167 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit outs = "".join(chr(x) for x in ms)
194 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [23]: %timeit outs = ms.tobytes().decode('utf-32le')
3.92 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But you may be overthinking it. I don't know what string sizes you are dealing with, but simply using list(data) gives faster mutation, at the cost of more memory, and it isn't that bad. Here's a string of all the non-BMP code points (~1M characters), with timings for rebuilding an immutable string by slicing, mutating an array, and mutating a list:

In [67]: data = ''.join(chr(x) for x in range(0x10000,0x110000))

In [68]: ms = array.array('I',data.encode('utf-32le'))

In [69]: %%timeit global data
    ...: data = data[:500] + 'a' + data[501:]
    ...:
3.33 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [70]: %timeit ms[500] = ord('a')
73.6 ns ± 0.433 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [71]: %%timeit v = list(data)
    ...: v[500] = 'a'
    ...:
28.7 ns ± 0.144 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [72]: sys.getsizeof(data)
Out[72]: 4194380

In [73]: sys.getsizeof(ms)
Out[73]: 4456524

In [74]: sys.getsizeof(list(data))
Out[74]: 9437296

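Mutating a list is straightforward, about 2.5x faster than mutating the array (28.7 ns versus 73.6 ns), and by these sys.getsizeof figures uses only a little more than 2x the memory. Note that sys.getsizeof on a list counts only the list's own pointer storage, not the character objects it refers to, so the list's true footprint is higher than the figure shown.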
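When comparing memory, keep in mind that sys.getsizeof is shallow. The sketch below, which assumes CPython and uses a smaller 1000-character sample than the timings above, adds the size of each distinct element object to the list's own size so the two containers can be compared more fairly. Exact byte counts vary by Python version and platform, so none are shown.

```python
import array
import sys

# 1000 non-BMP characters (a smaller sample than the ~1M used above).
data = ''.join(chr(x) for x in range(0x10000, 0x10000 + 1000))

chars = list(data)
shallow = sys.getsizeof(chars)  # the list's pointer storage only
# Add each distinct element object once (keyed by identity).
unique_elements = {id(c): c for c in chars}.values()
deep = shallow + sum(sys.getsizeof(c) for c in unique_elements)

encoding = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
# Assumes 'I' is 4 bytes wide, as in the example above.
ms = array.array('I', data.encode(encoding))

print("list, pointers only:", shallow)
print("list plus elements: ", deep)
print("array:              ", sys.getsizeof(ms))
```

Whether the extra memory matters depends on the string sizes involved; for short strings the list remains the simplest choice.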