An array of Unicode characters can be used as a mutable string:
import array
ins = "Aéí"
ms = array.array('u', ins)
ms[0] = "ä"
outs = ms.tounicode()
# äéí
But type 'u' has been deprecated since Python 3.3. What is the modern replacement?
I could do:
ms = list(ins)
# mutate
outs = ''.join(ms)
But I find a list of characters very memory inefficient compared to the array.
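For a rough sense of that overhead (the exact sizes are CPython-specific and approximate): a str stores ASCII characters compactly, one byte each, while a list holds one pointer per character, each pointing to a separate one-character str object.

```python
import sys

s = "x" * 1000
# Compact str: ~1 byte per ASCII char plus a small fixed header.
print(sys.getsizeof(s))
# List: 8 bytes per pointer plus a header -- and this does not even
# count the 1000 individual one-character str objects it points to.
print(sys.getsizeof(list(s)))
```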
Alternatively:
ms = array.array('L', (ord(ch) for ch in ins))
ms[0] = ord("ä")
outs = "".join(chr(ch) for ch in ms)
But it is far less readable than the deprecated original.
Update 2024: Python 3.13 (released October 2024) adds array typecode 'w' (a Py_UCS4 character type) as a direct replacement for 'u'.
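On 3.13+ the new typecode makes the original snippet work essentially unchanged; a sketch, guarded by a version check so it is a no-op on older interpreters:

```python
import array
import sys

# Typecode 'w' (Py_UCS4) requires Python 3.13+.
if sys.version_info >= (3, 13):
    ins = "Aéí"
    ms = array.array('w', ins)
    ms[0] = 'ä'
    print(ms.tounicode())  # äéí
```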
This is similar to your last example, but it initializes the array in a more readable and efficient way. You must choose a typecode whose items are four bytes. 'I' is the code for unsigned int, which is four bytes on most platforms; for portability you may want to choose the typecode programmatically.
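One way to pick the typecode programmatically (a sketch; 'I' and 'L' are the usual candidates):

```python
import array

# Find an unsigned integer typecode whose items are exactly 4 bytes.
# 'I' (unsigned int) is 4 bytes on most platforms; 'L' (unsigned long)
# is 4 bytes on Windows but usually 8 bytes on 64-bit Unix.
typecode = next(tc for tc in ('I', 'L') if array.array(tc).itemsize == 4)
print(typecode)
```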
# coding: utf-8
import array
import sys
# Verifying the item size.
assert array.array('I').itemsize == 4
# Choose the encoding based on native endianness:
encoding = 'utf-32le' if sys.byteorder == 'little' else 'utf-32be'
ins = "Aéí"
ms = array.array('I', ins.encode(encoding))
ms[0] = ord('ä')
print(ms.tobytes().decode(encoding))
Output:
äéí
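If you want the edits to read like string indexing, the encode/decode dance can be wrapped in a small class. This is a sketch (the class name and methods are mine, not stdlib), assuming 'I' is four bytes as asserted above:

```python
import array
import sys


class MutableString:
    """Sketch: wrap a UTF-32 array so edits read like string indexing."""
    _enc = 'utf-32le' if sys.byteorder == 'little' else 'utf-32be'

    def __init__(self, s):
        # One 4-byte code point per character (assumes 'I' itemsize == 4).
        self._a = array.array('I', s.encode(self._enc))

    def __getitem__(self, i):
        return chr(self._a[i])

    def __setitem__(self, i, ch):
        self._a[i] = ord(ch)

    def __len__(self):
        return len(self._a)

    def __str__(self):
        return self._a.tobytes().decode(self._enc)


ms = MutableString("Aéí")
ms[0] = 'ä'
print(ms)  # äéí
```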
Timings for a 1000-element string show that encoding/decoding is quite a bit faster than converting character by character:
In [7]: s = ''.join(chr(x) for x in range(1000))
In [8]: %timeit ms = array.array('I',s.encode('utf-32le'))
1.77 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [9]: %timeit ms = array.array('I',(ord(x) for x in s))
167 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [21]: %timeit outs = "".join(chr(x) for x in ms)
194 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [23]: %timeit outs = ms.tobytes().decode('utf-32le')
3.92 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But you may be overthinking it. I don't know what string sizes you are dealing with, but just using list(data) is faster, if less memory efficient, and it isn't that bad. Here's a string of non-BMP characters (~1M of them), and timings for immutable string slicing, mutating an array, and mutating a list:
In [67]: data = ''.join(chr(x) for x in range(0x10000,0x110000))
In [68]: ms = array.array('I',data.encode('utf-32le'))
In [69]: %%timeit global data
...: data = data[:500] + 'a' + data[501:]
...:
3.33 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [70]: %timeit ms[500] = ord('a')
73.6 ns ± 0.433 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [71]: %%timeit v = list(data)
...: v[500] = 'a'
...:
28.7 ns ± 0.144 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [72]: sys.getsizeof(data)
Out[72]: 4194380
In [73]: sys.getsizeof(ms)
Out[73]: 4456524
In [74]: sys.getsizeof(list(data))
Out[74]: 9437296
Mutating a list is straightforward, roughly 2.5x faster than mutating the array, and uses only a little more than 2x the memory.
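A minimal sketch of that list-based round trip (the sample data is mine): convert once, mutate cheaply, and pay the join cost only when you need a str back.

```python
# Small sample of non-BMP characters (1024 of them).
data = ''.join(chr(x) for x in range(0x10000, 0x10400))

v = list(data)    # one-time O(n) conversion to a mutable form
v[500] = 'a'      # each subsequent edit is O(1)
out = ''.join(v)  # pay the join cost only when a str is needed again
print(out[500])   # a
```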