
What is the replacement for the deprecated array of Unicode characters?


An array of Unicode characters can be used as a mutable string:

import array

ins = "Aéí"
ms = array.array('u', ins)
ms[0] = "ä"
outs = ms.tounicode()
# äéí

But type code 'u' has been deprecated since Python 3.3. What is the modern replacement?

I could do:

ms = list(ins)
# mutate
outs = ''.join(ms)

But I find a list of characters very memory inefficient compared to the array.

Alternatively:

ms = array.array('L', (ord(ch) for ch in ins))
ms[0] = ord("ä")
outs = "".join(chr(ch) for ch in ms)

But it is far less readable than the deprecated original.


Update 2024: Python 3.13 (expected release in autumn 2024) will have array type 'w' that can be used as a 'u' replacement.
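
A minimal sketch of that replacement, assuming the interpreter may or may not provide 'w': it checks array.typecodes first and falls back to the list approach from above on older versions. The variable names follow the question's example.

```python
import array

ins = "Aéí"

if "w" in array.typecodes:
    # Python 3.13+: 'w' holds one full Unicode code point per item.
    ms = array.array("w", ins)
    ms[0] = "ä"
    outs = ms.tounicode()
else:
    # Older interpreters: fall back to a plain list of characters.
    ms = list(ins)
    ms[0] = "ä"
    outs = "".join(ms)

print(outs)
```

Either branch leaves the same string in outs, so the rest of the code does not need to know which one ran.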


Solution

  • This is similar to your last example, but initializes the array in a more readable and efficient way. You must choose an array type code whose items are four bytes wide. 'I' is the code for unsigned int, which is four bytes on most platforms. For portability you may want to choose the type code programmatically.

    import array
    import sys
    
    # Verify that each array item is four bytes wide.
    assert array.array('I').itemsize == 4
    
    # Choose the encoding based on native endianness:
    encoding = 'utf-32le' if sys.byteorder == 'little' else 'utf-32be'
    
    ins = "Aéí"
    # Encode to UTF-32 so each character becomes one 4-byte array item.
    ms = array.array('I', ins.encode(encoding))
    ms[0] = ord('ä')
    print(ms.tobytes().decode(encoding))
    

    Output:

    äéí
    

    Timings for a 1000-character string show that converting through UTF-32 is roughly 50 to 100 times faster than converting one code point at a time:

    In [7]: s = ''.join(chr(x) for x in range(1000))
    
    In [8]: %timeit ms = array.array('I',s.encode('utf-32le'))
    1.77 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
    In [9]: %timeit ms = array.array('I',(ord(x) for x in s))
    167 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [21]: %timeit outs = "".join(chr(x) for x in ms)
    194 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [23]: %timeit outs = ms.tobytes().decode('utf-32le')
    3.92 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    But you may be overthinking it. I don't know what string sizes you are dealing with, but just using list(data) is faster to mutate, if less memory efficient, and it isn't that bad. Here's a string of all non-BMP code points (~1M characters), with timings for rebuilding an immutable string by slicing, mutating an array, and mutating a list:

    In [67]: data = ''.join(chr(x) for x in range(0x10000,0x110000))
    
    In [68]: ms = array.array('I',data.encode('utf-32le'))
    
    In [69]: %%timeit global data
        ...: data = data[:500] + 'a' + data[501:]
        ...:
    3.33 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [70]: %timeit ms[500] = ord('a')
    73.6 ns ± 0.433 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    
    In [71]: %%timeit v = list(data)
        ...: v[500] = 'a'
        ...:
    28.7 ns ± 0.144 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    
    In [72]: sys.getsizeof(data)
    Out[72]: 4194380
    
    In [73]: sys.getsizeof(ms)
    Out[73]: 4456524
    
    In [74]: sys.getsizeof(list(data))
    Out[74]: 9437296
    

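    Mutating a list is straightforward, about 2.5x faster than mutating the array, and the list object itself uses only a little more than 2x the memory. Note that sys.getsizeof reports only the list's own array of references, not the one-character strings it refers to, so the list's real footprint is larger than the figure above.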
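
The portability advice in the answer above (choosing the four-byte type code programmatically) could be sketched as follows. The names four_byte_typecode, to_mutable and to_str are hypothetical helpers introduced here for illustration, not part of the array module, and the sketch assumes that either 'I' or 'L' is four bytes wide on the platform.

```python
import array
import sys


def four_byte_typecode():
    """Return an unsigned integer type code whose items are 4 bytes wide."""
    for code in ("I", "L"):  # unsigned int first, then unsigned long
        if array.array(code).itemsize == 4:
            return code
    raise RuntimeError("no 4-byte unsigned array type code on this platform")


# UTF-32 in native byte order and without a BOM, so that every
# 4-byte group maps directly onto one array item.
ENCODING = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"


def to_mutable(text):
    """Hypothetical helper: copy a str into an array of code points."""
    return array.array(four_byte_typecode(), text.encode(ENCODING))


def to_str(ms):
    """Hypothetical helper: convert the array of code points back to a str."""
    return ms.tobytes().decode(ENCODING)


ms = to_mutable("Aéí")
ms[0] = ord("ä")  # items are integers, so assign the code point
outs = to_str(ms)
print(outs)
```

Keeping the conversion in two small functions means the type code and encoding are decided in one place, and the calling code only deals with integer code points.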