Search code examples
pythonstringnumpystring-concatenation

Concatenate and shuffle entries in a list of strings in Python


Suppose I have a list (or a NumPy array) of strings, e.g., X = ['abc', 'def', 'ghi', 'jkl'], a list of indices, e.g., J = [0, 3], and a random permutation p, e.g., [0, 3, 2, 5, 1, 4].

I want to find the most time efficient way to concatenate the strings in X corresponding to the indices J, i.e., concatenate X[j] for each j in J to get 'abcjkl', and apply the permutation p to this string, i.e., 'ajclbk'. Applying a permutation p to a string Y results in another string Z such that Z[i] = Y[p[i]].

Right now, I have X, J, and p as NumPy arrays and do

X = numpy.array(['abc', 'def', 'ghi', 'jkl'])
J = numpy.array([0, 3])
p = numpy.array([0, 3, 2, 5, 1, 4])
Y = ''.join(X[J])
Z = numpy.array(list(Y))
res = ''.join(Z[p])

Is there a faster way to achieve this? I don't have to use NumPy arrays if a solution with lists exist. In my application, list X can have 10 Million strings (formed using the English alphabet) each of length 1000, indices J can be of length 5 Million, and permutation p can be of length 5 Billion.


Solution

  • TL;DR

    Under the assumption that the strings in the original input are of equal length, you can resolve indices concatenation, which, combined with np.ndarray.view(), would allow for a very fast NumPy-only implementation:

    import numpy as np
    
    
    def take2s_reidx_np(arr, idx_a, idx_b):
        k = len(arr[0])
        q, r = np.divmod(idx_b, k)
        return arr.view(("U", 1))[r + k * idx_a[q]].view(("U", len(idx_b)))[0]
    

    which can be re-implemented with Numba to minimize temporary memory usage:

    import numpy as np
    import numba as nb
    
    
    @nb.njit(fastmath=True)
    def apply_cat_idx_nb(arr, idx_a, idx_b, k):
        n = len(idx_b)
        result = np.empty(n, dtype=arr.dtype)
        for i in range(n):
            q, r = divmod(idx_b[i], k)
            result[i] = arr[r + k * idx_a[q]]
        return result
    
    
    def take2s_at_reidx_nb(arr, idx_a, idx_b):
        # use `np.int32` as a proxy of `("U", 1)` because of Numba support
        return apply_cat_idx_nb(arr.view(np.int32), idx_a, idx_b, len(arr[0])).view(("U", len(idx_b)))[0]
    

    It is unclear if using np.uint8 internally leads to any speed advantage for the more optimized methods.

    While if the input uses bytes instead of str, this operation is typically (much?) faster TODO


    Longer Answer

    Actually, NumPy arrays should be the most convenient containers to solve this problem when it comes to speed, especially for very large inputs.

    However, there is no need to go back and forth to strings (or bytes): you can just apply the indices on the array themselves, or a suitable view of it with np.ndarray.view(), which is a brutally cheap operation. The same operation is also used to view the data again in str format.

    Assuming OP's code can be summarized by the following function:

    import numpy as np
    
    
    def take2s_OP(arr, idx_a, idx_b):
        tmp_str = "".join(arr[idx_a])
        tmp_arr = np.array(list(tmp_str))
        return "".join(tmp_arr[idx_b])
    

    a solution based on np.ndarray.view() would look like:

    def take2s_np(arr, idx_a, idx_b):
        return arr[idx_a].view(("U", 1))[idx_b].view(("U", len(idx_b)))[0]
    

    The above does have the advantage of working with Unicode strings (not just ASCII) and does not require prior input manipulation.

    Now, there is still a relatively expensive (O(n)) computation: the advanced indexing, which is done twice in the above code.

    By using the fact that the strings are of the same length k, we can write a single index idx_ab as the combination of the two idx_a followed by idx_b:

    r, q = divmod(idx_b, k)
    idx_ab = r + k * idx_a[q]
    arr[idx_ab] == arr[idx_a][idx_b]
    

    With NumPy this would look something like:

    def take2s_reidx_np(arr, idx_a, idx_b):
        k = len(arr[0])
        q, r = np.divmod(idx_b, k)
        return arr.view(("U", 1))[r + k * idx_a[q]].view(("U", len(idx_b)))[0]
    

    However, the above still creates a large temporary array to store the indices that are used only once. It would be more efficient to generate and access the indices on the go, thus removing the need for the large temporary object. For this, it is possible to write some explicit loops to be accelerated with Numba:

    import numpy as np
    import numba as nb
    
    
    @nb.njit(fastmath=True)
    def apply_cat_idx_nb(arr, idx_a, idx_b, k):
        n = len(idx_b)
        result = np.empty(n, dtype=arr.dtype)
        for i in range(n):
            q, r = divmod(idx_b[i], k)
            result[i] = arr[r + k * idx_a[q]]
        return result
    
    
    def take2s_at_reidx_nb(arr, idx_a, idx_b):
        # use `np.int32` as a proxy of `("U", 1)` because of Numba support
        return apply_cat_idx_nb(arr.view(np.int32), idx_a, idx_b, len(arr[0])).view(("U", len(idx_b)))[0]
    

    Note that the original array is initially viewed as np.int32 (which matches the Unicode datatype size) because the creation of Unicode arrays is still unsupported with Numba.

    Since advanced indexing is a heavily optimized functionality in NumPy, it may be more efficient to still use that and a temporary array to store the indices, but to create them with Numba accelerated code:

    @nb.njit(fastmath=True)
    def cat_idx_nb(idx_a, idx_b, k):
        n = len(idx_b)
        result = np.empty(n, dtype=idx_b.dtype)
        for i in range(n):
            q, r = divmod(idx_b[i], k)
            result[i] = r + k * idx_a[q]
        return result
    
    
    def take2s_reidx_nb(arr, idx_a, idx_b):
        idx_ab = cat_idx_nb(idx_a, idx_b, len(arr[0]))
        return arr.view(("U", 1))[idx_ab].view(("U", len(idx_b)))[0]
    

    Eventually, working with bytes instead of str is going to be faster (but limited to ASCII characters).

    The get from the NumPy array of characters to the bytes, one can simply use the np.ndarray.tobytes() method.

    The approaches presented earlier would read:

    def take2b_OP(arr, idx_a, idx_b):
        tmp_bstr = b"".join(arr[idx_a])
        tmp_arr = np.array(list(tmp_bstr), dtype=np.uint8)
        return b"".join(tmp_arr[idx_b])
    
    
    def take2b_np(arr, idx_a, idx_b):
        return arr[idx_a].view(np.uint8)[idx_b].tobytes()
    
    
    def take2b_reidx_np(arr, idx_a, idx_b):
        k = len(arr[0])
        q, r = np.divmod(idx_b, k)
        return arr.view(np.uint8)[r + k * idx_a[q]].tobytes()
    
    
    def take2b_reidx_nb(arr, idx_a, idx_b):
        idx_ab = cat_idx_nb(idx_a, idx_b, len(arr[0]))
        return arr.view(np.uint8)[idx_ab].tobytes()
    
    
    def take2b_at_reidx_nb(arr, idx_a, idx_b):
        return apply_cat_idx_nb(arr.view(np.uint8), idx_a, idx_b, len(arr[0])).tobytes()
    

    If your string do contain only ASCII characters you could also view the Unicode data of size N type in the input NumPy array as np.uint8 instead of Unicode data of size 1. Care must be taken to ensure that np.uint8 items (1 byte) are aligned to the Unicode items (4 bytes), so one needs to read every 4 chars, but that is a simple slicing:

    def take2sbs_np(arr, idx_a, idx_b):
        return arr[idx_a].view(np.uint8)[::4][idx_b].tobytes().decode()
    

    Under the assumption of ASCII-only strings, these can be re-written to use np.uint8 internally, as long as one accounts for the padding mismatch in the Unicode datatype in NumPy compared to np.uint8 (Unicode datatype uses 4 bytes per char) and uses a different approach to converting the data back to string (essentially combining np.ndarray.tobytes() with str.decode() method to get back to a str), e.g.:

    def take2s_reidx_u8_np(arr, idx_a, idx_b):
        k = len(arr[0])
        q, r = np.divmod(idx_b, k)
        return arr.view(np.uint8)[::4][r + k * idx_a[q]].tobytes().decode()
    
    
    def take2s_reidx_u8_nb(arr, idx_a, idx_b):
        idx_ab = cat_idx_nb(idx_a, idx_b, len(arr[0]))
        return arr.view(np.uint8)[::4][idx_ab].tobytes().decode()
    

    It is unclear if using np.uint8 internally leads to any speed advantage for the more optimized methods.


    For completeness I also include also two solutions (one for str, one for bytes) based on @KarlKnechtel's approach:

    def take2s_2d(arr, idx_a, idx_b):
        arr_2d = np.array([list(x.encode()) for x in arr], dtype=np.uint8)
        return bytes(arr_2d[idx_a, :].ravel()[idx_b]).decode()
    
    def take2b_2d(arr, idx_a, idx_b):
        arr_2d = np.array([list(x) for x in arr], dtype=np.uint8)
        return bytes(arr_2d[idx_a, :].ravel()[idx_b])
    

    which have a similar spirit as take2b_np(), but do require some relatively heavy input manipulation.


    To check that these approaches are all equivalent, here are reported being applied to OP's input:

    s_funcs = take2s_OP, take2s_np, take2s_2d, take2s_reidx_np, take2s_reidx_nb, take2s_at_reidx_nb, take2s_reidx_u8_np, take2s_reidx_u8_nb
    b_funcs = take2b_OP, take2b_np, take2b_2d, take2b_reidx_np, take2b_reidx_nb, take2b_at_reidx_nb
    
    X = ["abc", "def", "ghi", "jkl"]
    Xs = np.array(X)
    Xb = np.array([x.encode() for x in X])
    J = np.array([0, 3])
    p = np.array([0, 3, 2, 5, 1, 4])
    
    for func in s_funcs:
        print(f"{func.__name__:>20s}  {func(Xs, J, p)}")
    
    for func in b_funcs:
        print(f"{func.__name__:>20s}  {func(Xb, J, p)}")
    

    And to see which one is faster for large inputs one can use:

    import string
    
    
    M = 1000  # string length
    n = 500_000
    p = n // 2
    q = n // 2
    
    idx_a = np.random.randint(0, n, p)
    idx_b = np.random.randint(0, M * p, q)
    
    arr_s = np.random.choice(list(string.ascii_letters), (o * M)).view(("U", M))
    base = take2s_np2(arr_s, idx_a, idx_b)
    print(f"`str` funcs, n={len(arr_s)}")
    for func in s_funcs:
        res = func(arr_s, idx_a, idx_b)
        print(f"{func.__name__:>20s}  {base == res}  ", end="")
        %timeit func(arr_s, idx_a, idx_b)
    # `str` funcs, n=500000
    #            take2s_OP  True  38.1 s ± 989 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    #            take2s_np  True  591 ms ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    #            take2s_2d  True  31.7 s ± 1.49 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    #      take2s_reidx_np  True  15.9 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #      take2s_reidx_nb  True  12.4 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #   take2s_at_reidx_nb  True  18.4 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #   take2s_reidx_u8_np  True  14.5 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #   take2s_reidx_u8_nb  True  11.8 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    arr_b = np.array([x.encode() for x in arr_s])
    base = take2b_np2(arr_b, idx_a, idx_b)
    print(f"`bytes` funcs, n={len(arr_b)}")
    for func in b_funcs:
        res = func(arr_b, idx_a, idx_b)
        print(f"{func.__name__:>20s}  {base == res}  ", end="")
        %timeit func(arr_b, idx_a, idx_b)
    # `bytes` funcs, n=500000
    #            take2b_OP  True  14.4 s ± 1.34 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    #            take2b_np  True  150 ms ± 9.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    #            take2b_2d  True  30.7 s ± 1.22 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    #      take2b_reidx_np  True  13 ms ± 722 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #      take2b_reidx_nb  True  9.13 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #   take2b_at_reidx_nb  True  12.8 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Note that I used N = 500_000 as larger values would deplete system memory.


    More Benchmarks

    To get some idea how the different methods fare for different input sizes, one can use the following functions, where n is the size of arr, p the size of idx_a and q the size of idx_b:

    import string
    import pandas as pd
    import matplotlib.pyplot as plt
    
    
    def benchmark(
        funcs,
        ii=range(2, 15, 1),
        m=16,
        is_equal=lambda a, b: all(x == y for x, y in zip(a, b)),
        seed=0,
        unit="ms",
        verbose=True,
        use_str=True
    ):
        labels = [func.__name__ for func in funcs]
        units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
        assert unit in units
        np.random.seed(seed)
    
        M = 1000  # string length
        timings = {}
        for i in ii:
            n = 2 ** i
            p = n // 2
            q = n // 2
            if verbose:
                print(f"i={i}, n={n}, m={m}, k={k}  (p={p}, q={q})")
            alph = string.ascii_letters
            if use_str:
                arr = np.random.choice(list(string.ascii_letters), (n * M)).view(("U", M))
            else:
                arr = np.random.choice(list(string.ascii_letters.encode()), (n * M)).astype(np.uint8).view(("S", M))
            idx_a = np.random.randint(0, n, p)
            idx_b = np.random.randint(0, M * p, q)
            base = funcs[0](arr, idx_a, idx_b)
            timings[n] = []
            for func in funcs:
                res = func(arr, idx_a, idx_b)
                is_good = is_equal(base, res)
                timed = %timeit -n 16 -r 4 -q -o [func(arr, idx_a, idx_b) for arr in arrs]
                timing = timed.best / k
                timings[n].append(timing if is_good else None)
                if verbose:
                    print(
                        f"{func.__name__:>24}"
                        f"  {is_good!s:5}"
                        f"  {timing * (10 ** units[unit]):10.3f} {unit}"
                        f"  {timings[n][0] / timing:6.1f}x")
        return timings, labels
    
    
    def plot(timings, labels, title=None, xlabel="Input Size / #", unit="ms"):
        n_rows = 1
        n_cols = 3
        fig, axs = plt.subplots(n_rows, n_cols, figsize=(8 * n_cols, 6 * n_rows), squeeze=False)
        units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
        df = pd.DataFrame(data=timings, index=labels).transpose()
        
        base = df[[labels[0]]].to_numpy()
        (df * 10 ** units[unit]).plot(marker="o", xlabel=xlabel, ylabel=f"Best timing / {unit}", ax=axs[0, 0])
        (df / base * 100).plot(marker='o', xlabel=xlabel, ylabel='Relative speed / %', logx=True, ax=axs[0, 1])
        (base / df).plot(marker='o', xlabel=xlabel, ylabel='Speed Gain / x', ax=axs[0, 2])
    
        if title:
            fig.suptitle(title)
        fig.patch.set_facecolor('white')
    

    called with:

    s_timings, s_labels = benchmark(s_funcs, unit="ms", use_str=True, verbose=True)
    plot(s_timings, s_labels, "Benchmark with `str`", unit="ms")
    
    b_timings, b_labels = benchmark(b_funcs, unit="ms", use_str=False, verbose=True)
    plot(b_timings, b_labels, "Benchmark with `bytes`", unit="ms")
    

    to get the following plots:

    bm_str bm_bytes

    Indicating that:

    • reindexing-based approaches lead to much faster execution (~4000x or more for even modest input size of ~2000 chars)
    • take2*_np() approaches is much faster than the OP's but much slower than reindexing
    • reindexing with advanced indexing done together seems to be generally on par compared to performing the two separately
    • it is unclear whether using np.uint8 for str input is beneficial.
    • working with bytes seems to be generally faster
    • take2s_2d() seems to be marginally faster than take2s_OP()
    • take2b_2d() seems to be largely slower than take2b_OP()
    • take2b_2d() and take2s_2d() seem to be largely on par

    This means that the "preparation" of the input in take2*_2d() adds a significant penalty to the overall approach, which should otherwise be of similar performance as take2b_np().