How can I increase the performance of a Python script using numpy and numba?

Tags: python, numpy, gpu, vectorization, numba

I’m trying to convert decimal numbers to a base-21 number system (for example, 507 = 1·21² + 3·21 + 3, so it becomes [1, 3, 3]).

Input: [15, 18, 28, 11, 7, 5, 41, 139, 6, 507]

Output: [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]

My script works correctly on the CPU.

How can I modify my script to increase performance using the GPU?

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import numba as nb

elements = [
    "n|0",
    "n|1",
    "n|2",
    "n|3",
    "n|4",
    "n|5",
    "n|6",
    "n|7",
    "n|8",
    "n|9",
    "n|10",

    "o|+",
    "o|*",
    "o|/",
    "om|-",

    "bl|(",
    "br|)",

    "e|**2",
    "e|**3",
    "e|**0.5",
    "e|**(1/3)",
]

elements_len = len(elements)

def decimal_to_custom(number):
    x = (number % elements_len)
    ch = [x]
    if (number - x != 0):
        return decimal_to_custom(number // elements_len) + ch
    else:
        return ch

decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507]) #very big array
custom_numbers = []
for decimal_number in decimal_numbers:
    custom_number = decimal_to_custom(decimal_number)
    custom_numbers.append(custom_number)

print(custom_numbers)

Solution

  • Your code can be summarized as:

    import numpy as np
    
    
    def decimal_to_custom(number, k):
        x = (number % k)
        ch = [x]
        if (number - x != 0):
            return decimal_to_custom(number // k, k) + ch
        else:
            return ch
    
    
    def remainders_OP(arr, k):    
        result = []
        for value in arr:
            result.append(decimal_to_custom(value, k))
        return result
    
    
    decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507]) #very big array
    print(remainders_OP(decimal_numbers, elements_len))
    # [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
    

    This code can already be sped up by replacing the costly recursive implementation of decimal_to_custom() with a simpler iterative version, mod_list(), which appends and then reverses rather than performing the very expensive head insert (equivalent to list.insert(0, x)) implemented in the OP:

    def mod_list(x, k):
        result = []
        while x >= k:
            result.append(x % k)
            x //= k
        result.append(x)
        return result[::-1]
    
    
    def remainders(arr, k):
        result = []
        for x in arr:
            result.append(mod_list(x, k))
        return result
    
    
    print(remainders(decimal_numbers, elements_len))
    # [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
    

    Now, both can be accelerated with Numba to obtain some speed-up:

    import numba as nb
    
    
    @nb.njit
    def mod_list_nb(x, k):
        result = []
        while x >= k:
            result.append(x % k)
            x //= k
        result.append(x)
        return result[::-1]
    
    
    @nb.njit
    def remainders_nb(arr, k):
        result = []
        for x in arr:
            result.append(mod_list_nb(x, k))
        return result
    
    
    print(remainders_nb(decimal_numbers, elements_len))
    # [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
    

    A number of options can be passed to the decorator, including target_backend="cuda" to have the computation run on the GPU. As the benchmarks below show, this is not beneficial here: list.append() (as well as list.insert()) is not easy to run in parallel, and hence the massive parallelism of GPUs cannot easily be exploited.
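
    For illustration, a minimal sketch of how such a variant can be declared, assuming the body stays identical to remainders_nb() and only the decorator options change (remainders_cunb is the name used in the benchmarks below):

    @nb.njit(target_backend="cuda")
    def remainders_cunb(arr, k):
        # Assumed variant: same body as remainders_nb(), with the
        # decorator requesting the CUDA backend.
        result = []
        for x in arr:
            result.append(mod_list_nb(x, k))
        return result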

    Anyway, the above solutions are all slowed down by the choice of the underlying data container.

    Using fixed-size arrays instead of dynamically growing a list at each iteration results in much faster execution:

    def remainders_fixed_np(arr, k, m):
        arr = arr.copy()  # work on a copy, arr is modified in-place below
        n = len(arr)
        result = np.empty((n, m), dtype=np.int_)
        for i in range(m - 1, -1, -1):
            result[:, i] = arr % k  # current least-significant digit
            arr //= k
        return result
    
    
    print(remainders_fixed_np(decimal_numbers, elements_len, 3).T)
    # [[ 0  0  0  0  0  0  0  0  0  1]
    #  [ 0  0  1  0  0  0  1  6  0  3]
    #  [15 18  7 11  7  5 20 13  6  3]]
    

    or, with Numba acceleration (and avoiding unnecessary computation):

    @nb.njit
    def remainders_fixed_nb(arr, k, m):
        n = len(arr)
        result = np.zeros((n, m), dtype=np.int_)
        for i in range(n):
            j = m - 1
            x = arr[i]
            while x >= k:
                q, r = divmod(x, k)
                result[i, j] = r
                x = q
                j -= 1
            result[i, j] = x
        return result
    
    
    print(remainders_fixed_nb(decimal_numbers, elements_len, 3).T)
    # [[ 0  0  0  0  0  0  0  0  0  1]
    #  [ 0  0  1  0  0  0  1  6  0  3]
    #  [15 18  7 11  7  5 20 13  6  3]]
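
    Note that the fixed-size versions return m digits per number, padded with leading zeros, instead of the ragged lists of the original output. If that format is needed, the padding can be stripped afterwards; a minimal sketch (the helper to_ragged() is hypothetical, not part of the original answer):

    def to_ragged(fixed):
        # Hypothetical post-processing: strip the leading zero padding
        # from each row to recover the OP's ragged-list output.
        out = []
        for row in fixed:
            nz = np.flatnonzero(row)
            start = nz[0] if len(nz) else len(row) - 1  # keep one 0 for x == 0
            out.append([int(v) for v in row[start:]])
        return out


    print(to_ragged(remainders_fixed_nb(decimal_numbers, elements_len, 3)))
    # [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]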
    

    Some Benchmarks

    Some benchmarks run on Google Colab give indicative timings, where:

    • the _nb suffix indicates Numba acceleration
    • the _pnb suffix indicates Numba acceleration with parallel=True and the outermost range() replaced with nb.prange() (see the sketch after this list)
    • the _cunb suffix indicates Numba acceleration with the CUDA target (target_backend="cuda")
    • the _cupnb suffix indicates Numba acceleration with both parallelization and the CUDA target
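
    For instance, a minimal sketch of the _pnb variant, assuming the same body as remainders_fixed_nb() with the outer loop replaced by nb.prange():

    @nb.njit(parallel=True)
    def remainders_fixed_pnb(arr, k, m):
        # Same body as remainders_fixed_nb(); only the decorator and
        # the outermost loop change.
        n = len(arr)
        result = np.zeros((n, m), dtype=np.int_)
        for i in nb.prange(n):  # parallel loop over the input values
            j = m - 1
            x = arr[i]
            while x >= k:
                q, r = divmod(x, k)
                result[i, j] = r
                x = q
                j -= 1
            result[i, j] = x
        return result

    # The CUDA-target variants are assumed to be declared analogously:
    # @nb.njit(target_backend="cuda")                 for remainders_fixed_cunb
    # @nb.njit(parallel=True, target_backend="cuda")  for remainders_fixed_cupnb

    The benchmark setup and driver: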

    k = elements_len  # 21, as defined in the question
    m = 4
    n = 100000
    arr = np.random.randint(1, k ** m - 1, n)
    
    funcs = remainders_OP, remainders, remainders_nb, remainders_cunb
    base = funcs[0](arr, k)
    for func in funcs:
        res = func(arr, k)
        is_good = base == res
        print(f"{func.__name__:>16s}  {is_good!s:>5s}  ", end="")
        %timeit -n 4 -r 4 func(arr, k)
    #    remainders_OP   True  333 ms ± 4.38 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
    #       remainders   True  268 ms ± 5.11 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
    #    remainders_nb   True  46.9 ms ± 3.16 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
    #  remainders_cunb   True  46.4 ms ± 1.71 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
    
    
    fixed_funcs = remainders_fixed_np, remainders_fixed_nb, remainders_fixed_pnb, remainders_fixed_cunb, remainders_fixed_cupnb
    base = fixed_funcs[0](arr, k, m)
    for func in fixed_funcs:
        res = func(arr, k, m)
        is_good = np.all(base == res)
        print(f"{func.__name__:>24s}  {is_good!s:>5s}  ", end="")
        %timeit -n 8 -r 8 func(arr, k, m)
    #      remainders_fixed_np   True  10 ms ± 2.09 ms per loop (mean ± std. dev. of 8 runs, 8 loops each)
    #      remainders_fixed_nb   True  3.6 ms ± 315 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
    #     remainders_fixed_pnb   True  2.68 ms ± 550 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
    #    remainders_fixed_cunb   True  3.49 ms ± 192 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
    #   remainders_fixed_cupnb   True  2.63 ms ± 314 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
    

    This indicates that running on the GPU has minimal effect. The greatest speed-up is obtained by switching the data container to a pre-allocated array. Numba provides an additional speed-up both for the dynamically allocated and for the pre-allocated versions.