Search code examples
pythonperformanceparallel-processingnumpy-ndarray

How can i speed up this python code with parallel processing?


I'm doing a lot of processing on a "big" (80,000 elements) numpy array "x_a"

order = np.arange(1, 101)
rho = [spctr.aryule(x_a, i, norm='biased')[1] for i in order]

(numpy imported as np, spectrum imported as spctr), I'm doing a bunch of independent calls to a function that takes as input the array (and doesn't modify it) and getting the result for different values of i and storing them in the rho list.

in this example it takes 2m 20s, but as the upper bound of order increases it shoots up pretty fast (1h+ for 500) so I'm trying to speed things up with parallel processing.

following some guides it seems multiprocessing is the best option as threading is limited by the global interpreter lock, so i tried this:

order = np.arange(1, 101)
def f1(i):
    return spctr.aryule(x_a, i, norm='biased')[1]
with mp.Pool() as pool:
    rho = pool.map(f1, order)

(multiprocessing imported as mp) This was way worse, i killed the process after 10 minutes, it seems the pickling of too much data is too blame, so next i tried importing ThreadPool from multiprocessing.Pool and doing this:

order = np.arange(1, 101)
def f1(i):
    return spctr.aryule(x_a, i, norm='biased')[1]
with ThreadPool() as pool:
    rho = pool.map(f1, order)

going back to a more reasonable 2m and 30s of execution time, however, that's no improvement.

I'm very unfamiliar on the ways of parallel processing with python... please help.


Solution

  • Using multiprocessing here is not a good idea here. In fact, aryule is very inefficient. Parallelizing an inefficient implementation simply waste more resources. Moreover, the way aryule works internally will cause the parallelization not to be much faster. We can actually write a drastically faster implementation running sequentially.


    Analysis

    in this example it takes 2m 20s, but as the upper bound of order increases it shoots up pretty fast (1h+ for 500) so I'm trying to speed things up with parallel processing.

    This is because aryule compute a correlation running in O(order * len(X)). The bigger the order, the slower the execution. Testing all orders cause the overall algorithm to run at least in O(order^2 * len(X)) time which is pretty bad regarding the actual inputs. The first thing to do is to improve the complexity of the algorithm. This is possible since the correlation used in aryule already iterate over all the order from 0 to order so your code just recompute the same work over and over. This is not trivial to know though since it requires to look the source of the package available here and understand what would the code do in your specific case.

    Since the correlation iterate from 0 to order and you do it too, the first iteration are much faster than the last one. This cause a work imbalance which is detrimental to the parallelization since the time is dominated by the slowest iterations. In the worst case, the execution can be up to twice slower than what it could be (due to only half the thread actually working).

    To understand how to write an efficient code, we need to understand how inefficient it actually is in the first place. The correlation seems to be the main bottleneck here as it takes >99% of the time on your example. Here is an equivalent simplified implementation:

    import spectrum
    import numpy as np
    
    def correlation(x, order):
        x = np.array(x)
        N = len(x)
        assert order < N, 'lag must be less than len(x)'
    
        r = np.zeros(order, dtype=float)
    
        for k in range(0, order+1):
            nk = N - k - 1
            sum = 0
            for j in range(0, nk+1):
                sum = sum + x[j+k] * x[j]
            if k == 0:
                r0 = sum / float(N)
            else:
                r[k-1] = sum / float(N)
    
        r = np.insert(r, 0, r0)
        return r
    
    def aryule(X, order):
        r = correlation(X, order)
        A, P, k = spectrum.LEVINSON(r, allow_singularity=True)
        return A, P, k
    
    order = np.arange(1, 101)
    rho = [spctr.aryule(x_a, i)[1] for i in order]
    

    As we can see, the main hot loop of correlation is written in pure-Python and it is not vectorized! This is very bad in term of performance. np.array(x) performs a copy of the array while it is not needed here. Some expression like 1 / float(N) are recomputed many times for no reason. np.insert performs a copy of the array to add one item while this is not needed: the array can be created directly with the right size in the first place.


    Faster implementation (using vectorization)

    Here is a much faster implementation:

    def correlation(x, order):
        assert order < x.size, 'lag must be less than len(x)'
    
        invN = 1.0 / x.size
        r = np.zeros(order+1, dtype=float)
        r[0] = np.sum(x * x) * invN
    
        for k in range(1, order+1):
            r[k] = np.sum(x[k:] * x[0:x.size-k]) * invN
    
        return r
    
    def aryule(X, order):
        r = correlation(X, order)
        A, P, k = spectrum.LEVINSON(r, allow_singularity=True)
        return A, P, k
    
    order = np.arange(1, 101)
    rho = [spctr.aryule(x_a, i)[1] for i in order]
    

    The code is not only faster by several order of magnitude, it is also simpler. I also expect the code to be even more accurate since the reference sum was quite numerically unstable!


    Better algorithmic complexity

    Moreover, we can just compute the correlation once and compute the LEVINSON operation on the sub-arrays so to strongly improve the complexity of the algorithm and so its execution time. Here is the final implementation:

    # [...] Same code as above
    
    def multiAryuleP(X, orders):
        maxOrder = np.max(orders)
        r = correlation(X, maxOrder)
        rho = np.empty(orders.size)
        for i in range(orders.size):
            rho[i] = spectrum.LEVINSON(r[:orders[i]+1], allow_singularity=True)[1]
        return rho
    
    order = np.arange(1, 101)
    rho = multiAryuleP(X, order)
    

    Note the overall algorithm runs in O(order*len(X) + order^3) since the complexity of LEVINSON is O(order^2).


    Even faster implementation (using Numba)

    It turns out that the implementation of spectrum.LEVINSON is also inefficient. That being said, this one is not trivial to implement efficiently using Numpy. One solution is to use Numba which is a JIT compiler generating efficient code from Numpy-based code having pure-Python loops. Cython can also do the job very well. Here is an example:

    import numba as nb
    
    # [...] Same code as above
    
    @nb.njit('(float64[::1],)', fastmath=True)
    def levinsonP(r):
        A = np.zeros(r.size - 1)
        P = r[0]
    
        if r.size > 1:
            save = r[1]
            temp = -save / P
            P *= 1.0 - temp * temp
            A[0] = temp
    
        for k in range(1, r.size - 1):
            save = r[k+1]
            for j in range(0, k):
                save += A[j] * r[k-j]
            temp = -save / P
            P *= 1.0 - temp * temp
            A[k] = temp
    
            khalf = (k+1) // 2
            for j in range(0, khalf):
                kj = k-j-1
                save = A[j]
                A[j] = save + temp * A[kj]
                if j != kj:
                    A[kj] += temp*save
    
        return P
    
    def multiAryuleP(X, orders):
        maxOrder = np.max(orders)
        r = correlation(X, maxOrder)
        rho = np.empty(orders.size)
        for i in range(orders.size):
            rho[i] = levinsonP(r[:orders[i]+1])
        return rho
    
    order = np.arange(1, 101)
    rho = multiAryuleP(X, order)
    

    Performance results

    Here are the results on my machine (with a i5-9600KF processor):

    Initial implementation:                     74000   ms
    Vectorized implementation:                    300   ms
    Better complexity + vectorization:             75   ms
    Better complexity + vectorization + Numba:      4.5 ms
    Optimal implementation:                       < 1.0 ms
    

    The final implementation is about 16400 times faster and it is still sequential! Not to mention is even faster with bigger orders (>40000 times faster for 500).


    Possible further improvements

    If this is not enough, not that the complexity of multiAryuleP can certainly be improved. Indeed, the re-computation of some expressions in levinsonP can be avoided so this part can be O(order) times faster. Moreover, one can certainly use a FFT so to compute the correlation significantly faster. Alternatively, the correlation can be optimized further using Numba and using a tile-based approach so the implementation can be more cache-friendly.