Tags: python, parallel-processing, jit, numba

numba @jit slower than pure python?


So I need to improve the execution time of a script I have been working on. I started working with the numba jit decorator to try parallel computing, but it throws:

KeyError: "Does not support option: 'parallel'"

So I decided to test nogil=True to see if it unlocks the full capabilities of my CPU, but it was slower than pure Python. I don't understand why this happened, and if someone can help me or guide me I would be very grateful:

import numpy as np
from numba import *
@jit(['float64[:,:],float64[:,:]'],'(n,m),(n,m)->(n,m)',nogil=True)
def asd(x,y):
    return x+y
u=np.random.random(100)
w=np.random.random(100)

%timeit asd(u,w)
%timeit u+w

10000 loops, best of 3: 137 µs per loop
The slowest run took 7.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.75 µs per loop


Solution

  • You cannot expect numba to outperform numpy on such a simple vectorized operation. Your comparison also isn't exactly fair, since the call to the numba function carries Python function-call overhead that the bare u + w does not. If you sum a larger array, you'll see that the performance of the two converges, and what you are seeing here is just overhead on a very fast operation:

    import numpy as np
    import numba as nb

    # JIT-compiled element-wise sum
    @nb.njit
    def asd(x, y):
        return x + y

    # Plain numpy version for comparison
    def asd2(x, y):
        return x + y

    u = np.random.random(10000)
    w = np.random.random(10000)

    %timeit asd(u, w)
    %timeit asd2(u, w)
    
    The slowest run took 17796.43 times longer than the fastest. This could mean 
    that an intermediate result is being cached.
    100000 loops, best of 3: 6.06 µs per loop
    
    The slowest run took 29.94 times longer than the fastest. This could mean that 
    an intermediate result is being cached.
    100000 loops, best of 3: 5.11 µs per loop
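
    Note that the huge slowest/fastest ratio reported for the numba version is mostly the one-time JIT compilation triggered by the first call, not a property of the compiled code. A minimal sketch of a fairer measurement with the same asd/asd2, warming the function up before timing:

    # Call once first so compilation happens outside the timed runs
    asd(u, w)

    %timeit asd(u, w)    # now times only the compiled call
    %timeit asd2(u, w)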
    

    As for parallel functionality: for this simple operation, you can use nb.vectorize with target='parallel':

    # Parallel element-wise sum compiled as a ufunc
    @nb.vectorize([nb.float64(nb.float64, nb.float64)], target='parallel')
    def asd3(x, y):
        return x + y

    u = np.random.random((100000, 10))
    w = np.random.random((100000, 10))

    %timeit asd(u, w)
    %timeit asd2(u, w)
    %timeit asd3(u, w)
    

    But again, if you operate on small arrays, you are going to be seeing the overhead of thread dispatch. For the array sizes above, the parallel version gives me about a 2x speedup.
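
    As for the original KeyError: @jit only accepts parallel=True in newer numba releases (0.34 and later), so an older install rejects the option outright. On a recent version, the usual route for explicit parallel loops is @njit(parallel=True) together with nb.prange. A hedged sketch, assuming 1-D float64 arrays and a numba new enough to support it (asd4 is just an illustrative name):

    # Explicit parallel loop; requires numba >= 0.34 for parallel=True
    @nb.njit(parallel=True)
    def asd4(x, y):
        out = np.empty_like(x)
        for i in nb.prange(x.shape[0]):  # prange distributes iterations across threads
            out[i] = x[i] + y[i]
        return out

    u = np.random.random(1_000_000)
    w = np.random.random(1_000_000)
    asd4(u, w)  # first call compiles; time subsequent calls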

    Where numba really shines is in operations that are hard to express with numpy broadcasting, or in computations that would otherwise allocate a lot of temporary intermediate arrays.
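
    For example, a reduction like sum((x - y)**2) in plain numpy allocates temporaries for x - y and its square before reducing, while a jitted loop fuses everything into a single pass over the data. A minimal sketch of the kind of code where numba tends to win (sq_diff_sum is an illustrative name):

    # One fused pass, no temporary arrays; numpy would allocate two here
    @nb.njit
    def sq_diff_sum(x, y):
        total = 0.0
        for i in range(x.shape[0]):
            d = x[i] - y[i]
            total += d * d
        return total

    u = np.random.random(1_000_000)
    w = np.random.random(1_000_000)
    sq_diff_sum(u, w)  # warm-up compile, then compare with np.sum((u - w) ** 2)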