Tags: python, performance, cpython, numba

Multiply function slower in Numba than CPython


I wrote the following code in Python:

from numba import *

def mul(a, b):
    return a * b

@jit
def numba_mul(a, b):
    return a * b


@jit(int_(int_, int_))
def numba_mul2(a, b):
    return a * b

and got the following results:

In [3]: %timeit mul(10, 10)
10000000 loops, best of 3: 124 ns per loop

In [4]: %timeit numba_mul(10, 10)
1000000 loops, best of 3: 271 ns per loop

In [5]: %timeit numba_mul2(10, 10)
1000000 loops, best of 3: 263 ns per loop

Why is CPython more than 2x as fast as Numba? This is Python 2.7.7 on OS X 10.9.3 with LLVM 3.2.

If it helps, the LLVM dump (obtained with numba --annotate --dump-llvm main.py) is below:

----------------LLVM DUMP <function descriptor 'numba_mul2$30'>-----------------
; ModuleID = 'module.numba_mul2$30'

define i32 @numba_mul2.int64.int64(i64*, i64 %arg.a, i64 %arg.b) {
entry:
  %a = alloca i64
  store i64 %arg.a, i64* %a
  %b = alloca i64
  store i64 %arg.b, i64* %b
  %a.1 = alloca i64
  %b.1 = alloca i64
  %"$0.1" = alloca i64
  br label %B0

B0:                                               ; preds = %entry
  %1 = load i64* %a
  store i64 %1, i64* %a.1
  %2 = load i64* %b
  store i64 %2, i64* %b.1
  %3 = load i64* %a.1
  %4 = load i64* %b.1
  %5 = mul i64 %3, %4
  store i64 %5, i64* %"$0.1"
  %6 = load i64* %"$0.1"
  store i64 %6, i64* %0
  ret i32 0
}

!python.module = !{!0}

!0 = metadata !{metadata !"__main__"}

================================================================================
-----------------------------------ANNOTATION-----------------------------------
# File: main.py
# --- LINE 30 --- 

@jit(int_(int_, int_))

# --- LINE 31 --- 

def numba_mul2(a, b):

    # --- LINE 32 --- 
    # label 0
    #   a.1 = a  :: int64
    #   b.1 = b  :: int64
    #   $0.1 = a.1 * b.1  :: int64
    #   return $0.1

    return a * b


================================================================================

Solution

  • Your test is too small to yield meaningful results; the work per call is so tiny that you're mostly timing function-call overhead. If you just time an empty function:

    def f():
        pass
    %timeit f()
    

    You'll likely get a time that's a significant fraction of your measured runtimes; on my machine the empty call takes a little more than half as long as your pure-Python mul call.
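
    As a rough cross-check (a minimal sketch, not part of the original answer), the same baseline can be measured with the stdlib timeit module instead of IPython's %timeit:

    import timeit

    n = 1000000

    # Per-call cost of an empty Python function: pure call overhead.
    empty = timeit.timeit("f()", setup="def f(): pass", number=n) / n

    # Per-call cost of the pure-Python mul: one multiply plus the same overhead.
    mul_t = timeit.timeit("mul(10, 10)",
                          setup="def mul(a, b): return a * b",
                          number=n) / n

    print("empty call: %.0f ns" % (empty * 1e9))
    print("mul(10,10): %.0f ns" % (mul_t * 1e9))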

    Additionally, on every call Numba has to look up which compiled specialization to dispatch to based on your argument types, unbox your Python integers into native int32/int64 values, and then box the native result back into a new Python object to return.
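
    As a side note (a sketch that assumes a reasonably recent Numba; the exact types shown can differ by version), you can watch the dispatcher collect one compiled specialization per argument-type combination, which is what that per-call lookup selects between:

    from numba import jit

    @jit
    def numba_mul(a, b):
        return a * b

    numba_mul(10, 10)      # first call compiles an (int64, int64) specialization
    numba_mul(10.0, 10.0)  # calling with floats adds a (float64, float64) one

    # The dispatcher keeps one entry per compiled type signature.
    print(numba_mul.signatures)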

    Try testing again with the box/unbox overhead amortized over many multiplications per call:

    from numba import *

    # Each call now performs 1000 multiplications, so the per-call dispatch
    # and box/unbox cost is amortized instead of dominating the measurement.
    def mul(a, b):
        for i in range(1000):
            a * b

    @jit
    def numba_mul(a, b):
        for i in range(1000):
            a * b

    @jit(int_(int_, int_))
    def numba_mul2(a, b):
        for i in range(1000):
            a * b
        return a * b

    I get 81.1 µs, 330 ns, and 322 ns respectively on my machine for this test.
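
    To put those numbers in perspective, here is a minimal sketch (not from the original answer; the function names are made up) of the kind of workload where the compiled version clearly wins, because the whole loop runs as machine code and the dispatch/boxing cost is paid only once per call:

    from numba import jit

    def py_loop(a, b, n):
        total = 0
        for i in range(n):
            total += a * b
        return total

    @jit(nopython=True)
    def nb_loop(a, b, n):
        total = 0
        for i in range(n):
            total += a * b
        return total

    nb_loop(10, 10, 1)  # call once first so compilation isn't included in the timing

    # In IPython:
    #   %timeit py_loop(10, 10, 10**6)
    #   %timeit nb_loop(10, 10, 10**6)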

    EDIT: I was curious about the overhead of empty function calls in Numba, so I added the following tests:

    def empty(a, b):
        pass
    
    @jit
    def numba_empty(a, b):
        pass
    
    @jit
    def numba_empty2(a, b):
        numba_empty(a, b)
    

    I get 155 ns, 333 ns, and 348 ns for this test. It seems the Numba-to-Numba call overhead is very small.
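
    For completeness, here is a sketch (hypothetical names, and it assumes this code compiles in nopython mode) that probes the Numba-to-Numba call cost more directly: the call happens inside a compiled loop with no Python-level boxing in between, so any difference against the inlined baseline reflects the cross-call cost. The compiler may well inline the call entirely, which would be consistent with the tiny overhead measured above.

    from numba import jit

    @jit(nopython=True)
    def nb_mul(a, b):
        return a * b

    @jit(nopython=True)
    def call_loop(a, b, n):
        # Calls another jitted function n times from compiled code.
        total = 0
        for i in range(n):
            total += nb_mul(a, b)
        return total

    @jit(nopython=True)
    def inline_loop(a, b, n):
        # Same loop with the multiply written inline, as a baseline.
        total = 0
        for i in range(n):
            total += a * b
        return total

    call_loop(10, 10, 1)    # warm up / compile both
    inline_loop(10, 10, 1)

    # In IPython:
    #   %timeit call_loop(10, 10, 10**6)
    #   %timeit inline_loop(10, 10, 10**6)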