I'm trying to understand why my parallelized numba function is acting the way it does. In particular, why it is so sensitive to how arrays are being used.
I have the following function:
from numba import njit, prange
from numpy import sqrt, zeros

@njit(parallel=True)
def f(n):
    g = lambda i, j: zeros(3) + sqrt(i * j)
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i] += tmp
    return x
Assume that n is large enough for parallel computing to be useful. For some reason, this actually runs faster with fewer cores. Now I make a small change (x[i] -> x[i, :]):
@njit(parallel=True)
def f(n):
    g = lambda i, j: zeros(3) + sqrt(i * j)
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i, :] += tmp
    return x
The performance is significantly better, and it scales properly with the number of cores (i.e. more cores is faster). Why does slicing make the performance better? To go even further, another change that makes a big difference is turning the lambda function into an external njit function.
@njit
def g(i, j):
    x = zeros(3) + sqrt(i * j)
    return x

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i, :] += tmp
    return x
This again ruins the performance and scaling, reverting to runtimes equal to or slower than the first case. Why does this external function ruin the performance? The performance can be recovered with either of the two options shown below.
@njit
def g(i, j):
    x = sqrt(i * j)
    return x

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = zeros(3) + g(i, j)
            x[i, :] += tmp
    return x
@njit(parallel=True)
def f(n):
    def g(i, j):
        x = zeros(3) + sqrt(i * j)
        return x

    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i, :] += tmp
    return x
Why is the parallel=True numba-decorated function so sensitive to how arrays are being used? I know arrays are not trivially parallelizable, but the exact reason each of these changes dramatically affects performance isn't obvious to me.
TL;DR: allocations and inlining are certainly the source of the performance gap between the different versions.
Operating on a Numpy array is generally a bit more expensive than operating on a view in Numba. In this case, the problem appears to be that Numba performs an allocation when using x[i] while it does not with x[i, :]. The thing is, allocations are expensive, especially in parallel code, since allocators tend not to scale (due to internal locks or atomic variables serializing the execution). I am not sure this is a missed optimization, since x[i] and x[i, :] might have slightly different behaviour.
In addition, Numba uses a JIT compiler (llvmlite) which performs aggressive optimizations. LLVM is able to track allocations so as to remove them in simple cases (like a function allocating memory and freeing it just after, in the same scope, without side effects). The thing is, Numba allocations call an external function that the compiler cannot optimize away, as it does not know its content at compile time (due to the way the Numba runtime interface currently works) and the function could theoretically have side effects.
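If you want to check what the compiler actually generated on your own machine, here is a minimal sketch of one way to do it (this is my suggestion, not necessarily how the assembly below was obtained; inspect_asm() and the NUMBA_DUMP_ASSEMBLY environment variable are standard Numba inspection tools):

from numba import njit, prange
from numpy import sqrt, zeros

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            x[i, :] += zeros(3) + sqrt(i * j)
    return x

f(100)  # call once so the function is compiled for a concrete signature

# Print the assembly of the compiled dispatcher. Note that with parallel=True
# the loop body is compiled into a separate "_numba_parfor_gufunc_..." function,
# so running Python with the NUMBA_DUMP_ASSEMBLY=1 environment variable set is
# a more exhaustive way to see everything.
for sig, asm in f.inspect_asm().items():
    print(sig)
    print(asm)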
To show what is happening, we need to delve into the assembly code. Overall, Numba generates a function for f that calls a xxx_numba_parfor_gufunc_xxx function in N threads. This latter function executes the body of the parallel loop. The caller function is the same for both implementations; the main computing function differs between the two versions. Here is the assembly code on my machine:
----- WITHOUT VIEWS -----
_ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE:
.cfi_startproc
pushq %r15
.cfi_def_cfa_offset 16
pushq %r14
.cfi_def_cfa_offset 24
pushq %r13
.cfi_def_cfa_offset 32
pushq %r12
.cfi_def_cfa_offset 40
pushq %rsi
.cfi_def_cfa_offset 48
pushq %rdi
.cfi_def_cfa_offset 56
pushq %rbp
.cfi_def_cfa_offset 64
pushq %rbx
.cfi_def_cfa_offset 72
subq $280, %rsp
vmovaps %xmm6, 256(%rsp)
.cfi_def_cfa_offset 352
.cfi_offset %rbx, -72
.cfi_offset %rbp, -64
.cfi_offset %rdi, -56
.cfi_offset %rsi, -48
.cfi_offset %r12, -40
.cfi_offset %r13, -32
.cfi_offset %r14, -24
.cfi_offset %r15, -16
.cfi_offset %xmm6, -96
movq %rdx, 160(%rsp)
movq %rcx, 200(%rsp)
movq 504(%rsp), %r14
movq 488(%rsp), %r15
leaq -1(%r15), %rax
imulq %r14, %rax
xorl %ebp, %ebp
testq %rax, %rax
movq %rax, %rdx
cmovnsq %rbp, %rdx
cmpq $1, %r15
cmovbq %rbp, %rdx
movq %rdx, 240(%rsp)
movq %rax, %rdx
sarq $63, %rdx
andnq %rax, %rdx, %rax
addq 464(%rsp), %rax
movq %r15, %rbx
subq $1, %rbx
movq 440(%rsp), %rcx
movq 400(%rsp), %rsi
movabsq $NRT_incref, %rdx
cmovbq %rbp, %rax
movq %rax, 232(%rsp)
callq *%rdx
movq (%rsi), %rbp
movq 8(%rsi), %rdi
subq %rbp, %rdi
incq %rdi
movabsq $NRT_MemInfo_alloc_safe_aligned, %rsi
movl $24, %ecx
movl $32, %edx
callq *%rsi
movq %rax, 192(%rsp)
movq 24(%rax), %rax
movq %rax, 120(%rsp)
movl $24, %ecx
movl $32, %edx
callq *%rsi
movq %rax, 64(%rsp)
testq %rdi, %rdi
jle .LBB6_48
movq %rdi, %r11
movq %rbp, %r8
movq %rbx, %r10
movq %r15, %r9
movq 432(%rsp), %rdx
movq 472(%rsp), %rdi
movq %r15, %rax
imulq 464(%rsp), %rax
movq %rax, 208(%rsp)
xorl %eax, %eax
testq %rdx, %rdx
setg %al
movq %rdx, %rcx
sarq $63, %rcx
andnq %rdx, %rcx, %rcx
subq %rax, %rcx
movq %rcx, 224(%rsp)
leaq -4(%r15), %rax
movq %rax, 184(%rsp)
shrq $2, %rax
incq %rax
andl $7, %r15d
movq %r9, %r13
andq $-8, %r13
movq %r9, %rcx
andq $-4, %rcx
movq %rcx, 176(%rsp)
movl %eax, %ecx
andl $7, %ecx
movq %rbp, %rdx
imulq %r9, %rdx
movq %rcx, 168(%rsp)
shlq $5, %rcx
movq %rcx, 152(%rsp)
andq $-8, %rax
addq $-8, %rax
movq %rax, 144(%rsp)
movq %rax, %rcx
shrq $3, %rcx
incq %rcx
movq %rcx, %rax
movq %rcx, 136(%rsp)
andq $-2, %rcx
movq %rcx, 128(%rsp)
vxorps %xmm6, %xmm6, %xmm6
movq 64(%rsp), %rax
movq 24(%rax), %rax
movq %rax, 248(%rsp)
leaq 56(%rdi,%rdx,8), %rsi
leaq 224(%rdi,%rdx,8), %rcx
leaq (,%r9,8), %rax
movq %rax, 88(%rsp)
leaq (%rdi,%rdx,8), %rax
addq $480, %rax
movq %rax, 80(%rsp)
xorl %eax, %eax
movq %rax, 96(%rsp)
movq %rdx, 216(%rsp)
movq %rdx, 112(%rsp)
movq %rbx, 56(%rsp)
jmp .LBB6_3
.p2align 4, 0x90
.LBB6_2:
leaq -1(%r11), %rax
incq %r8
addq %r9, 112(%rsp)
movq 104(%rsp), %rcx
leaq (%rcx,%r9,8), %rcx
incq 96(%rsp)
movq 88(%rsp), %rdx
addq %rdx, %rsi
addq %rdx, 80(%rsp)
cmpq $2, %r11
movq %rax, %r11
jl .LBB6_48
.LBB6_3:
movq %rcx, 104(%rsp)
movq %r8, %rax
imulq %r9, %rax
movq 472(%rsp), %rdi
leaq (%rdi,%rax,8), %rbp
movq 240(%rsp), %rax
addq %rbp, %rax
movq 232(%rsp), %rcx
addq %rbp, %rcx
movq %r8, %rdx
imulq 496(%rsp), %rdx
movq 464(%rsp), %rbx
addq %rdx, %rbx
testq %r9, %r9
cmoveq %r9, %rdx
cmoveq %r9, %rbx
addq %rdi, %rdx
addq %rdi, %rbx
cmpq %rbx, %rax
setb 39(%rsp)
cmpq %rcx, %rdx
setb %al
cmpq $0, 432(%rsp)
jle .LBB6_2
cmpq 424(%rsp), %r9
jne .LBB6_46
movq 96(%rsp), %rcx
imulq %r9, %rcx
addq 216(%rsp), %rcx
andb %al, 39(%rsp)
movq 472(%rsp), %rax
leaq (%rax,%rcx,8), %rax
movq %rax, 72(%rsp)
movl $1, %eax
movq 224(%rsp), %rbx
xorl %r12d, %r12d
.p2align 4, 0x90
.LBB6_6:
imulq %r8, %r12
vcvtsi2sd %r12, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
movq 120(%rsp), %rcx
vmovups %xmm6, (%rcx)
movq $0, 16(%rcx)
movq 248(%rsp), %rdx
vmovsd %xmm0, (%rdx)
vaddsd (%rbp), %xmm0, %xmm1
vmovsd %xmm1, (%rbp)
vaddsd 8(%rcx), %xmm0, %xmm1
vmovsd %xmm1, 8(%rdx)
vaddsd 8(%rbp), %xmm1, %xmm1
vmovsd %xmm1, 8(%rbp)
vaddsd 16(%rcx), %xmm0, %xmm0
vmovsd %xmm0, 16(%rdx)
movq %rax, %r12
vaddsd 16(%rbp), %xmm0, %xmm0
vmovsd %xmm0, 16(%rbp)
cmpb $0, 39(%rsp)
jne .LBB6_7
testq %r9, %r9
jle .LBB6_28
cmpq $7, %r10
jae .LBB6_19
xorl %eax, %eax
movq %rbp, %rdi
testq %r15, %r15
jne .LBB6_23
jmp .LBB6_26
.p2align 4, 0x90
.LBB6_19:
movq %rbp, %rcx
xorl %eax, %eax
.p2align 4, 0x90
.LBB6_20:
movq (%rcx), %rdx
movq %rdx, -56(%rsi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, -48(%rsi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, -40(%rsi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, -32(%rsi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, -24(%rsi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, -16(%rsi,%rax,8)
leaq (%r14,%rdx), %rdi
movq (%r14,%rdx), %rcx
movq %rcx, -8(%rsi,%rax,8)
leaq (%r14,%rdi), %rcx
movq (%r14,%rdi), %rdx
movq %rdx, (%rsi,%rax,8)
addq $8, %rax
addq %r14, %rcx
cmpq %rax, %r13
jne .LBB6_20
movq %r13, %rax
movq %rbp, %rdi
testq %r15, %r15
je .LBB6_26
.LBB6_23:
movq 112(%rsp), %rcx
addq %rax, %rcx
imulq %r14, %rax
addq %rbp, %rax
movq 472(%rsp), %rdx
leaq (%rdx,%rcx,8), %rcx
xorl %edx, %edx
.p2align 4, 0x90
.LBB6_24:
movq (%rax), %rdi
movq %rdi, (%rcx,%rdx,8)
incq %rdx
addq %r14, %rax
cmpq %rdx, %r15
jne .LBB6_24
movq %rbp, %rdi
.LBB6_26:
cmpb $0, 39(%rsp)
jne .LBB6_27
.LBB6_28:
xorl %eax, %eax
testq %rbx, %rbx
setg %al
movq %rbx, %rcx
subq %rax, %rcx
addq %r12, %rax
testq %rbx, %rbx
movq %rcx, %rbx
jg .LBB6_6
jmp .LBB6_2
.LBB6_7:
movq %r11, 48(%rsp)
movq %r8, 40(%rsp)
movq 208(%rsp), %rcx
movabsq $NRT_Allocate, %rax
vzeroupper
callq *%rax
movq 488(%rsp), %r9
movq %rax, %rdi
testq %r9, %r9
jle .LBB6_8
movq 56(%rsp), %r10
cmpq $7, %r10
movq 48(%rsp), %r11
jae .LBB6_11
xorl %eax, %eax
testq %r15, %r15
jne .LBB6_15
jmp .LBB6_17
.LBB6_8:
movq 40(%rsp), %r8
movq 48(%rsp), %r11
.LBB6_27:
movq %r8, 40(%rsp)
movq %rdi, %rcx
movq %r11, %rdi
movabsq $NRT_Free, %rax
vzeroupper
callq *%rax
movq %rdi, %r11
movq 40(%rsp), %r8
movq 56(%rsp), %r10
movq 488(%rsp), %r9
jmp .LBB6_28
.LBB6_11:
movq %rbp, %rcx
xorl %eax, %eax
.p2align 4, 0x90
.LBB6_12:
movq (%rcx), %rdx
movq %rdx, (%rdi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, 8(%rdi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, 16(%rdi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, 24(%rdi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, 32(%rdi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, 40(%rdi,%rax,8)
leaq (%r14,%rdx), %r8
movq (%r14,%rdx), %rcx
movq %rcx, 48(%rdi,%rax,8)
leaq (%r14,%r8), %rcx
movq (%r14,%r8), %rdx
movq %rdx, 56(%rdi,%rax,8)
addq $8, %rax
addq %r14, %rcx
cmpq %rax, %r13
jne .LBB6_12
movq %r13, %rax
testq %r15, %r15
je .LBB6_17
.LBB6_15:
leaq (%rdi,%rax,8), %r8
imulq %r14, %rax
addq %rbp, %rax
xorl %edx, %edx
.p2align 4, 0x90
.LBB6_16:
movq (%rax), %rcx
movq %rcx, (%r8,%rdx,8)
incq %rdx
addq %r14, %rax
cmpq %rdx, %r15
jne .LBB6_16
.LBB6_17:
testq %r9, %r9
jle .LBB6_18
cmpq $3, %r9
movq 40(%rsp), %r8
ja .LBB6_32
xorl %eax, %eax
jmp .LBB6_31
.LBB6_32:
cmpq $28, 184(%rsp)
jae .LBB6_34
xorl %eax, %eax
jmp .LBB6_40
.LBB6_34:
cmpq $0, 144(%rsp)
je .LBB6_35
movq 128(%rsp), %rcx
xorl %eax, %eax
movq 80(%rsp), %rdx
.p2align 4, 0x90
.LBB6_37:
vmovups (%rdi,%rax,8), %ymm0
vmovups %ymm0, -480(%rdx,%rax,8)
vmovups 32(%rdi,%rax,8), %ymm0
vmovups %ymm0, -448(%rdx,%rax,8)
vmovups 64(%rdi,%rax,8), %ymm0
vmovups %ymm0, -416(%rdx,%rax,8)
vmovups 96(%rdi,%rax,8), %ymm0
vmovups %ymm0, -384(%rdx,%rax,8)
vmovups 128(%rdi,%rax,8), %ymm0
vmovups %ymm0, -352(%rdx,%rax,8)
vmovups 160(%rdi,%rax,8), %ymm0
vmovups %ymm0, -320(%rdx,%rax,8)
vmovups 192(%rdi,%rax,8), %ymm0
vmovups %ymm0, -288(%rdx,%rax,8)
vmovups 224(%rdi,%rax,8), %ymm0
vmovups %ymm0, -256(%rdx,%rax,8)
vmovups 256(%rdi,%rax,8), %ymm0
vmovups %ymm0, -224(%rdx,%rax,8)
vmovups 288(%rdi,%rax,8), %ymm0
vmovups %ymm0, -192(%rdx,%rax,8)
vmovups 320(%rdi,%rax,8), %ymm0
vmovups %ymm0, -160(%rdx,%rax,8)
vmovups 352(%rdi,%rax,8), %ymm0
vmovups %ymm0, -128(%rdx,%rax,8)
vmovups 384(%rdi,%rax,8), %ymm0
vmovups %ymm0, -96(%rdx,%rax,8)
vmovups 416(%rdi,%rax,8), %ymm0
vmovups %ymm0, -64(%rdx,%rax,8)
vmovups 448(%rdi,%rax,8), %ymm0
vmovups %ymm0, -32(%rdx,%rax,8)
vmovupd 480(%rdi,%rax,8), %ymm0
vmovupd %ymm0, (%rdx,%rax,8)
addq $64, %rax
addq $-2, %rcx
jne .LBB6_37
testb $1, 136(%rsp)
je .LBB6_40
.LBB6_39:
vmovups (%rdi,%rax,8), %ymm0
movq 104(%rsp), %rcx
vmovups %ymm0, -224(%rcx,%rax,8)
vmovups 32(%rdi,%rax,8), %ymm0
vmovups %ymm0, -192(%rcx,%rax,8)
vmovups 64(%rdi,%rax,8), %ymm0
vmovups %ymm0, -160(%rcx,%rax,8)
vmovups 96(%rdi,%rax,8), %ymm0
vmovups %ymm0, -128(%rcx,%rax,8)
vmovups 128(%rdi,%rax,8), %ymm0
vmovups %ymm0, -96(%rcx,%rax,8)
vmovups 160(%rdi,%rax,8), %ymm0
vmovups %ymm0, -64(%rcx,%rax,8)
vmovups 192(%rdi,%rax,8), %ymm0
vmovups %ymm0, -32(%rcx,%rax,8)
vmovupd 224(%rdi,%rax,8), %ymm0
vmovupd %ymm0, (%rcx,%rax,8)
addq $32, %rax
.LBB6_40:
cmpq $0, 168(%rsp)
je .LBB6_42
movq 72(%rsp), %rcx
leaq (%rcx,%rax,8), %rcx
leaq (%rdi,%rax,8), %rdx
movq 152(%rsp), %r8
movabsq $memcpy, %rax
vzeroupper
callq *%rax
movq 48(%rsp), %r11
movq 40(%rsp), %r8
movq 488(%rsp), %r9
.LBB6_42:
movq 176(%rsp), %rcx
movq %rcx, %rax
cmpq %r9, %rcx
movq 56(%rsp), %r10
je .LBB6_26
.LBB6_31:
movq 72(%rsp), %rcx
leaq (%rcx,%rax,8), %rcx
leaq (%rdi,%rax,8), %rdx
shlq $3, %rax
movq 88(%rsp), %r8
subq %rax, %r8
movabsq $memcpy, %rax
vzeroupper
callq *%rax
movq 48(%rsp), %r11
movq 40(%rsp), %r8
movq 56(%rsp), %r10
movq 488(%rsp), %r9
jmp .LBB6_26
.LBB6_35:
xorl %eax, %eax
testb $1, 136(%rsp)
jne .LBB6_39
jmp .LBB6_40
.LBB6_18:
movq 40(%rsp), %r8
jmp .LBB6_26
.LBB6_48:
movabsq $NRT_decref, %rsi
movq 440(%rsp), %rcx
vzeroupper
callq *%rsi
movq 192(%rsp), %rcx
callq *%rsi
movq 64(%rsp), %rcx
callq *%rsi
movq 200(%rsp), %rax
movq $0, (%rax)
xorl %eax, %eax
jmp .LBB6_47
.LBB6_46:
vxorps %xmm0, %xmm0, %xmm0
movq 120(%rsp), %rax
vmovups %xmm0, (%rax)
movq $0, 16(%rax)
movq 440(%rsp), %rcx
movabsq $NRT_incref, %rax
vzeroupper
callq *%rax
movabsq $.const.picklebuf.2691622873664, %rax
movq 160(%rsp), %rcx
movq %rax, (%rcx)
movl $1, %eax
.LBB6_47:
vmovaps 256(%rsp), %xmm6
addq $280, %rsp
popq %rbx
popq %rbp
popq %rdi
popq %rsi
popq %r12
popq %r13
popq %r14
popq %r15
retq
.Lfunc_end6:
.size _ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE, .Lfunc_end6-_ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE
.cfi_endproc
.weak NRT_incref
.p2align 4, 0x90
.type NRT_incref,@function
NRT_incref:
testq %rcx, %rcx
je .LBB7_1
lock incq (%rcx)
retq
.LBB7_1:
retq
.Lfunc_end7:
----- WITH VIEWS -----
__gufunc__._ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272bef6ed00_2491B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE:
.cfi_startproc
pushq %r15
.cfi_def_cfa_offset 16
pushq %r14
.cfi_def_cfa_offset 24
pushq %r13
.cfi_def_cfa_offset 32
pushq %r12
.cfi_def_cfa_offset 40
pushq %rsi
.cfi_def_cfa_offset 48
pushq %rdi
.cfi_def_cfa_offset 56
pushq %rbp
.cfi_def_cfa_offset 64
pushq %rbx
.cfi_def_cfa_offset 72
subq $168, %rsp
vmovaps %xmm6, 144(%rsp)
.cfi_def_cfa_offset 240
.cfi_offset %rbx, -72
.cfi_offset %rbp, -64
.cfi_offset %rdi, -56
.cfi_offset %rsi, -48
.cfi_offset %r12, -40
.cfi_offset %r13, -32
.cfi_offset %r14, -24
.cfi_offset %r15, -16
.cfi_offset %xmm6, -96
movq (%rdx), %rax
movq 24(%rdx), %r12
movq (%rcx), %rdx
movq %rdx, 120(%rsp)
movq 8(%rcx), %rdx
movq %rdx, 112(%rsp)
movq (%r8), %rdx
movq %rdx, 104(%rsp)
movq 8(%r8), %rdx
movq %rdx, 96(%rsp)
movq 16(%rcx), %rdx
movq %rdx, 88(%rsp)
movq 16(%r8), %rdx
movq %rdx, 80(%rsp)
movq 24(%rcx), %rcx
movq %rcx, 64(%rsp)
movq 24(%r8), %rcx
movq %rcx, 56(%rsp)
movl $0, 36(%rsp)
movq %rax, 72(%rsp)
testq %rax, %rax
jle .LBB5_12
cmpq $4, %r12
movl $3, %eax
cmovlq %r12, %rax
movq %rax, 48(%rsp)
movq %r12, %rbx
sarq $63, %rbx
andq %r12, %rbx
xorl %eax, %eax
vxorps %xmm6, %xmm6, %xmm6
jmp .LBB5_2
.p2align 4, 0x90
.LBB5_9:
movq 136(%rsp), %rcx
movabsq $NRT_decref, %rsi
movq %r9, %rdi
callq *%rsi
movq %rdi, %rcx
callq *%rsi
movq 40(%rsp), %rax
incq %rax
cmpq 72(%rsp), %rax
je .LBB5_12
.LBB5_2:
movq %rax, %rbp
imulq 104(%rsp), %rbp
movq %rax, %rcx
imulq 96(%rsp), %rcx
movq 112(%rsp), %rdx
movq (%rdx,%rcx), %rcx
movq %rcx, 128(%rsp)
movq %rax, 40(%rsp)
movq %rax, %rcx
imulq 80(%rsp), %rcx
movq 88(%rsp), %rdx
movq (%rdx,%rcx), %rdi
movq 120(%rsp), %rcx
movq (%rbp,%rcx), %r13
movq 8(%rbp,%rcx), %r14
subq %r13, %r14
incq %r14
movl $24, %ecx
movl $32, %edx
movabsq $NRT_MemInfo_alloc_safe_aligned, %rsi
callq *%rsi
movq %rax, 136(%rsp)
movq 24(%rax), %r15
movl $24, %ecx
movl $32, %edx
callq *%rsi
movq %rax, %r9
testq %r14, %r14
jle .LBB5_9
xorl %edx, %edx
testq %rdi, %rdi
setg %r8b
testq %rdi, %rdi
jle .LBB5_9
movq 128(%rsp), %rax
cmpq %rax, 48(%rsp)
jne .LBB5_10
movq 40(%rsp), %rax
imulq 56(%rsp), %rax
addq 64(%rsp), %rax
movq 24(%r9), %rcx
movb %r8b, %dl
negq %rdx
leaq (%rdi,%rdx), %rsi
incq %rsi
.p2align 4, 0x90
.LBB5_6:
movq %r13, %rdi
imulq %r12, %rdi
addq %rbx, %rdi
movq %rsi, %rdx
xorl %ebp, %ebp
.p2align 4, 0x90
.LBB5_7:
vcvtsi2sd %rbp, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
vmovups %xmm6, (%r15)
movq $0, 16(%r15)
vmovsd %xmm0, (%rcx)
vaddsd (%rax,%rdi,8), %xmm0, %xmm1
vmovsd %xmm1, (%rax,%rdi,8)
vaddsd 8(%r15), %xmm0, %xmm1
vmovsd %xmm1, 8(%rcx)
vaddsd 8(%rax,%rdi,8), %xmm1, %xmm1
vmovsd %xmm1, 8(%rax,%rdi,8)
vaddsd 16(%r15), %xmm0, %xmm0
vmovsd %xmm0, 16(%rcx)
vaddsd 16(%rax,%rdi,8), %xmm0, %xmm0
vmovsd %xmm0, 16(%rax,%rdi,8)
addq %r13, %rbp
decq %rdx
testq %rdx, %rdx
jg .LBB5_7
leaq -1(%r14), %rdx
incq %r13
cmpq $1, %r14
movq %rdx, %r14
jg .LBB5_6
jmp .LBB5_9
.LBB5_10:
vxorps %xmm0, %xmm0, %xmm0
vmovups %xmm0, (%r15)
movq $0, 16(%r15)
movabsq $numba_gil_ensure, %rax
leaq 36(%rsp), %rcx
callq *%rax
movabsq $PyErr_Clear, %rax
callq *%rax
movabsq $.const.pickledata.2691858029760, %rcx
movabsq $.const.pickledata.2691858029760.sha1, %r8
movabsq $numba_unpickle, %rax
movl $180, %edx
callq *%rax
testq %rax, %rax
je .LBB5_11
movabsq $numba_do_raise, %rdx
movq %rax, %rcx
callq *%rdx
.LBB5_11:
movabsq $numba_gil_release, %rax
leaq 36(%rsp), %rcx
callq *%rax
.LBB5_12:
vmovaps 144(%rsp), %xmm6
addq $168, %rsp
popq %rbx
popq %rbp
popq %rdi
popq %rsi
popq %r12
popq %r13
popq %r14
popq %r15
retq
.Lfunc_end5:
The code of the first version is huge compared to the second one. Overall, we can see that the main computational part is about the same:
----- WITHOUT VIEWS -----
.LBB6_6:
imulq %r8, %r12
vcvtsi2sd %r12, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
movq 120(%rsp), %rcx
vmovups %xmm6, (%rcx)
movq $0, 16(%rcx)
movq 248(%rsp), %rdx
vmovsd %xmm0, (%rdx)
vaddsd (%rbp), %xmm0, %xmm1
vmovsd %xmm1, (%rbp)
vaddsd 8(%rcx), %xmm0, %xmm1
vmovsd %xmm1, 8(%rdx)
vaddsd 8(%rbp), %xmm1, %xmm1
vmovsd %xmm1, 8(%rbp)
vaddsd 16(%rcx), %xmm0, %xmm0
vmovsd %xmm0, 16(%rdx)
movq %rax, %r12
vaddsd 16(%rbp), %xmm0, %xmm0
vmovsd %xmm0, 16(%rbp)
cmpb $0, 39(%rsp)
jne .LBB6_7
----- WITH VIEWS -----
.LBB5_7:
vcvtsi2sd %rbp, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
vmovups %xmm6, (%r15)
movq $0, 16(%r15)
vmovsd %xmm0, (%rcx)
vaddsd (%rax,%rdi,8), %xmm0, %xmm1
vmovsd %xmm1, (%rax,%rdi,8)
vaddsd 8(%r15), %xmm0, %xmm1
vmovsd %xmm1, 8(%rcx)
vaddsd 8(%rax,%rdi,8), %xmm1, %xmm1
vmovsd %xmm1, 8(%rax,%rdi,8)
vaddsd 16(%r15), %xmm0, %xmm0
vmovsd %xmm0, 16(%rcx)
vaddsd 16(%rax,%rdi,8), %xmm0, %xmm0
vmovsd %xmm0, 16(%rax,%rdi,8)
addq %r13, %rbp
decq %rdx
testq %rdx, %rdx
jg .LBB5_7
While the code of the first version is a bit less efficient than the second one, the difference is certainly far from sufficient to explain the huge gap in the timings (~65 ms vs. <0.6 ms).
We can also see that the function calls in the assembly code are different between the two versions:
----- WITHOUT VIEWS -----
memcpy
NRT_Allocate
NRT_Free
NRT_decref
NRT_incref
NRT_MemInfo_alloc_safe_aligned
----- WITH VIEWS -----
numba_do_raise
numba_gil_ensure
numba_gil_release
numba_unpickle
PyErr_Clear
NRT_decref
NRT_MemInfo_alloc_safe_aligned
The NRT_Allocate, NRT_Free, NRT_decref and NRT_incref function calls indicate that the compiled code creates a new Python object in the middle of the hot loop, which is very inefficient. Meanwhile, the second version does not perform any NRT_incref, and I suspect NRT_decref is never actually called (or maybe just once). The second code performs no Numpy array allocations. It looks like the calls to PyErr_Clear, numba_do_raise and numba_unpickle are made to handle exceptions that could possibly be raised (surprisingly, they are not present in the first version, so this is likely related to the use of views). Finally, the call to memcpy in the first version shows that the newly created array is certainly copied back into x. The allocation and the copy make the first version very inefficient.
I am pretty surprised that Numba does not generate allocations for zeros(3) in the fast versions. This is great, but you should really avoid creating arrays in hot loops like this, since there is no guarantee Numba will always optimize such a call away. In fact, it often does not.
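As a minimal sketch of what that can look like here (my own rewrite, with the hypothetical name f_no_alloc): since zeros(3) + sqrt(i*j) just adds the same scalar to every component, the temporary arrays can be removed entirely and the scalar accumulated directly.

from numba import njit, prange
from numpy import sqrt, zeros

@njit(parallel=True)
def f_no_alloc(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            s = sqrt(i * j)      # plain scalar: no temporary array at all
            for k in range(3):   # fixed-size loop instead of zeros(3) + ...
                x[i, k] += s
    return x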
You can use a basic loop to copy all items of a slice so as to avoid any allocation. This is often faster when the size of the slice is known at compile time. Slice copies could in theory be faster since the compiler might vectorize them better, but in practice such plain loops are relatively well auto-vectorized.
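For example (a sketch with a hypothetical helper name; it assumes the row length of 3 is known at compile time, as in your code):

from numba import njit

@njit
def add_row(x, i, tmp):
    # Accumulate a small fixed-size result into row i of x one element at a
    # time instead of x[i, :] += tmp. The trip count (3) is a compile-time
    # constant, so LLVM can fully unroll and vectorize this loop.
    for k in range(3):
        x[i, k] += tmp[k]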
One can note that the vsqrtsd instruction is present in the code of both versions, so the lambda is actually inlined.
When you move the lambda out of the function and put its content in another jitted function, LLVM may not inline it. You can request Numba to inline the function manually, before the generation of the intermediate representation (IR code), so that LLVM should generate similar code. This can be done using the inline="always" flag. This tends to increase the compilation time though (since the code is essentially copy-pasted into the caller function). Inlining is critical for applying many further optimizations (constant propagation, SIMD vectorization, etc.), which can result in a huge performance boost.
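A minimal sketch of that flag applied to the external function from your third snippet (inline="always" is a real Numba option; the rest mirrors your code):

from numba import njit, prange
from numpy import sqrt, zeros

@njit(inline="always")  # force inlining at the Numba IR level, before LLVM sees it
def g(i, j):
    return zeros(3) + sqrt(i * j)

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            x[i, :] += g(i, j)
    return x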