I'm trying to understand why my parallelized numba function is acting the way it does. In particular, why it is so sensitive to how arrays are being used.
I have the following function:
from numba import njit, prange
from numpy import sqrt, zeros

@njit(parallel=True)
def f(n):
    g = lambda i, j: zeros(3) + sqrt(i * j)
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i] += tmp
    return x
Assume that n is large enough for parallel computing to be useful. For some reason, this actually runs faster with fewer cores. Now I make a small change (x[i] -> x[i, :]):
@njit(parallel=True)
def f(n):
    g = lambda i, j: zeros(3) + sqrt(i * j)
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i, :] += tmp
    return x
The performance is significantly better, and it scales properly with the number of cores (i.e. more cores is faster). Why does slicing make the performance better? To go even further, another change that makes a big difference is turning the lambda function into an external njit function.
@njit
def g(i, j):
    x = zeros(3) + sqrt(i * j)
    return x

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i, :] += tmp
    return x
This again ruins the performance and scaling, reverting to runtimes equal to or slower than the first case. Why does this external function ruin the performance? The performance can be recovered with either of the two options shown below.
@njit
def g(i, j):
    x = sqrt(i * j)
    return x

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = zeros(3) + g(i, j)
            x[i, :] += tmp
    return x
@njit(parallel=True)
def f(n):
    def g(i, j):
        x = zeros(3) + sqrt(i * j)
        return x

    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            tmp = g(i, j)
            x[i, :] += tmp
    return x
Why is the parallel=True numba-decorated function so sensitive to how arrays are being used? I know arrays are not trivially parallelizable, but the exact reason each of these changes dramatically affects performance isn't obvious to me.
TL;DR: allocations and inlining are certainly the source of the performance gap between the different versions.
Operating on a Numpy array is generally a bit more expensive than operating on a view in Numba. In this case, the problem appears to be that Numba performs an allocation when using x[i] while it does not with x[i, :]. The thing is, allocations are expensive, especially in parallel code, since allocators tend not to scale (due to internal locks or atomic variables serializing the execution). I am not sure this is a missed optimization, since x[i] and x[i, :] might have slightly different behaviour.
In addition, Numba uses a JIT compiler (llvmlite) which performs aggressive optimizations. LLVM is able to track allocations so as to remove them in simple cases (like a function allocating memory and freeing it just after, in the same scope, without side effects). The thing is, Numba allocations call an external function that the compiler cannot optimize away, as it does not know its content at compile time (due to the way the Numba runtime interface currently works) and the function could theoretically have side effects.
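If you want to check what the compiler actually generated on your own machine, here is a minimal sketch of one way to do it (this is my suggestion, not necessarily how the assembly below was obtained; inspect_asm() and the NUMBA_DUMP_ASSEMBLY environment variable are standard Numba inspection tools):

from numba import njit, prange
from numpy import sqrt, zeros

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            x[i, :] += zeros(3) + sqrt(i * j)
    return x

f(100)  # call once so the function is compiled for a concrete signature

# Print the assembly of the compiled dispatcher. Note that with parallel=True
# the loop body is compiled into a separate "_numba_parfor_gufunc_..." function,
# so running Python with the NUMBA_DUMP_ASSEMBLY=1 environment variable set is
# a more exhaustive way to see everything.
for sig, asm in f.inspect_asm().items():
    print(sig)
    print(asm)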
To show what is happening, we need to delve into the assembly code. Overall, Numba generates a function for f that calls a xxx_numba_parfor_gufunc_xxx function in N threads. This latter function executes the body of the parallel loop. The caller function is the same for both implementations; the main computing function differs between the two versions. Here is the assembly code on my machine:
----- WITHOUT VIEWS -----
_ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE:
.cfi_startproc
pushq %r15
.cfi_def_cfa_offset 16
pushq %r14
.cfi_def_cfa_offset 24
pushq %r13
.cfi_def_cfa_offset 32
pushq %r12
.cfi_def_cfa_offset 40
pushq %rsi
.cfi_def_cfa_offset 48
pushq %rdi
.cfi_def_cfa_offset 56
pushq %rbp
.cfi_def_cfa_offset 64
pushq %rbx
.cfi_def_cfa_offset 72
subq $280, %rsp
vmovaps %xmm6, 256(%rsp)
.cfi_def_cfa_offset 352
.cfi_offset %rbx, -72
.cfi_offset %rbp, -64
.cfi_offset %rdi, -56
.cfi_offset %rsi, -48
.cfi_offset %r12, -40
.cfi_offset %r13, -32
.cfi_offset %r14, -24
.cfi_offset %r15, -16
.cfi_offset %xmm6, -96
movq %rdx, 160(%rsp)
movq %rcx, 200(%rsp)
movq 504(%rsp), %r14
movq 488(%rsp), %r15
leaq -1(%r15), %rax
imulq %r14, %rax
xorl %ebp, %ebp
testq %rax, %rax
movq %rax, %rdx
cmovnsq %rbp, %rdx
cmpq $1, %r15
cmovbq %rbp, %rdx
movq %rdx, 240(%rsp)
movq %rax, %rdx
sarq $63, %rdx
andnq %rax, %rdx, %rax
addq 464(%rsp), %rax
movq %r15, %rbx
subq $1, %rbx
movq 440(%rsp), %rcx
movq 400(%rsp), %rsi
movabsq $NRT_incref, %rdx
cmovbq %rbp, %rax
movq %rax, 232(%rsp)
callq *%rdx
movq (%rsi), %rbp
movq 8(%rsi), %rdi
subq %rbp, %rdi
incq %rdi
movabsq $NRT_MemInfo_alloc_safe_aligned, %rsi
movl $24, %ecx
movl $32, %edx
callq *%rsi
movq %rax, 192(%rsp)
movq 24(%rax), %rax
movq %rax, 120(%rsp)
movl $24, %ecx
movl $32, %edx
callq *%rsi
movq %rax, 64(%rsp)
testq %rdi, %rdi
jle .LBB6_48
movq %rdi, %r11
movq %rbp, %r8
movq %rbx, %r10
movq %r15, %r9
movq 432(%rsp), %rdx
movq 472(%rsp), %rdi
movq %r15, %rax
imulq 464(%rsp), %rax
movq %rax, 208(%rsp)
xorl %eax, %eax
testq %rdx, %rdx
setg %al
movq %rdx, %rcx
sarq $63, %rcx
andnq %rdx, %rcx, %rcx
subq %rax, %rcx
movq %rcx, 224(%rsp)
leaq -4(%r15), %rax
movq %rax, 184(%rsp)
shrq $2, %rax
incq %rax
andl $7, %r15d
movq %r9, %r13
andq $-8, %r13
movq %r9, %rcx
andq $-4, %rcx
movq %rcx, 176(%rsp)
movl %eax, %ecx
andl $7, %ecx
movq %rbp, %rdx
imulq %r9, %rdx
movq %rcx, 168(%rsp)
shlq $5, %rcx
movq %rcx, 152(%rsp)
andq $-8, %rax
addq $-8, %rax
movq %rax, 144(%rsp)
movq %rax, %rcx
shrq $3, %rcx
incq %rcx
movq %rcx, %rax
movq %rcx, 136(%rsp)
andq $-2, %rcx
movq %rcx, 128(%rsp)
vxorps %xmm6, %xmm6, %xmm6
movq 64(%rsp), %rax
movq 24(%rax), %rax
movq %rax, 248(%rsp)
leaq 56(%rdi,%rdx,8), %rsi
leaq 224(%rdi,%rdx,8), %rcx
leaq (,%r9,8), %rax
movq %rax, 88(%rsp)
leaq (%rdi,%rdx,8), %rax
addq $480, %rax
movq %rax, 80(%rsp)
xorl %eax, %eax
movq %rax, 96(%rsp)
movq %rdx, 216(%rsp)
movq %rdx, 112(%rsp)
movq %rbx, 56(%rsp)
jmp .LBB6_3
.p2align 4, 0x90
.LBB6_2:
leaq -1(%r11), %rax
incq %r8
addq %r9, 112(%rsp)
movq 104(%rsp), %rcx
leaq (%rcx,%r9,8), %rcx
incq 96(%rsp)
movq 88(%rsp), %rdx
addq %rdx, %rsi
addq %rdx, 80(%rsp)
cmpq $2, %r11
movq %rax, %r11
jl .LBB6_48
.LBB6_3:
movq %rcx, 104(%rsp)
movq %r8, %rax
imulq %r9, %rax
movq 472(%rsp), %rdi
leaq (%rdi,%rax,8), %rbp
movq 240(%rsp), %rax
addq %rbp, %rax
movq 232(%rsp), %rcx
addq %rbp, %rcx
movq %r8, %rdx
imulq 496(%rsp), %rdx
movq 464(%rsp), %rbx
addq %rdx, %rbx
testq %r9, %r9
cmoveq %r9, %rdx
cmoveq %r9, %rbx
addq %rdi, %rdx
addq %rdi, %rbx
cmpq %rbx, %rax
setb 39(%rsp)
cmpq %rcx, %rdx
setb %al
cmpq $0, 432(%rsp)
jle .LBB6_2
cmpq 424(%rsp), %r9
jne .LBB6_46
movq 96(%rsp), %rcx
imulq %r9, %rcx
addq 216(%rsp), %rcx
andb %al, 39(%rsp)
movq 472(%rsp), %rax
leaq (%rax,%rcx,8), %rax
movq %rax, 72(%rsp)
movl $1, %eax
movq 224(%rsp), %rbx
xorl %r12d, %r12d
.p2align 4, 0x90
.LBB6_6:
imulq %r8, %r12
vcvtsi2sd %r12, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
movq 120(%rsp), %rcx
vmovups %xmm6, (%rcx)
movq $0, 16(%rcx)
movq 248(%rsp), %rdx
vmovsd %xmm0, (%rdx)
vaddsd (%rbp), %xmm0, %xmm1
vmovsd %xmm1, (%rbp)
vaddsd 8(%rcx), %xmm0, %xmm1
vmovsd %xmm1, 8(%rdx)
vaddsd 8(%rbp), %xmm1, %xmm1
vmovsd %xmm1, 8(%rbp)
vaddsd 16(%rcx), %xmm0, %xmm0
vmovsd %xmm0, 16(%rdx)
movq %rax, %r12
vaddsd 16(%rbp), %xmm0, %xmm0
vmovsd %xmm0, 16(%rbp)
cmpb $0, 39(%rsp)
jne .LBB6_7
testq %r9, %r9
jle .LBB6_28
cmpq $7, %r10
jae .LBB6_19
xorl %eax, %eax
movq %rbp, %rdi
testq %r15, %r15
jne .LBB6_23
jmp .LBB6_26
.p2align 4, 0x90
.LBB6_19:
movq %rbp, %rcx
xorl %eax, %eax
.p2align 4, 0x90
.LBB6_20:
movq (%rcx), %rdx
movq %rdx, -56(%rsi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, -48(%rsi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, -40(%rsi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, -32(%rsi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, -24(%rsi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, -16(%rsi,%rax,8)
leaq (%r14,%rdx), %rdi
movq (%r14,%rdx), %rcx
movq %rcx, -8(%rsi,%rax,8)
leaq (%r14,%rdi), %rcx
movq (%r14,%rdi), %rdx
movq %rdx, (%rsi,%rax,8)
addq $8, %rax
addq %r14, %rcx
cmpq %rax, %r13
jne .LBB6_20
movq %r13, %rax
movq %rbp, %rdi
testq %r15, %r15
je .LBB6_26
.LBB6_23:
movq 112(%rsp), %rcx
addq %rax, %rcx
imulq %r14, %rax
addq %rbp, %rax
movq 472(%rsp), %rdx
leaq (%rdx,%rcx,8), %rcx
xorl %edx, %edx
.p2align 4, 0x90
.LBB6_24:
movq (%rax), %rdi
movq %rdi, (%rcx,%rdx,8)
incq %rdx
addq %r14, %rax
cmpq %rdx, %r15
jne .LBB6_24
movq %rbp, %rdi
.LBB6_26:
cmpb $0, 39(%rsp)
jne .LBB6_27
.LBB6_28:
xorl %eax, %eax
testq %rbx, %rbx
setg %al
movq %rbx, %rcx
subq %rax, %rcx
addq %r12, %rax
testq %rbx, %rbx
movq %rcx, %rbx
jg .LBB6_6
jmp .LBB6_2
.LBB6_7:
movq %r11, 48(%rsp)
movq %r8, 40(%rsp)
movq 208(%rsp), %rcx
movabsq $NRT_Allocate, %rax
vzeroupper
callq *%rax
movq 488(%rsp), %r9
movq %rax, %rdi
testq %r9, %r9
jle .LBB6_8
movq 56(%rsp), %r10
cmpq $7, %r10
movq 48(%rsp), %r11
jae .LBB6_11
xorl %eax, %eax
testq %r15, %r15
jne .LBB6_15
jmp .LBB6_17
.LBB6_8:
movq 40(%rsp), %r8
movq 48(%rsp), %r11
.LBB6_27:
movq %r8, 40(%rsp)
movq %rdi, %rcx
movq %r11, %rdi
movabsq $NRT_Free, %rax
vzeroupper
callq *%rax
movq %rdi, %r11
movq 40(%rsp), %r8
movq 56(%rsp), %r10
movq 488(%rsp), %r9
jmp .LBB6_28
.LBB6_11:
movq %rbp, %rcx
xorl %eax, %eax
.p2align 4, 0x90
.LBB6_12:
movq (%rcx), %rdx
movq %rdx, (%rdi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, 8(%rdi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, 16(%rdi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, 24(%rdi,%rax,8)
leaq (%r14,%rdx), %rcx
movq (%r14,%rdx), %rdx
movq %rdx, 32(%rdi,%rax,8)
leaq (%r14,%rcx), %rdx
movq (%r14,%rcx), %rcx
movq %rcx, 40(%rdi,%rax,8)
leaq (%r14,%rdx), %r8
movq (%r14,%rdx), %rcx
movq %rcx, 48(%rdi,%rax,8)
leaq (%r14,%r8), %rcx
movq (%r14,%r8), %rdx
movq %rdx, 56(%rdi,%rax,8)
addq $8, %rax
addq %r14, %rcx
cmpq %rax, %r13
jne .LBB6_12
movq %r13, %rax
testq %r15, %r15
je .LBB6_17
.LBB6_15:
leaq (%rdi,%rax,8), %r8
imulq %r14, %rax
addq %rbp, %rax
xorl %edx, %edx
.p2align 4, 0x90
.LBB6_16:
movq (%rax), %rcx
movq %rcx, (%r8,%rdx,8)
incq %rdx
addq %r14, %rax
cmpq %rdx, %r15
jne .LBB6_16
.LBB6_17:
testq %r9, %r9
jle .LBB6_18
cmpq $3, %r9
movq 40(%rsp), %r8
ja .LBB6_32
xorl %eax, %eax
jmp .LBB6_31
.LBB6_32:
cmpq $28, 184(%rsp)
jae .LBB6_34
xorl %eax, %eax
jmp .LBB6_40
.LBB6_34:
cmpq $0, 144(%rsp)
je .LBB6_35
movq 128(%rsp), %rcx
xorl %eax, %eax
movq 80(%rsp), %rdx
.p2align 4, 0x90
.LBB6_37:
vmovups (%rdi,%rax,8), %ymm0
vmovups %ymm0, -480(%rdx,%rax,8)
vmovups 32(%rdi,%rax,8), %ymm0
vmovups %ymm0, -448(%rdx,%rax,8)
vmovups 64(%rdi,%rax,8), %ymm0
vmovups %ymm0, -416(%rdx,%rax,8)
vmovups 96(%rdi,%rax,8), %ymm0
vmovups %ymm0, -384(%rdx,%rax,8)
vmovups 128(%rdi,%rax,8), %ymm0
vmovups %ymm0, -352(%rdx,%rax,8)
vmovups 160(%rdi,%rax,8), %ymm0
vmovups %ymm0, -320(%rdx,%rax,8)
vmovups 192(%rdi,%rax,8), %ymm0
vmovups %ymm0, -288(%rdx,%rax,8)
vmovups 224(%rdi,%rax,8), %ymm0
vmovups %ymm0, -256(%rdx,%rax,8)
vmovups 256(%rdi,%rax,8), %ymm0
vmovups %ymm0, -224(%rdx,%rax,8)
vmovups 288(%rdi,%rax,8), %ymm0
vmovups %ymm0, -192(%rdx,%rax,8)
vmovups 320(%rdi,%rax,8), %ymm0
vmovups %ymm0, -160(%rdx,%rax,8)
vmovups 352(%rdi,%rax,8), %ymm0
vmovups %ymm0, -128(%rdx,%rax,8)
vmovups 384(%rdi,%rax,8), %ymm0
vmovups %ymm0, -96(%rdx,%rax,8)
vmovups 416(%rdi,%rax,8), %ymm0
vmovups %ymm0, -64(%rdx,%rax,8)
vmovups 448(%rdi,%rax,8), %ymm0
vmovups %ymm0, -32(%rdx,%rax,8)
vmovupd 480(%rdi,%rax,8), %ymm0
vmovupd %ymm0, (%rdx,%rax,8)
addq $64, %rax
addq $-2, %rcx
jne .LBB6_37
testb $1, 136(%rsp)
je .LBB6_40
.LBB6_39:
vmovups (%rdi,%rax,8), %ymm0
movq 104(%rsp), %rcx
vmovups %ymm0, -224(%rcx,%rax,8)
vmovups 32(%rdi,%rax,8), %ymm0
vmovups %ymm0, -192(%rcx,%rax,8)
vmovups 64(%rdi,%rax,8), %ymm0
vmovups %ymm0, -160(%rcx,%rax,8)
vmovups 96(%rdi,%rax,8), %ymm0
vmovups %ymm0, -128(%rcx,%rax,8)
vmovups 128(%rdi,%rax,8), %ymm0
vmovups %ymm0, -96(%rcx,%rax,8)
vmovups 160(%rdi,%rax,8), %ymm0
vmovups %ymm0, -64(%rcx,%rax,8)
vmovups 192(%rdi,%rax,8), %ymm0
vmovups %ymm0, -32(%rcx,%rax,8)
vmovupd 224(%rdi,%rax,8), %ymm0
vmovupd %ymm0, (%rcx,%rax,8)
addq $32, %rax
.LBB6_40:
cmpq $0, 168(%rsp)
je .LBB6_42
movq 72(%rsp), %rcx
leaq (%rcx,%rax,8), %rcx
leaq (%rdi,%rax,8), %rdx
movq 152(%rsp), %r8
movabsq $memcpy, %rax
vzeroupper
callq *%rax
movq 48(%rsp), %r11
movq 40(%rsp), %r8
movq 488(%rsp), %r9
.LBB6_42:
movq 176(%rsp), %rcx
movq %rcx, %rax
cmpq %r9, %rcx
movq 56(%rsp), %r10
je .LBB6_26
.LBB6_31:
movq 72(%rsp), %rcx
leaq (%rcx,%rax,8), %rcx
leaq (%rdi,%rax,8), %rdx
shlq $3, %rax
movq 88(%rsp), %r8
subq %rax, %r8
movabsq $memcpy, %rax
vzeroupper
callq *%rax
movq 48(%rsp), %r11
movq 40(%rsp), %r8
movq 56(%rsp), %r10
movq 488(%rsp), %r9
jmp .LBB6_26
.LBB6_35:
xorl %eax, %eax
testb $1, 136(%rsp)
jne .LBB6_39
jmp .LBB6_40
.LBB6_18:
movq 40(%rsp), %r8
jmp .LBB6_26
.LBB6_48:
movabsq $NRT_decref, %rsi
movq 440(%rsp), %rcx
vzeroupper
callq *%rsi
movq 192(%rsp), %rcx
callq *%rsi
movq 64(%rsp), %rcx
callq *%rsi
movq 200(%rsp), %rax
movq $0, (%rax)
xorl %eax, %eax
jmp .LBB6_47
.LBB6_46:
vxorps %xmm0, %xmm0, %xmm0
movq 120(%rsp), %rax
vmovups %xmm0, (%rax)
movq $0, 16(%rax)
movq 440(%rsp), %rcx
movabsq $NRT_incref, %rax
vzeroupper
callq *%rax
movabsq $.const.picklebuf.2691622873664, %rax
movq 160(%rsp), %rcx
movq %rax, (%rcx)
movl $1, %eax
.LBB6_47:
vmovaps 256(%rsp), %xmm6
addq $280, %rsp
popq %rbx
popq %rbp
popq %rdi
popq %rsi
popq %r12
popq %r13
popq %r14
popq %r15
retq
.Lfunc_end6:
.size _ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE, .Lfunc_end6-_ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE
.cfi_endproc
.weak NRT_incref
.p2align 4, 0x90
.type NRT_incref,@function
NRT_incref:
testq %rcx, %rcx
je .LBB7_1
lock incq (%rcx)
retq
.LBB7_1:
retq
.Lfunc_end7:
----- WITH VIEWS -----
__gufunc__._ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272bef6ed00_2491B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE:
.cfi_startproc
pushq %r15
.cfi_def_cfa_offset 16
pushq %r14
.cfi_def_cfa_offset 24
pushq %r13
.cfi_def_cfa_offset 32
pushq %r12
.cfi_def_cfa_offset 40
pushq %rsi
.cfi_def_cfa_offset 48
pushq %rdi
.cfi_def_cfa_offset 56
pushq %rbp
.cfi_def_cfa_offset 64
pushq %rbx
.cfi_def_cfa_offset 72
subq $168, %rsp
vmovaps %xmm6, 144(%rsp)
.cfi_def_cfa_offset 240
.cfi_offset %rbx, -72
.cfi_offset %rbp, -64
.cfi_offset %rdi, -56
.cfi_offset %rsi, -48
.cfi_offset %r12, -40
.cfi_offset %r13, -32
.cfi_offset %r14, -24
.cfi_offset %r15, -16
.cfi_offset %xmm6, -96
movq (%rdx), %rax
movq 24(%rdx), %r12
movq (%rcx), %rdx
movq %rdx, 120(%rsp)
movq 8(%rcx), %rdx
movq %rdx, 112(%rsp)
movq (%r8), %rdx
movq %rdx, 104(%rsp)
movq 8(%r8), %rdx
movq %rdx, 96(%rsp)
movq 16(%rcx), %rdx
movq %rdx, 88(%rsp)
movq 16(%r8), %rdx
movq %rdx, 80(%rsp)
movq 24(%rcx), %rcx
movq %rcx, 64(%rsp)
movq 24(%r8), %rcx
movq %rcx, 56(%rsp)
movl $0, 36(%rsp)
movq %rax, 72(%rsp)
testq %rax, %rax
jle .LBB5_12
cmpq $4, %r12
movl $3, %eax
cmovlq %r12, %rax
movq %rax, 48(%rsp)
movq %r12, %rbx
sarq $63, %rbx
andq %r12, %rbx
xorl %eax, %eax
vxorps %xmm6, %xmm6, %xmm6
jmp .LBB5_2
.p2align 4, 0x90
.LBB5_9:
movq 136(%rsp), %rcx
movabsq $NRT_decref, %rsi
movq %r9, %rdi
callq *%rsi
movq %rdi, %rcx
callq *%rsi
movq 40(%rsp), %rax
incq %rax
cmpq 72(%rsp), %rax
je .LBB5_12
.LBB5_2:
movq %rax, %rbp
imulq 104(%rsp), %rbp
movq %rax, %rcx
imulq 96(%rsp), %rcx
movq 112(%rsp), %rdx
movq (%rdx,%rcx), %rcx
movq %rcx, 128(%rsp)
movq %rax, 40(%rsp)
movq %rax, %rcx
imulq 80(%rsp), %rcx
movq 88(%rsp), %rdx
movq (%rdx,%rcx), %rdi
movq 120(%rsp), %rcx
movq (%rbp,%rcx), %r13
movq 8(%rbp,%rcx), %r14
subq %r13, %r14
incq %r14
movl $24, %ecx
movl $32, %edx
movabsq $NRT_MemInfo_alloc_safe_aligned, %rsi
callq *%rsi
movq %rax, 136(%rsp)
movq 24(%rax), %r15
movl $24, %ecx
movl $32, %edx
callq *%rsi
movq %rax, %r9
testq %r14, %r14
jle .LBB5_9
xorl %edx, %edx
testq %rdi, %rdi
setg %r8b
testq %rdi, %rdi
jle .LBB5_9
movq 128(%rsp), %rax
cmpq %rax, 48(%rsp)
jne .LBB5_10
movq 40(%rsp), %rax
imulq 56(%rsp), %rax
addq 64(%rsp), %rax
movq 24(%r9), %rcx
movb %r8b, %dl
negq %rdx
leaq (%rdi,%rdx), %rsi
incq %rsi
.p2align 4, 0x90
.LBB5_6:
movq %r13, %rdi
imulq %r12, %rdi
addq %rbx, %rdi
movq %rsi, %rdx
xorl %ebp, %ebp
.p2align 4, 0x90
.LBB5_7:
vcvtsi2sd %rbp, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
vmovups %xmm6, (%r15)
movq $0, 16(%r15)
vmovsd %xmm0, (%rcx)
vaddsd (%rax,%rdi,8), %xmm0, %xmm1
vmovsd %xmm1, (%rax,%rdi,8)
vaddsd 8(%r15), %xmm0, %xmm1
vmovsd %xmm1, 8(%rcx)
vaddsd 8(%rax,%rdi,8), %xmm1, %xmm1
vmovsd %xmm1, 8(%rax,%rdi,8)
vaddsd 16(%r15), %xmm0, %xmm0
vmovsd %xmm0, 16(%rcx)
vaddsd 16(%rax,%rdi,8), %xmm0, %xmm0
vmovsd %xmm0, 16(%rax,%rdi,8)
addq %r13, %rbp
decq %rdx
testq %rdx, %rdx
jg .LBB5_7
leaq -1(%r14), %rdx
incq %r13
cmpq $1, %r14
movq %rdx, %r14
jg .LBB5_6
jmp .LBB5_9
.LBB5_10:
vxorps %xmm0, %xmm0, %xmm0
vmovups %xmm0, (%r15)
movq $0, 16(%r15)
movabsq $numba_gil_ensure, %rax
leaq 36(%rsp), %rcx
callq *%rax
movabsq $PyErr_Clear, %rax
callq *%rax
movabsq $.const.pickledata.2691858029760, %rcx
movabsq $.const.pickledata.2691858029760.sha1, %r8
movabsq $numba_unpickle, %rax
movl $180, %edx
callq *%rax
testq %rax, %rax
je .LBB5_11
movabsq $numba_do_raise, %rdx
movq %rax, %rcx
callq *%rdx
.LBB5_11:
movabsq $numba_gil_release, %rax
leaq 36(%rsp), %rcx
callq *%rax
.LBB5_12:
vmovaps 144(%rsp), %xmm6
addq $168, %rsp
popq %rbx
popq %rbp
popq %rdi
popq %rsi
popq %r12
popq %r13
popq %r14
popq %r15
retq
.Lfunc_end5:
The code of the first version is huge compared to the second one. Overall, we can see that the main computational part is about the same:
----- WITHOUT VIEWS -----
.LBB6_6:
imulq %r8, %r12
vcvtsi2sd %r12, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
movq 120(%rsp), %rcx
vmovups %xmm6, (%rcx)
movq $0, 16(%rcx)
movq 248(%rsp), %rdx
vmovsd %xmm0, (%rdx)
vaddsd (%rbp), %xmm0, %xmm1
vmovsd %xmm1, (%rbp)
vaddsd 8(%rcx), %xmm0, %xmm1
vmovsd %xmm1, 8(%rdx)
vaddsd 8(%rbp), %xmm1, %xmm1
vmovsd %xmm1, 8(%rbp)
vaddsd 16(%rcx), %xmm0, %xmm0
vmovsd %xmm0, 16(%rdx)
movq %rax, %r12
vaddsd 16(%rbp), %xmm0, %xmm0
vmovsd %xmm0, 16(%rbp)
cmpb $0, 39(%rsp)
jne .LBB6_7
----- WITH VIEWS -----
.LBB5_7:
vcvtsi2sd %rbp, %xmm2, %xmm0
vsqrtsd %xmm0, %xmm0, %xmm0
vmovups %xmm6, (%r15)
movq $0, 16(%r15)
vmovsd %xmm0, (%rcx)
vaddsd (%rax,%rdi,8), %xmm0, %xmm1
vmovsd %xmm1, (%rax,%rdi,8)
vaddsd 8(%r15), %xmm0, %xmm1
vmovsd %xmm1, 8(%rcx)
vaddsd 8(%rax,%rdi,8), %xmm1, %xmm1
vmovsd %xmm1, 8(%rax,%rdi,8)
vaddsd 16(%r15), %xmm0, %xmm0
vmovsd %xmm0, 16(%rcx)
vaddsd 16(%rax,%rdi,8), %xmm0, %xmm0
vmovsd %xmm0, 16(%rax,%rdi,8)
addq %r13, %rbp
decq %rdx
testq %rdx, %rdx
jg .LBB5_7
While the code of the first version is a bit less efficient than the second one, the difference is certainly far from sufficient to explain the huge gap in the timings (~65 ms vs. <0.6 ms).
We can also see that the function calls in the assembly code are different between the two versions:
----- WITHOUT VIEWS -----
memcpy
NRT_Allocate
NRT_Free
NRT_decref
NRT_incref
NRT_MemInfo_alloc_safe_aligned
----- WITH VIEWS -----
numba_do_raise
numba_gil_ensure
numba_gil_release
numba_unpickle
PyErr_Clear
NRT_decref
NRT_MemInfo_alloc_safe_aligned
The NRT_Allocate, NRT_Free, NRT_decref and NRT_incref function calls indicate that the compiled code creates a new Python object in the middle of the hot loop, which is very inefficient. Meanwhile, the second version does not perform any NRT_incref, and I suspect NRT_decref is never actually called (or maybe just once). The second code performs no Numpy array allocations. It looks like the calls to PyErr_Clear, numba_do_raise and numba_unpickle are made to handle exceptions that could possibly be raised (surprisingly, they are not present in the first version, so this is likely related to the use of views). Finally, the call to memcpy in the first version shows that the newly created array is certainly copied back into x. The allocation and the copy make the first version very inefficient.
I am pretty surprised that Numba does not generate allocations for zeros(3) in the fast versions. This is great, but you should really avoid creating arrays in hot loops like this, since there is no guarantee Numba will always optimize such a call away. In fact, it often does not.
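As a minimal sketch of what that can look like here (my own rewrite, with the hypothetical name f_no_alloc): since zeros(3) + sqrt(i*j) just adds the same scalar to every component, the temporary arrays can be removed entirely and the scalar accumulated directly.

from numba import njit, prange
from numpy import sqrt, zeros

@njit(parallel=True)
def f_no_alloc(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            s = sqrt(i * j)      # plain scalar: no temporary array at all
            for k in range(3):   # fixed-size loop instead of zeros(3) + ...
                x[i, k] += s
    return x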
You can use a basic loop to copy all items of a slice so as to avoid any allocation. This is often faster when the size of the slice is known at compile time. Slice copies could in theory be faster since the compiler might vectorize them better, but in practice such plain loops are relatively well auto-vectorized.
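For example (a sketch with a hypothetical helper name; it assumes the row length of 3 is known at compile time, as in your code):

from numba import njit

@njit
def add_row(x, i, tmp):
    # Accumulate a small fixed-size result into row i of x one element at a
    # time instead of x[i, :] += tmp. The trip count (3) is a compile-time
    # constant, so LLVM can fully unroll and vectorize this loop.
    for k in range(3):
        x[i, k] += tmp[k]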
One can note that the vsqrtsd instruction is present in the code of both versions, so the lambda is actually inlined.
When you move the lambda out of the function and put its content in another jitted function, LLVM may not inline it. You can request Numba to inline the function manually, before the generation of the intermediate representation (IR code), so that LLVM should generate similar code. This can be done using the inline="always" flag. This tends to increase the compilation time though (since the code is essentially copy-pasted into the caller function). Inlining is critical for applying many further optimizations (constant propagation, SIMD vectorization, etc.), which can result in a huge performance boost.
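A minimal sketch of that flag applied to the external function from your third snippet (inline="always" is a real Numba option; the rest mirrors your code):

from numba import njit, prange
from numpy import sqrt, zeros

@njit(inline="always")  # force inlining at the Numba IR level, before LLVM sees it
def g(i, j):
    return zeros(3) + sqrt(i * j)

@njit(parallel=True)
def f(n):
    x = zeros((n, 3))
    for i in prange(n):
        for j in range(n):
            x[i, :] += g(i, j)
    return x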