Tags: python, performance, numba

Why do small changes have dramatic effects on the runtime of my numba parallel function?


I'm trying to understand why my parallelized numba function behaves the way it does. In particular, I don't understand why it is so sensitive to how arrays are used.

I have the following function:

from numpy import zeros, sqrt
from numba import njit, prange

@njit(parallel=True)
def f(n):
    g = lambda i,j: zeros(3) + sqrt(i*j)
    x = zeros((n,3))
    for i in prange(n):
        for j in range(n):
            tmp      = g(i,j)
            x[i] += tmp
    return x

Assume that n is large enough for parallel computing to be useful. For some reason this version actually runs faster with fewer cores. Now I make a small change (x[i] -> x[i, :]):

@njit(parallel=True)
def f(n):
    g = lambda i,j: zeros(3) + sqrt(i*j)
    x = zeros((n,3))
    for i in prange(n):
        for j in range(n):
            tmp      = g(i,j)
            x[i, :] += tmp
    return x

The performance is significantly better, and it scales properly with the number of cores (i.e. more cores is faster). Why does slicing make the performance better? To go even further, another change that makes a big difference is turning the lambda function into an external njit function:

@njit
def g(i,j):
    x = zeros(3) + sqrt(i*j)
    return x

@njit(parallel=True)
def f(n):
    x = zeros((n,3))
    for i in prange(n):
        for j in range(n):
            tmp      = g(i,j)
            x[i, :] += tmp
    return x

This again ruins the performance and scaling, reverting to runtimes equal to or slower than those of the first case. Why does this external function ruin the performance? The performance can be recovered with the two options shown below.

@njit
def g(i,j):
    x = sqrt(i*j)
    return x

@njit(parallel=True)
def f(n):
    x = zeros((n,3))
    for i in prange(n):
        for j in range(n):
            tmp      = zeros(3) + g(i,j)
            x[i, :] += tmp
    return x

@njit(parallel=True)
def f(n):
    def g(i,j):
        x = zeros(3) + sqrt(i*j)
        return x
    x = zeros((n,3))
    for i in prange(n):
        for j in range(n):
            tmp      = g(i,j)
            x[i, :] += tmp
    return x

Why is the parallel=True numba decorated function so sensitive to how arrays are used? I know arrays are not trivially parallelizable, but the exact reason each of these changes dramatically affects performance isn't obvious to me.
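
For reference, this is roughly how I am timing the different versions (the thread counts and n below are just examples):

from time import perf_counter
from numba import set_num_threads

n = 2000
f(n)  # warm up / trigger compilation

for nthreads in (1, 2, 4, 8):
    set_num_threads(nthreads)
    t0 = perf_counter()
    f(n)
    print(nthreads, "threads:", perf_counter() - t0, "s")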


Solution

  • TL;DR: allocations and inlining are almost certainly the source of the performance gap between the different versions.

    Operating on NumPy arrays is generally a bit more expensive than operating on views in Numba. In this case, the problem appears to be that Numba performs an allocation when using x[i], while it does not with x[i, :]. The thing is, allocations are expensive, especially in parallel code, since allocators tend not to scale (due to internal locks or atomic variables serializing the execution). I am not sure this is a missed optimization, since x[i] and x[i, :] might have slightly different behaviour.

    In addition, Numba uses a JIT compiler based on LLVM (via llvmlite) which performs aggressive optimizations. LLVM is able to track allocations and remove them in simple cases (e.g. a function that allocates and frees data just after, in the same scope, without side effects). The thing is, Numba allocations call an external runtime function that the compiler cannot optimize, since it does not know its content at compile time (due to the way the Numba runtime interface currently works) and the function could theoretically have side effects.
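
    As a minimal illustration of this point (a hypothetical function h, not taken from the question), the allocation below is compiled to a call into the Numba runtime (NRT_MemInfo_alloc_safe_aligned in the listings further down), which LLVM has to treat as an opaque external call, so the allocation may remain even though the array never escapes the loop:

        from numba import njit
        from numpy import zeros

        @njit
        def h(n):
            s = 0.0
            for i in range(n):
                t = zeros(3)   # opaque runtime call from LLVM's point of view
                t[0] = i       # the array never escapes this iteration...
                s += t[0]
            return s           # ...yet the per-iteration allocation may not be removed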

    To show what is happening, we need to delve into the assembly code. Overall, Numba generates a function for f that calls a xxx_numba_parfor_gufunc_xxx function in N threads. This last function executes the body of the parallel loop. The caller function is the same for both implementations; the main computing function differs between the two versions. Here is the assembly code on my machine:

        -----  WITHOUT VIEWS  ----- 
    
    _ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE:
            .cfi_startproc
            pushq   %r15
            .cfi_def_cfa_offset 16
            pushq   %r14
            .cfi_def_cfa_offset 24
            pushq   %r13
            .cfi_def_cfa_offset 32
            pushq   %r12
            .cfi_def_cfa_offset 40
            pushq   %rsi
            .cfi_def_cfa_offset 48
            pushq   %rdi
            .cfi_def_cfa_offset 56
            pushq   %rbp
            .cfi_def_cfa_offset 64
            pushq   %rbx
            .cfi_def_cfa_offset 72
            subq    $280, %rsp
            vmovaps %xmm6, 256(%rsp)
            .cfi_def_cfa_offset 352
            .cfi_offset %rbx, -72
            .cfi_offset %rbp, -64
            .cfi_offset %rdi, -56
            .cfi_offset %rsi, -48
            .cfi_offset %r12, -40
            .cfi_offset %r13, -32
            .cfi_offset %r14, -24
            .cfi_offset %r15, -16
            .cfi_offset %xmm6, -96
            movq    %rdx, 160(%rsp)
            movq    %rcx, 200(%rsp)
            movq    504(%rsp), %r14
            movq    488(%rsp), %r15
            leaq    -1(%r15), %rax
            imulq   %r14, %rax
            xorl    %ebp, %ebp
            testq   %rax, %rax
            movq    %rax, %rdx
            cmovnsq %rbp, %rdx
            cmpq    $1, %r15
            cmovbq  %rbp, %rdx
            movq    %rdx, 240(%rsp)
            movq    %rax, %rdx
            sarq    $63, %rdx
            andnq   %rax, %rdx, %rax
            addq    464(%rsp), %rax
            movq    %r15, %rbx
            subq    $1, %rbx
            movq    440(%rsp), %rcx
            movq    400(%rsp), %rsi
            movabsq $NRT_incref, %rdx
            cmovbq  %rbp, %rax
            movq    %rax, 232(%rsp)
            callq   *%rdx
            movq    (%rsi), %rbp
            movq    8(%rsi), %rdi
            subq    %rbp, %rdi
            incq    %rdi
            movabsq $NRT_MemInfo_alloc_safe_aligned, %rsi
            movl    $24, %ecx
            movl    $32, %edx
            callq   *%rsi
            movq    %rax, 192(%rsp)
            movq    24(%rax), %rax
            movq    %rax, 120(%rsp)
            movl    $24, %ecx
            movl    $32, %edx
            callq   *%rsi
            movq    %rax, 64(%rsp)
            testq   %rdi, %rdi
            jle     .LBB6_48
            movq    %rdi, %r11
            movq    %rbp, %r8
            movq    %rbx, %r10
            movq    %r15, %r9
            movq    432(%rsp), %rdx
            movq    472(%rsp), %rdi
            movq    %r15, %rax
            imulq   464(%rsp), %rax
            movq    %rax, 208(%rsp)
            xorl    %eax, %eax
            testq   %rdx, %rdx
            setg    %al
            movq    %rdx, %rcx
            sarq    $63, %rcx
            andnq   %rdx, %rcx, %rcx
            subq    %rax, %rcx
            movq    %rcx, 224(%rsp)
            leaq    -4(%r15), %rax
            movq    %rax, 184(%rsp)
            shrq    $2, %rax
            incq    %rax
            andl    $7, %r15d
            movq    %r9, %r13
            andq    $-8, %r13
            movq    %r9, %rcx
            andq    $-4, %rcx
            movq    %rcx, 176(%rsp)
            movl    %eax, %ecx
            andl    $7, %ecx
            movq    %rbp, %rdx
            imulq   %r9, %rdx
            movq    %rcx, 168(%rsp)
            shlq    $5, %rcx
            movq    %rcx, 152(%rsp)
            andq    $-8, %rax
            addq    $-8, %rax
            movq    %rax, 144(%rsp)
            movq    %rax, %rcx
            shrq    $3, %rcx
            incq    %rcx
            movq    %rcx, %rax
            movq    %rcx, 136(%rsp)
            andq    $-2, %rcx
            movq    %rcx, 128(%rsp)
            vxorps  %xmm6, %xmm6, %xmm6
            movq    64(%rsp), %rax
            movq    24(%rax), %rax
            movq    %rax, 248(%rsp)
            leaq    56(%rdi,%rdx,8), %rsi
            leaq    224(%rdi,%rdx,8), %rcx
            leaq    (,%r9,8), %rax
            movq    %rax, 88(%rsp)
            leaq    (%rdi,%rdx,8), %rax
            addq    $480, %rax
            movq    %rax, 80(%rsp)
            xorl    %eax, %eax
            movq    %rax, 96(%rsp)
            movq    %rdx, 216(%rsp)
            movq    %rdx, 112(%rsp)
            movq    %rbx, 56(%rsp)
            jmp     .LBB6_3
            .p2align        4, 0x90
    .LBB6_2:
            leaq    -1(%r11), %rax
            incq    %r8
            addq    %r9, 112(%rsp)
            movq    104(%rsp), %rcx
            leaq    (%rcx,%r9,8), %rcx
            incq    96(%rsp)
            movq    88(%rsp), %rdx
            addq    %rdx, %rsi
            addq    %rdx, 80(%rsp)
            cmpq    $2, %r11
            movq    %rax, %r11
            jl      .LBB6_48
    .LBB6_3:
            movq    %rcx, 104(%rsp)
            movq    %r8, %rax
            imulq   %r9, %rax
            movq    472(%rsp), %rdi
            leaq    (%rdi,%rax,8), %rbp
            movq    240(%rsp), %rax
            addq    %rbp, %rax
            movq    232(%rsp), %rcx
            addq    %rbp, %rcx
            movq    %r8, %rdx
            imulq   496(%rsp), %rdx
            movq    464(%rsp), %rbx
            addq    %rdx, %rbx
            testq   %r9, %r9
            cmoveq  %r9, %rdx
            cmoveq  %r9, %rbx
            addq    %rdi, %rdx
            addq    %rdi, %rbx
            cmpq    %rbx, %rax
            setb    39(%rsp)
            cmpq    %rcx, %rdx
            setb    %al
            cmpq    $0, 432(%rsp)
            jle     .LBB6_2
            cmpq    424(%rsp), %r9
            jne     .LBB6_46
            movq    96(%rsp), %rcx
            imulq   %r9, %rcx
            addq    216(%rsp), %rcx
            andb    %al, 39(%rsp)
            movq    472(%rsp), %rax
            leaq    (%rax,%rcx,8), %rax
            movq    %rax, 72(%rsp)
            movl    $1, %eax
            movq    224(%rsp), %rbx
            xorl    %r12d, %r12d
            .p2align        4, 0x90
    .LBB6_6:
            imulq   %r8, %r12
            vcvtsi2sd       %r12, %xmm2, %xmm0
            vsqrtsd %xmm0, %xmm0, %xmm0
            movq    120(%rsp), %rcx
            vmovups %xmm6, (%rcx)
            movq    $0, 16(%rcx)
            movq    248(%rsp), %rdx
            vmovsd  %xmm0, (%rdx)
            vaddsd  (%rbp), %xmm0, %xmm1
            vmovsd  %xmm1, (%rbp)
            vaddsd  8(%rcx), %xmm0, %xmm1
            vmovsd  %xmm1, 8(%rdx)
            vaddsd  8(%rbp), %xmm1, %xmm1
            vmovsd  %xmm1, 8(%rbp)
            vaddsd  16(%rcx), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rdx)
            movq    %rax, %r12
            vaddsd  16(%rbp), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rbp)
            cmpb    $0, 39(%rsp)
            jne     .LBB6_7
            testq   %r9, %r9
            jle     .LBB6_28
            cmpq    $7, %r10
            jae     .LBB6_19
            xorl    %eax, %eax
            movq    %rbp, %rdi
            testq   %r15, %r15
            jne     .LBB6_23
            jmp     .LBB6_26
            .p2align        4, 0x90
    .LBB6_19:
            movq    %rbp, %rcx
            xorl    %eax, %eax
            .p2align        4, 0x90
    .LBB6_20:
            movq    (%rcx), %rdx
            movq    %rdx, -56(%rsi,%rax,8)
            leaq    (%r14,%rcx), %rdx
            movq    (%r14,%rcx), %rcx
            movq    %rcx, -48(%rsi,%rax,8)
            leaq    (%r14,%rdx), %rcx
            movq    (%r14,%rdx), %rdx
            movq    %rdx, -40(%rsi,%rax,8)
            leaq    (%r14,%rcx), %rdx
            movq    (%r14,%rcx), %rcx
            movq    %rcx, -32(%rsi,%rax,8)
            leaq    (%r14,%rdx), %rcx
            movq    (%r14,%rdx), %rdx
            movq    %rdx, -24(%rsi,%rax,8)
            leaq    (%r14,%rcx), %rdx
            movq    (%r14,%rcx), %rcx
            movq    %rcx, -16(%rsi,%rax,8)
            leaq    (%r14,%rdx), %rdi
            movq    (%r14,%rdx), %rcx
            movq    %rcx, -8(%rsi,%rax,8)
            leaq    (%r14,%rdi), %rcx
            movq    (%r14,%rdi), %rdx
            movq    %rdx, (%rsi,%rax,8)
            addq    $8, %rax
            addq    %r14, %rcx
            cmpq    %rax, %r13
            jne     .LBB6_20
            movq    %r13, %rax
            movq    %rbp, %rdi
            testq   %r15, %r15
            je      .LBB6_26
    .LBB6_23:
            movq    112(%rsp), %rcx
            addq    %rax, %rcx
            imulq   %r14, %rax
            addq    %rbp, %rax
            movq    472(%rsp), %rdx
            leaq    (%rdx,%rcx,8), %rcx
            xorl    %edx, %edx
            .p2align        4, 0x90
    .LBB6_24:
            movq    (%rax), %rdi
            movq    %rdi, (%rcx,%rdx,8)
            incq    %rdx
            addq    %r14, %rax
            cmpq    %rdx, %r15
            jne     .LBB6_24
            movq    %rbp, %rdi
    .LBB6_26:
            cmpb    $0, 39(%rsp)
            jne     .LBB6_27
    .LBB6_28:
            xorl    %eax, %eax
            testq   %rbx, %rbx
            setg    %al
            movq    %rbx, %rcx
            subq    %rax, %rcx
            addq    %r12, %rax
            testq   %rbx, %rbx
            movq    %rcx, %rbx
            jg      .LBB6_6
            jmp     .LBB6_2
    .LBB6_7:
            movq    %r11, 48(%rsp)
            movq    %r8, 40(%rsp)
            movq    208(%rsp), %rcx
            movabsq $NRT_Allocate, %rax
            vzeroupper
            callq   *%rax
            movq    488(%rsp), %r9
            movq    %rax, %rdi
            testq   %r9, %r9
            jle     .LBB6_8
            movq    56(%rsp), %r10
            cmpq    $7, %r10
            movq    48(%rsp), %r11
            jae     .LBB6_11
            xorl    %eax, %eax
            testq   %r15, %r15
            jne     .LBB6_15
            jmp     .LBB6_17
    .LBB6_8:
            movq    40(%rsp), %r8
            movq    48(%rsp), %r11
    .LBB6_27:
            movq    %r8, 40(%rsp)
            movq    %rdi, %rcx
            movq    %r11, %rdi
            movabsq $NRT_Free, %rax
            vzeroupper
            callq   *%rax
            movq    %rdi, %r11
            movq    40(%rsp), %r8
            movq    56(%rsp), %r10
            movq    488(%rsp), %r9
            jmp     .LBB6_28
    .LBB6_11:
            movq    %rbp, %rcx
            xorl    %eax, %eax
            .p2align        4, 0x90
    .LBB6_12:
            movq    (%rcx), %rdx
            movq    %rdx, (%rdi,%rax,8)
            leaq    (%r14,%rcx), %rdx
            movq    (%r14,%rcx), %rcx
            movq    %rcx, 8(%rdi,%rax,8)
            leaq    (%r14,%rdx), %rcx
            movq    (%r14,%rdx), %rdx
            movq    %rdx, 16(%rdi,%rax,8)
            leaq    (%r14,%rcx), %rdx
            movq    (%r14,%rcx), %rcx
            movq    %rcx, 24(%rdi,%rax,8)
            leaq    (%r14,%rdx), %rcx
            movq    (%r14,%rdx), %rdx
            movq    %rdx, 32(%rdi,%rax,8)
            leaq    (%r14,%rcx), %rdx
            movq    (%r14,%rcx), %rcx
            movq    %rcx, 40(%rdi,%rax,8)
            leaq    (%r14,%rdx), %r8
            movq    (%r14,%rdx), %rcx
            movq    %rcx, 48(%rdi,%rax,8)
            leaq    (%r14,%r8), %rcx
            movq    (%r14,%r8), %rdx
            movq    %rdx, 56(%rdi,%rax,8)
            addq    $8, %rax
            addq    %r14, %rcx
            cmpq    %rax, %r13
            jne     .LBB6_12
            movq    %r13, %rax
            testq   %r15, %r15
            je      .LBB6_17
    .LBB6_15:
            leaq    (%rdi,%rax,8), %r8
            imulq   %r14, %rax
            addq    %rbp, %rax
            xorl    %edx, %edx
            .p2align        4, 0x90
    .LBB6_16:
            movq    (%rax), %rcx
            movq    %rcx, (%r8,%rdx,8)
            incq    %rdx
            addq    %r14, %rax
            cmpq    %rdx, %r15
            jne     .LBB6_16
    .LBB6_17:
            testq   %r9, %r9
            jle     .LBB6_18
            cmpq    $3, %r9
            movq    40(%rsp), %r8
            ja      .LBB6_32
            xorl    %eax, %eax
            jmp     .LBB6_31
    .LBB6_32:
            cmpq    $28, 184(%rsp)
            jae     .LBB6_34
            xorl    %eax, %eax
            jmp     .LBB6_40
    .LBB6_34:
            cmpq    $0, 144(%rsp)
            je      .LBB6_35
            movq    128(%rsp), %rcx
            xorl    %eax, %eax
            movq    80(%rsp), %rdx
            .p2align        4, 0x90
    .LBB6_37:
            vmovups (%rdi,%rax,8), %ymm0
            vmovups %ymm0, -480(%rdx,%rax,8)
            vmovups 32(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -448(%rdx,%rax,8)
            vmovups 64(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -416(%rdx,%rax,8)
            vmovups 96(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -384(%rdx,%rax,8)
            vmovups 128(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -352(%rdx,%rax,8)
            vmovups 160(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -320(%rdx,%rax,8)
            vmovups 192(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -288(%rdx,%rax,8)
            vmovups 224(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -256(%rdx,%rax,8)
            vmovups 256(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -224(%rdx,%rax,8)
            vmovups 288(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -192(%rdx,%rax,8)
            vmovups 320(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -160(%rdx,%rax,8)
            vmovups 352(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -128(%rdx,%rax,8)
            vmovups 384(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -96(%rdx,%rax,8)
            vmovups 416(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -64(%rdx,%rax,8)
            vmovups 448(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -32(%rdx,%rax,8)
            vmovupd 480(%rdi,%rax,8), %ymm0
            vmovupd %ymm0, (%rdx,%rax,8)
            addq    $64, %rax
            addq    $-2, %rcx
            jne     .LBB6_37
            testb   $1, 136(%rsp)
            je      .LBB6_40
    .LBB6_39:
            vmovups (%rdi,%rax,8), %ymm0
            movq    104(%rsp), %rcx
            vmovups %ymm0, -224(%rcx,%rax,8)
            vmovups 32(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -192(%rcx,%rax,8)
            vmovups 64(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -160(%rcx,%rax,8)
            vmovups 96(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -128(%rcx,%rax,8)
            vmovups 128(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -96(%rcx,%rax,8)
            vmovups 160(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -64(%rcx,%rax,8)
            vmovups 192(%rdi,%rax,8), %ymm0
            vmovups %ymm0, -32(%rcx,%rax,8)
            vmovupd 224(%rdi,%rax,8), %ymm0
            vmovupd %ymm0, (%rcx,%rax,8)
            addq    $32, %rax
    .LBB6_40:
            cmpq    $0, 168(%rsp)
            je      .LBB6_42
            movq    72(%rsp), %rcx
            leaq    (%rcx,%rax,8), %rcx
            leaq    (%rdi,%rax,8), %rdx
            movq    152(%rsp), %r8
            movabsq $memcpy, %rax
            vzeroupper
            callq   *%rax
            movq    48(%rsp), %r11
            movq    40(%rsp), %r8
            movq    488(%rsp), %r9
    .LBB6_42:
            movq    176(%rsp), %rcx
            movq    %rcx, %rax
            cmpq    %r9, %rcx
            movq    56(%rsp), %r10
            je      .LBB6_26
    .LBB6_31:
            movq    72(%rsp), %rcx
            leaq    (%rcx,%rax,8), %rcx
            leaq    (%rdi,%rax,8), %rdx
            shlq    $3, %rax
            movq    88(%rsp), %r8
            subq    %rax, %r8
            movabsq $memcpy, %rax
            vzeroupper
            callq   *%rax
            movq    48(%rsp), %r11
            movq    40(%rsp), %r8
            movq    56(%rsp), %r10
            movq    488(%rsp), %r9
            jmp     .LBB6_26
    .LBB6_35:
            xorl    %eax, %eax
            testb   $1, 136(%rsp)
            jne     .LBB6_39
            jmp     .LBB6_40
    .LBB6_18:
            movq    40(%rsp), %r8
            jmp     .LBB6_26
    .LBB6_48:
            movabsq $NRT_decref, %rsi
            movq    440(%rsp), %rcx
            vzeroupper
            callq   *%rsi
            movq    192(%rsp), %rcx
            callq   *%rsi
            movq    64(%rsp), %rcx
            callq   *%rsi
            movq    200(%rsp), %rax
            movq    $0, (%rax)
            xorl    %eax, %eax
            jmp     .LBB6_47
    .LBB6_46:
            vxorps  %xmm0, %xmm0, %xmm0
            movq    120(%rsp), %rax
            vmovups %xmm0, (%rax)
            movq    $0, 16(%rax)
            movq    440(%rsp), %rcx
            movabsq $NRT_incref, %rax
            vzeroupper
            callq   *%rax
            movabsq $.const.picklebuf.2691622873664, %rax
            movq    160(%rsp), %rcx
            movq    %rax, (%rcx)
            movl    $1, %eax
    .LBB6_47:
            vmovaps 256(%rsp), %xmm6
            addq    $280, %rsp
            popq    %rbx
            popq    %rbp
            popq    %rdi
            popq    %rsi
            popq    %r12
            popq    %r13
            popq    %r14
            popq    %r15
            retq
    .Lfunc_end6:
            .size   _ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE, .Lfunc_end6-_ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272d183ed00_2487B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE
            .cfi_endproc
    
            .weak   NRT_incref
            .p2align        4, 0x90
            .type   NRT_incref,@function
    NRT_incref:
            testq   %rcx, %rcx
            je      .LBB7_1
            lock            incq    (%rcx)
            retq
    .LBB7_1:
            retq
    .Lfunc_end7:
    
        -----   WITH VIEWS    -----
    
    __gufunc__._ZN13_3cdynamic_3e40__numba_parfor_gufunc_0x272bef6ed00_2491B104c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Ai0vgkhqAgA_3dE5ArrayIyLi1E1C7mutable7alignedE23Literal_5bint_5d_283_29x5ArrayIdLi2E1C7mutable7alignedE:
            .cfi_startproc
            pushq   %r15
            .cfi_def_cfa_offset 16
            pushq   %r14
            .cfi_def_cfa_offset 24
            pushq   %r13
            .cfi_def_cfa_offset 32
            pushq   %r12
            .cfi_def_cfa_offset 40
            pushq   %rsi
            .cfi_def_cfa_offset 48
            pushq   %rdi
            .cfi_def_cfa_offset 56
            pushq   %rbp
            .cfi_def_cfa_offset 64
            pushq   %rbx
            .cfi_def_cfa_offset 72
            subq    $168, %rsp
            vmovaps %xmm6, 144(%rsp)
            .cfi_def_cfa_offset 240
            .cfi_offset %rbx, -72
            .cfi_offset %rbp, -64
            .cfi_offset %rdi, -56
            .cfi_offset %rsi, -48
            .cfi_offset %r12, -40
            .cfi_offset %r13, -32
            .cfi_offset %r14, -24
            .cfi_offset %r15, -16
            .cfi_offset %xmm6, -96
            movq    (%rdx), %rax
            movq    24(%rdx), %r12
            movq    (%rcx), %rdx
            movq    %rdx, 120(%rsp)
            movq    8(%rcx), %rdx
            movq    %rdx, 112(%rsp)
            movq    (%r8), %rdx
            movq    %rdx, 104(%rsp)
            movq    8(%r8), %rdx
            movq    %rdx, 96(%rsp)
            movq    16(%rcx), %rdx
            movq    %rdx, 88(%rsp)
            movq    16(%r8), %rdx
            movq    %rdx, 80(%rsp)
            movq    24(%rcx), %rcx
            movq    %rcx, 64(%rsp)
            movq    24(%r8), %rcx
            movq    %rcx, 56(%rsp)
            movl    $0, 36(%rsp)
            movq    %rax, 72(%rsp)
            testq   %rax, %rax
            jle     .LBB5_12
            cmpq    $4, %r12
            movl    $3, %eax
            cmovlq  %r12, %rax
            movq    %rax, 48(%rsp)
            movq    %r12, %rbx
            sarq    $63, %rbx
            andq    %r12, %rbx
            xorl    %eax, %eax
            vxorps  %xmm6, %xmm6, %xmm6
            jmp     .LBB5_2
            .p2align        4, 0x90
    .LBB5_9:
            movq    136(%rsp), %rcx
            movabsq $NRT_decref, %rsi
            movq    %r9, %rdi
            callq   *%rsi
            movq    %rdi, %rcx
            callq   *%rsi
            movq    40(%rsp), %rax
            incq    %rax
            cmpq    72(%rsp), %rax
            je      .LBB5_12
    .LBB5_2:
            movq    %rax, %rbp
            imulq   104(%rsp), %rbp
            movq    %rax, %rcx
            imulq   96(%rsp), %rcx
            movq    112(%rsp), %rdx
            movq    (%rdx,%rcx), %rcx
            movq    %rcx, 128(%rsp)
            movq    %rax, 40(%rsp)
            movq    %rax, %rcx
            imulq   80(%rsp), %rcx
            movq    88(%rsp), %rdx
            movq    (%rdx,%rcx), %rdi
            movq    120(%rsp), %rcx
            movq    (%rbp,%rcx), %r13
            movq    8(%rbp,%rcx), %r14
            subq    %r13, %r14
            incq    %r14
            movl    $24, %ecx
            movl    $32, %edx
            movabsq $NRT_MemInfo_alloc_safe_aligned, %rsi
            callq   *%rsi
            movq    %rax, 136(%rsp)
            movq    24(%rax), %r15
            movl    $24, %ecx
            movl    $32, %edx
            callq   *%rsi
            movq    %rax, %r9
            testq   %r14, %r14
            jle     .LBB5_9
            xorl    %edx, %edx
            testq   %rdi, %rdi
            setg    %r8b
            testq   %rdi, %rdi
            jle     .LBB5_9
            movq    128(%rsp), %rax
            cmpq    %rax, 48(%rsp)
            jne     .LBB5_10
            movq    40(%rsp), %rax
            imulq   56(%rsp), %rax
            addq    64(%rsp), %rax
            movq    24(%r9), %rcx
            movb    %r8b, %dl
            negq    %rdx
            leaq    (%rdi,%rdx), %rsi
            incq    %rsi
            .p2align        4, 0x90
    .LBB5_6:
            movq    %r13, %rdi
            imulq   %r12, %rdi
            addq    %rbx, %rdi
            movq    %rsi, %rdx
            xorl    %ebp, %ebp
            .p2align        4, 0x90
    .LBB5_7:
            vcvtsi2sd       %rbp, %xmm2, %xmm0
            vsqrtsd %xmm0, %xmm0, %xmm0
            vmovups %xmm6, (%r15)
            movq    $0, 16(%r15)
            vmovsd  %xmm0, (%rcx)
            vaddsd  (%rax,%rdi,8), %xmm0, %xmm1
            vmovsd  %xmm1, (%rax,%rdi,8)
            vaddsd  8(%r15), %xmm0, %xmm1
            vmovsd  %xmm1, 8(%rcx)
            vaddsd  8(%rax,%rdi,8), %xmm1, %xmm1
            vmovsd  %xmm1, 8(%rax,%rdi,8)
            vaddsd  16(%r15), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rcx)
            vaddsd  16(%rax,%rdi,8), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rax,%rdi,8)
            addq    %r13, %rbp
            decq    %rdx
            testq   %rdx, %rdx
            jg      .LBB5_7
            leaq    -1(%r14), %rdx
            incq    %r13
            cmpq    $1, %r14
            movq    %rdx, %r14
            jg      .LBB5_6
            jmp     .LBB5_9
    .LBB5_10:
            vxorps  %xmm0, %xmm0, %xmm0
            vmovups %xmm0, (%r15)
            movq    $0, 16(%r15)
            movabsq $numba_gil_ensure, %rax
            leaq    36(%rsp), %rcx
            callq   *%rax
            movabsq $PyErr_Clear, %rax
            callq   *%rax
            movabsq $.const.pickledata.2691858029760, %rcx
            movabsq $.const.pickledata.2691858029760.sha1, %r8
            movabsq $numba_unpickle, %rax
            movl    $180, %edx
            callq   *%rax
            testq   %rax, %rax
            je      .LBB5_11
            movabsq $numba_do_raise, %rdx
            movq    %rax, %rcx
            callq   *%rdx
    .LBB5_11:
            movabsq $numba_gil_release, %rax
            leaq    36(%rsp), %rcx
            callq   *%rax
    .LBB5_12:
            vmovaps 144(%rsp), %xmm6
            addq    $168, %rsp
            popq    %rbx
            popq    %rbp
            popq    %rdi
            popq    %rsi
            popq    %r12
            popq    %r13
            popq    %r14
            popq    %r15
            retq
    .Lfunc_end5:
    

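    As a side note, one way to obtain such listings (not necessarily the way this one was produced) is to compile the function and then dump its assembly from the dispatcher; the parallel gufunc itself may only show up when running with the NUMBA_DUMP_ASSEMBLY=1 environment variable set:

        f(10)                                     # compile for a concrete signature first
        for sig, asm in f.inspect_asm().items():  # {signature: assembly string}
            print(sig)
            print(asm[:2000])                     # the full listing is very long
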
    The code of the first version is huge compared to that of the second. Overall, we can see that the main computational part is about the same:

        -----  WITHOUT VIEWS  ----- 
    
    .LBB6_6:
            imulq   %r8, %r12
            vcvtsi2sd       %r12, %xmm2, %xmm0
            vsqrtsd %xmm0, %xmm0, %xmm0
            movq    120(%rsp), %rcx
            vmovups %xmm6, (%rcx)
            movq    $0, 16(%rcx)
            movq    248(%rsp), %rdx
            vmovsd  %xmm0, (%rdx)
            vaddsd  (%rbp), %xmm0, %xmm1
            vmovsd  %xmm1, (%rbp)
            vaddsd  8(%rcx), %xmm0, %xmm1
            vmovsd  %xmm1, 8(%rdx)
            vaddsd  8(%rbp), %xmm1, %xmm1
            vmovsd  %xmm1, 8(%rbp)
            vaddsd  16(%rcx), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rdx)
            movq    %rax, %r12
            vaddsd  16(%rbp), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rbp)
            cmpb    $0, 39(%rsp)
            jne     .LBB6_7
    
        -----   WITH VIEWS    -----
    
    .LBB5_7:
            vcvtsi2sd       %rbp, %xmm2, %xmm0
            vsqrtsd %xmm0, %xmm0, %xmm0
            vmovups %xmm6, (%r15)
            movq    $0, 16(%r15)
            vmovsd  %xmm0, (%rcx)
            vaddsd  (%rax,%rdi,8), %xmm0, %xmm1
            vmovsd  %xmm1, (%rax,%rdi,8)
            vaddsd  8(%r15), %xmm0, %xmm1
            vmovsd  %xmm1, 8(%rcx)
            vaddsd  8(%rax,%rdi,8), %xmm1, %xmm1
            vmovsd  %xmm1, 8(%rax,%rdi,8)
            vaddsd  16(%r15), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rcx)
            vaddsd  16(%rax,%rdi,8), %xmm0, %xmm0
            vmovsd  %xmm0, 16(%rax,%rdi,8)
            addq    %r13, %rbp
            decq    %rdx
            testq   %rdx, %rdx
            jg      .LBB5_7
    

    While the code of the first version is a bit less efficient than that of the second, the difference is certainly far from sufficient to explain the huge gap in the timings (~65 ms vs. <0.6 ms).

    We can also see that the function calls in the assembly code are different between the two versions:

        -----  WITHOUT VIEWS  ----- 
    
    memcpy
    NRT_Allocate
    NRT_Free
    NRT_decref
    NRT_incref
    NRT_MemInfo_alloc_safe_aligned
    
        -----   WITH VIEWS    -----
    
    numba_do_raise
    numba_gil_ensure
    numba_gil_release
    numba_unpickle
    PyErr_Clear
    NRT_decref
    NRT_MemInfo_alloc_safe_aligned
    

    The NRT_Allocate, NRT_Free, NRT_decref and NRT_incref calls indicate that the compiled code creates a new Python-managed object in the middle of the hot loop, which is very inefficient. Meanwhile, the second version does not perform any NRT_incref, and I suspect NRT_decref is never actually called (or maybe just once). The second version performs no NumPy array allocation inside the innermost loop. It looks like the calls to PyErr_Clear, numba_do_raise and numba_unpickle are there to manage exceptions that can possibly be raised (surprisingly they are not present in the first version, so this is likely related to the use of views). Finally, the call to memcpy in the first version shows that the newly created array is certainly copied into x. The allocation and the copy make the first version very inefficient.

    I am pretty surprised that Numba does not generate allocations for zeros(3). This is great, but you should really avoid creating arrays in hot loops like this, since there is no guarantee Numba will always optimize such a call away. In fact, it often does not.
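
    For instance, here is a minimal sketch (using the same imports as in the question) that hoists the temporary buffer out of the inner loop, so it is allocated n times instead of n*n times:

        @njit(parallel=True)
        def f(n):
            x = zeros((n, 3))
            for i in prange(n):
                tmp = zeros(3)          # one allocation per outer iteration only
                for j in range(n):
                    s = sqrt(i * j)
                    for k in range(3):
                        tmp[k] = s      # reuse the same buffer
                    x[i, :] += tmp
            return x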

    You can use a basic loop to copy all the items of a slice so as to avoid any allocation. This is often faster when the size of the slice is known at compile time. Slice copies could in theory be faster since the compiler might vectorize them better, but in practice such simple loops are relatively well auto-vectorized.
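
    Concretely, a sketch of such a "basic loop" rewrite of f (no temporary array at all, assuming the imports from the question):

        @njit(parallel=True)
        def f(n):
            x = zeros((n, 3))
            for i in prange(n):
                for j in range(n):
                    v = sqrt(i * j)
                    for k in range(3):   # plain element-wise loop, no temporary array
                        x[i, k] += v
            return x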

    One can note that the vsqrtsd instruction is present in the code of both versions, so the lambda is actually inlined.

    When you move the lambda out of the function and put its content in another jitted function, LLVM may not inline it. You can request Numba to inline the function manually, before the intermediate representation (IR) is generated, so that LLVM should produce similar code. This can be done with the inline="always" flag. This tends to increase the compilation time though (since the code is essentially copy-pasted into the caller). Inlining is critical for applying many further optimizations (constant propagation, SIMD vectorization, etc.), which can result in a huge performance boost.
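
    For example, here is a sketch of the third version with this flag (whether it fully recovers the fast path may depend on the Numba/LLVM version):

        @njit(inline="always")   # inline g at the Numba IR level, before LLVM sees it
        def g(i, j):
            return zeros(3) + sqrt(i * j)

        @njit(parallel=True)
        def f(n):
            x = zeros((n, 3))
            for i in prange(n):
                for j in range(n):
                    x[i, :] += g(i, j)
            return x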