optimization assembly android-ndk arm neon

Why does setting LOCAL_ARM_NEON double the speed without special code-paths?

I'm using the Android NDK with the LLVM toolchain for some heavy numerical code.

I've found that if I set LOCAL_ARM_NEON := true in my Android.mk, I get almost a 50% speedup in my code. I have not written any NEON-specific source files, and have no NEON intrinsics in my code. Does this mean that the compiler is automatically generating NEON instructions for my ordinary code?

If so, then because the NEON instructions are all generated by the compiler, I can't wrap the NEON code paths with a check for hardware support. Is there a best practice here? Or is LOCAL_ARM_NEON := true fundamentally unsafe?

Further details: (caveat, I'm not very experienced reading ARM assembly)

Comparison of generated assembly.

I find the slower assembly quite readable, but the faster assembly very challenging to read. ~~I also can't tell if it has NEON-specific instructions in it, since both generated files seem to have vmul instructions, which this page claims are NEON-specific.~~ EDIT: apparently vmul is not NEON-specific: the slower listing uses the scalar VFP form on s registers, while the faster listing uses it on q and d registers.

Slower code (not setting the `LOCAL_ARM_NEON` flag)

00000000 <_dotProduct>:
   0:   ed9f 0a08   vldr    s0, [pc, #32]   ; 24 <_dotProduct+0x24>
   4:   2a01        cmp r2, #1
   6:   bfb8        it  lt
   8:   4770        bxlt    lr
   a:   ed90 1a00   vldr    s2, [r0]
   e:   3004        adds    r0, #4
  10:   ed91 2a00   vldr    s4, [r1]
  14:   3104        adds    r1, #4
  16:   3a01        subs    r2, #1
  18:   ee22 1a01   vmul.f32    s2, s4, s2
  1c:   ee31 0a00   vadd.f32    s0, s2, s0
  20:   d1f3        bne.n   a <_dotProduct+0xa>
  22:   4770        bx  lr
  24:   00000000    .word   0x00000000

Faster code (with `LOCAL_ARM_NEON := true`)

00000000 <_dotProduct>:
   0:   b510        push    {r4, lr}
   2:   2a01        cmp r2, #1
   4:   db1b        blt.n   3e <_dotProduct+0x3e>
   6:   2a00        cmp r2, #0
   8:   d01c        beq.n   44 <_dotProduct+0x44>
   a:   efc0 0050   vmov.i32    q8, #0  ; 0x00000000
   e:   f022 0c03   bic.w   ip, r2, #3
  12:   f1bc 0f00   cmp.w   ip, #0
  16:   d01a        beq.n   4e <_dotProduct+0x4e>
  18:   46e6        mov lr, ip
  1a:   460b        mov r3, r1
  1c:   4604        mov r4, r0
  1e:   f964 2a8f   vld1.32 {d18-d19}, [r4]
  22:   f1be 0e04   subs.w  lr, lr, #4
  26:   f104 0410   add.w   r4, r4, #16
  2a:   f963 4a8f   vld1.32 {d20-d21}, [r3]
  2e:   f103 0310   add.w   r3, r3, #16
  32:   ff44 2df2   vmul.f32    q9, q10, q9
  36:   ef42 0de0   vadd.f32    q8, q9, q8
  3a:   d1f0        bne.n   1e <_dotProduct+0x1e>
  3c:   e009        b.n 52 <_dotProduct+0x52>
  3e:   ef80 0010   vmov.i32    d0, #0  ; 0x00000000
  42:   bd10        pop {r4, pc}
  44:   ef80 0010   vmov.i32    d0, #0  ; 0x00000000
  48:   f04f 0c00   mov.w   ip, #0
  4c:   e00b        b.n 66 <_dotProduct+0x66>
  4e:   f04f 0c00   mov.w   ip, #0
  52:   eff0 28e0   vext.8  q9, q8, q8, #8
  56:   4594        cmp ip, r2
  58:   ef40 0de2   vadd.f32    q8, q8, q9
  5c:   fffc 2c60   vdup.32 q9, d16[1]
  60:   ef00 0de2   vadd.f32    q0, q8, q9
  64:   d011        beq.n   8a <_dotProduct+0x8a>
  66:   eb01 018c   add.w   r1, r1, ip, lsl #2
  6a:   eb00 008c   add.w   r0, r0, ip, lsl #2
  6e:   eba2 020c   sub.w   r2, r2, ip
  72:   ed90 2a00   vldr    s4, [r0]
  76:   3004        adds    r0, #4
  78:   ed91 3a00   vldr    s6, [r1]
  7c:   3104        adds    r1, #4
  7e:   3a01        subs    r2, #1
  80:   ff43 0d12   vmul.f32    d16, d3, d2
  84:   ef00 0d80   vadd.f32    d0, d16, d0
  88:   d1f3        bne.n   72 <_dotProduct+0x72>
  8a:   bd10        pop {r4, pc}
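
Since the faster listing is hard to follow, here is a rough plain-C model of what the two listings appear to do. This is a sketch, not my actual source: the names `dot_scalar` and `dot_vector_model` are made up, and the signature (two float pointers and an int count) is inferred from the use of r0, r1 and r2 in the disassembly. The second function mimics the structure of the NEON code: four partial sums in one q register, a horizontal add, then a scalar loop for the leftover elements.

```c
#include <stddef.h>

/* Model of the slower listing: one load pair, one multiply and one add
 * per element, accumulating into a single scalar (s0). */
float dot_scalar(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Model of the faster listing, written in plain C (no intrinsics). */
float dot_vector_model(const float *a, const float *b, int n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};   /* q8: four partial sums */
    int vec_n = (n > 0) ? (n & ~3) : 0;        /* bic.w ip, r2, #3 */
    int i = 0;

    /* Main loop: four elements per iteration (vld1.32, vmul.f32 q, vadd.f32 q). */
    for (; i < vec_n; i += 4) {
        for (int lane = 0; lane < 4; ++lane) {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }

    /* Horizontal add: vext.8 pairs lane 0 with 2 and lane 1 with 3,
     * then vdup.32 adds the two remaining partial sums together. */
    float sum = (acc[0] + acc[2]) + (acc[1] + acc[3]);

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that the vector version adds the products in a different order than the scalar one, so for general floating-point inputs the two results can differ in the last bits.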

Solution

OK, I'm going to answer this myself based on the helpful comments from @Michael and @Notlikethat. The speedup, then, comes from the NEON instructions (of course).

It appears that setting LOCAL_ARM_NEON := true allows the compiler to generate NEON instructions for every source file in the module, not just those with the .neon suffix. This makes the resulting library unusable on ARMv7 devices that do not support NEON.
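
One way to confirm how a particular file was built is to test the compiler's predefined NEON macros. This is only a sketch: `built_with_neon` is a made-up helper, and it reports how this translation unit was compiled, not whether the device it runs on actually has NEON.

```c
/* Returns 1 if this file was compiled with NEON code generation enabled,
 * 0 otherwise. GCC and Clang define __ARM_NEON__ (and the ACLE name
 * __ARM_NEON) when NEON is enabled for the translation unit. */
int built_with_neon(void) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```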

I think this gives me two choices. The first is to compile two versions of my lib, one with and one without LOCAL_ARM_NEON := true, and decide in Java which one to load based on whether the CPU supports NEON.

The second is to not set LOCAL_ARM_NEON := true, but instead to duplicate my performance-sensitive code paths into a .c.neon file (which allows only that file to be compiled with NEON support). Then, in the main file, use the cpufeatures lib to detect NEON support and switch to the NEON version if it is available.
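
A minimal sketch of that second option, with some assumptions spelled out. The names `dot_product_plain`, `dot_product_neon_copy`, `choose_dot_product` and `cpu_has_neon` are hypothetical. In a real build the NEON copy would live in its own .c.neon file so that only it is compiled with NEON enabled; it sits in the same file here just to keep the example self-contained. The detection uses `android_getCpuFamily()` and `android_getCpuFeatures()` from the NDK's cpufeatures library (`cpu-features.h`); the fallback that returns 0 on other platforms is my own addition so the sketch compiles anywhere.

```c
#include <stddef.h>

#if defined(__ANDROID__) && defined(__arm__)
#include <cpu-features.h>   /* NDK cpufeatures library */
#endif

typedef float (*dot_fn)(const float *, const float *, int);

/* Portable path: lives in a normal .c file, built without NEON. */
float dot_product_plain(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

/* Stand-in for the duplicate that would be moved into a .c.neon file,
 * where the compiler is allowed to vectorize it with NEON. */
float dot_product_neon_copy(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

/* Runtime check for NEON support. */
int cpu_has_neon(void) {
#if defined(__ANDROID__) && defined(__arm__)
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#else
    return 0;   /* assumption: treat non-Android builds as "no NEON" */
#endif
}

/* Pick an implementation from the detection result. */
dot_fn choose_dot_product(int has_neon) {
    return has_neon ? dot_product_neon_copy : dot_product_plain;
}

/* Public entry point: detect once, then reuse the chosen function. */
float dot_product(const float *a, const float *b, int n) {
    static dot_fn impl = NULL;
    if (impl == NULL) impl = choose_dot_product(cpu_has_neon());
    return impl(a, b, n);
}
```

Callers only ever see `dot_product`, so the NEON-enabled object code is never executed unless the check passes.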
