I'm using the NDK on Android for some heavy numerical code, built with the LLVM toolchain.
I've found that if I set LOCAL_ARM_NEON := true in my Android.mk, I get almost a 50% speedup. I have not written any NEON-specific source files and have no NEON intrinsics in my code. Does this mean that the compiler is automatically emitting NEON instructions?
If so, then because the NEON code is generated by the compiler, I can't wrap the NEON code paths in a check for hardware support. Is there a best practice here? Or is LOCAL_ARM_NEON := true fundamentally unsafe?
Further details (caveat: I'm not very experienced at reading ARM assembly):
I find the slower assembly here quite readable. The faster assembly I find very challenging to read. I also can't tell whether it contains NEON-specific instructions, since both generated files seem to have vmul instructions, which this page claims are NEON specific. EDIT: apparently vmul is not NEON specific.
Slower assembly (without the LOCAL_ARM_NEON flag):
00000000 <_dotProduct>:
0: ed9f 0a08 vldr s0, [pc, #32] ; 24 <_dotProduct+0x24>
4: 2a01 cmp r2, #1
6: bfb8 it lt
8: 4770 bxlt lr
a: ed90 1a00 vldr s2, [r0]
e: 3004 adds r0, #4
10: ed91 2a00 vldr s4, [r1]
14: 3104 adds r1, #4
16: 3a01 subs r2, #1
18: ee22 1a01 vmul.f32 s2, s4, s2
1c: ee31 0a00 vadd.f32 s0, s2, s0
20: d1f3 bne.n a <_dotProduct+0xa>
22: 4770 bx lr
24: 00000000 .word 0x00000000
Faster assembly (with LOCAL_ARM_NEON := true):
00000000 <_dotProduct>:
0: b510 push {r4, lr}
2: 2a01 cmp r2, #1
4: db1b blt.n 3e <_dotProduct+0x3e>
6: 2a00 cmp r2, #0
8: d01c beq.n 44 <_dotProduct+0x44>
a: efc0 0050 vmov.i32 q8, #0 ; 0x00000000
e: f022 0c03 bic.w ip, r2, #3
12: f1bc 0f00 cmp.w ip, #0
16: d01a beq.n 4e <_dotProduct+0x4e>
18: 46e6 mov lr, ip
1a: 460b mov r3, r1
1c: 4604 mov r4, r0
1e: f964 2a8f vld1.32 {d18-d19}, [r4]
22: f1be 0e04 subs.w lr, lr, #4
26: f104 0410 add.w r4, r4, #16
2a: f963 4a8f vld1.32 {d20-d21}, [r3]
2e: f103 0310 add.w r3, r3, #16
32: ff44 2df2 vmul.f32 q9, q10, q9
36: ef42 0de0 vadd.f32 q8, q9, q8
3a: d1f0 bne.n 1e <_dotProduct+0x1e>
3c: e009 b.n 52 <_dotProduct+0x52>
3e: ef80 0010 vmov.i32 d0, #0 ; 0x00000000
42: bd10 pop {r4, pc}
44: ef80 0010 vmov.i32 d0, #0 ; 0x00000000
48: f04f 0c00 mov.w ip, #0
4c: e00b b.n 66 <_dotProduct+0x66>
4e: f04f 0c00 mov.w ip, #0
52: eff0 28e0 vext.8 q9, q8, q8, #8
56: 4594 cmp ip, r2
58: ef40 0de2 vadd.f32 q8, q8, q9
5c: fffc 2c60 vdup.32 q9, d16[1]
60: ef00 0de2 vadd.f32 q0, q8, q9
64: d011 beq.n 8a <_dotProduct+0x8a>
66: eb01 018c add.w r1, r1, ip, lsl #2
6a: eb00 008c add.w r0, r0, ip, lsl #2
6e: eba2 020c sub.w r2, r2, ip
72: ed90 2a00 vldr s4, [r0]
76: 3004 adds r0, #4
78: ed91 3a00 vldr s6, [r1]
7c: 3104 adds r1, #4
7e: 3a01 subs r2, #1
80: ff43 0d12 vmul.f32 d16, d3, d2
84: ef00 0d80 vadd.f32 d0, d16, d0
88: d1f3 bne.n 72 <_dotProduct+0x72>
8a: bd10 pop {r4, pc}
OK, I'm going to answer this myself, based on the helpful comments from @Michael and @Notlikethat. My speedup, then, is because of the NEON instructions (of course).
It appears that setting LOCAL_ARM_NEON := true allows the compiler to generate NEON instructions for every source file in the module, even those without the .neon suffix. This makes the resulting code non-portable to ARMv7 CPUs that do not support NEON.
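To make the difference between the two listings easier to follow, here is a rough C model of what the faster one appears to do: the main loop processes four floats per iteration on q registers (vld1.32, vmul.f32, vadd.f32), the four partial sums are then folded into one (the vext/vdup/vadd sequence), and a scalar loop handles the leftover elements. This is only a sketch. The names dot_scalar and dot_four_lanes are made up for illustration, and the signature is my guess at what _dotProduct looks like in source.

```c
/* Scalar reference: one multiply and one add per element,
   like the slower listing. */
float dot_scalar(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Hand-written model of the vectorized shape (hypothetical name).
   Real NEON code keeps the four lanes in one q register. */
float dot_four_lanes(const float *a, const float *b, int n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int vec_n = (n > 0) ? (n & ~3) : 0;  /* round down to a multiple of 4 */
    int i = 0;

    /* Main loop: four independent partial sums per iteration. */
    for (; i < vec_n; i += 4) {
        for (int k = 0; k < 4; ++k) {
            lane[k] += a[i + k] * b[i + k];
        }
    }

    /* Horizontal reduction: add the upper half to the lower half,
       then add the two remaining lanes together. */
    float sum = (lane[0] + lane[2]) + (lane[1] + lane[3]);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that the four-lane version adds the products in a different order than the scalar loop, so for general floating-point input the two results can differ in the last bits.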
I think this gives me two choices. The first is to compile two versions of my lib, one with and one without LOCAL_ARM_NEON := true, and decide in Java which one to load based on whether the CPU supports NEON.
The second is to not set LOCAL_ARM_NEON := true, but instead to duplicate my performance-sensitive code paths into a .c.neon file (which allows only that file to be compiled with NEON support). Then, in the main file, use the cpufeatures lib to detect NEON support at runtime and switch to the NEON version if it is available.
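A minimal sketch of the second option, assuming the NDK cpufeatures module is linked in (as far as I know that means LOCAL_STATIC_LIBRARIES := cpufeatures plus $(call import-module,android/cpufeatures) in Android.mk). The names dot_plain, dot_neon, has_neon and select_dot are hypothetical. In a real project dot_neon would live in its own file listed with the .neon suffix, for example dot_neon.c.neon; here it is a stand-in in the same file so the example compiles on its own.

```c
#include <stdint.h>

#if defined(__ANDROID__) && defined(__arm__)
#include <cpu-features.h>  /* provided by the NDK cpufeatures module */
#endif

typedef float (*dot_fn)(const float *, const float *, int);

/* Baseline version, compiled without NEON. */
float dot_plain(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Stand-in for the copy that would be compiled from a .c.neon file.
   The source is identical; only the compiler flags would differ. */
float dot_neon(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Returns nonzero only on a 32-bit ARM Android device that reports NEON. */
static int has_neon(void) {
#if defined(__ANDROID__) && defined(__arm__)
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#else
    return 0;  /* other platforms: always take the baseline path */
#endif
}

/* Pick the implementation once and reuse the returned pointer. */
dot_fn select_dot(void) {
    return has_neon() ? dot_neon : dot_plain;
}
```

Calling select_dot() once at startup and storing the pointer keeps the feature check out of the hot loop. The important part is that no NEON-compiled function is ever called unless the check has passed.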