I am trying to build software to run on AWS Graviton3. To get the best performance, AWS advises using sse2neon to port code with SSE intrinsics to NEON (porting-codes-with-sseavx-intrinsics-to-neon).
While modifying the headers I found that arm_neon.h is included when the arm64 architecture is detected. Are there any benefits to using sse2neon instead of arm_neon.h? Should I include both headers side by side?
And what is the difference between them anyway?
Are there any benefits to using sse2neon instead of arm_neon.h?
The benefit is that you can compile code written with x86 SSE intrinsics like _mm_add_epi32 on __m128i vectors (Intel intrinsics guide), instead of having to port it manually to use uint32x4_t with vaddq_u32 (ARM intrinsics guide filtered for integer addition on AArch64 with NEON).
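To make the difference concrete, here is a minimal sketch of the same four-lane 32-bit addition written both ways. The function name add4_u32 is made up for illustration, and the preprocessor guards are just one way to select a path. With sse2neon, the SSE2 branch is the kind of code that would compile unchanged on AArch64 once the x86 header is swapped for sse2neon.h (not done here, so the example has no external dependency).

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>   /* native NEON intrinsics */
#elif defined(__SSE2__)
#include <emmintrin.h>  /* native SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for four 32-bit lanes. */
void add4_u32(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
#if defined(__aarch64__)
    /* Hand-ported NEON version: uint32x4_t with vaddq_u32. */
    uint32x4_t va = vld1q_u32(a);
    uint32x4_t vb = vld1q_u32(b);
    vst1q_u32(out, vaddq_u32(va, vb));
#elif defined(__SSE2__)
    /* Original x86 version: __m128i with _mm_add_epi32. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback so the sketch builds on any target. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Both SIMD branches are a load, one add, and a store, which is why a simple vertical operation like this gains nothing from manual porting.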
NEON and SSE2 are different instruction sets, and some of their instructions have no direct counterpart, e.g. the shuffles differ. NEON has a lot of horizontal pairwise operations such as pairwise addition, while x86 has _mm_movemask_epi8 to take one bit per byte of the vector and put it in an int. x86 CPUs can fairly efficiently move data between the SIMD and integer domains, which is useful for stuff like memcmp or strlen when you want to branch on SIMD compare results. ARM / AArch64 CPUs don't have an equivalent instruction.
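As a rough illustration of what that costs, the sketch below computes the same 16-bit mask on both architectures. The helper name movemask_u8 and the weights table are my own, and the NEON sequence is just one possible emulation; I am not claiming it matches what sse2neon does internally.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: bit i of the result is the top bit of bytes[i]. */
int movemask_u8(const uint8_t bytes[16])
{
#if defined(__aarch64__)
    /* One power-of-two weight per byte position within each 8-byte half. */
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t msb  = vshrq_n_u8(vld1q_u8(bytes), 7);    /* 0 or 1 per byte */
    uint8x16_t bits = vmulq_u8(msb, vld1q_u8(weights));  /* 0 or weight     */
    int lo = vaddv_u8(vget_low_u8(bits));                /* sum of 8 lanes  */
    int hi = vaddv_u8(vget_high_u8(bits));
    return lo | (hi << 8);
#elif defined(__SSE2__)
    /* x86: a single intrinsic. */
    return _mm_movemask_epi8(_mm_loadu_si128((const __m128i *)bytes));
#else
    /* Scalar fallback so the sketch builds on any target. */
    int mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (bytes[i] >> 7) << i;
    return mask;
#endif
}
```

The x86 branch is one intrinsic; the NEON branch needs a shift, a multiply, two horizontal adds, and some scalar work to combine the halves.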
Of course, for simple vertical operations there are 1:1 drop-in replacements, so there's no benefit to porting by hand. But where the sse2neon implementation of an x86 intrinsic takes multiple ARM intrinsics, it can be worth porting manually, especially if that's inside a loop rather than in cleanup code that runs once.
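For example, a byte search only needs to know whether any lane matched while it is inside the loop, so a hand-written NEON version can use a single horizontal max there and leave the position lookup to cleanup code. The following is a sketch under that assumption; find_byte is a hypothetical helper, not code taken from sse2neon or from any AWS guide.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: index of the first needle in buf, or len if absent. */
size_t find_byte(const uint8_t *buf, size_t len, uint8_t needle)
{
    size_t i = 0;
#if defined(__aarch64__)
    const uint8x16_t pattern = vdupq_n_u8(needle);
    for (; i + 16 <= len; i += 16) {
        uint8x16_t eq = vceqq_u8(vld1q_u8(buf + i), pattern);
        /* Hand-ported test: "did any lane match?" via one horizontal max,
           instead of emulating a full movemask on every iteration. */
        if (vmaxvq_u8(eq) != 0)
            break;
    }
#elif defined(__SSE2__)
    const __m128i pattern = _mm_set1_epi8((char)needle);
    for (; i + 16 <= len; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(buf + i));
        /* x86 idiom: movemask is cheap here, so branch on it directly. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(block, pattern)) != 0)
            break;
    }
#endif
    /* Cleanup: runs over at most one matching block plus the tail,
       so it is fine for this part to be simple scalar code. */
    for (; i < len; i++)
        if (buf[i] == needle)
            return i;
    return len;
}
```

Whether this actually beats the sse2neon translation on Graviton3 is something to measure; the point is only that the expensive emulated step has been moved out of the loop.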