What is the difference between sse2neon and arm_neon.h?

I am trying to build software to run on AWS Graviton3. To get the best performance, AWS advises using sse2neon to port code with SSE intrinsics to NEON (porting-codes-with-sseavx-intrinsics-to-neon).

While modifying the headers, I found that arm_neon.h is included when the arm64 architecture is detected. Are there any benefits to using sse2neon instead of arm_neon.h? Should I include both headers side by side?

And what is the difference between them anyway?

Solution

Are there any benefits to using sse2neon instead of arm_neon.h?

The benefit is that you can compile code written with x86 SSE intrinsics, such as _mm_add_epi32 on __m128i vectors (Intel intrinsics guide), instead of having to port it manually to uint32x4_t with vaddq_u32 (ARM intrinsics guide filtered for integer addition on AArch64 with NEON). sse2neon.h is itself implemented on top of arm_neon.h: it is a translation layer over the native NEON intrinsics, not an alternative to them.
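As a rough illustration of that 1:1 case, here is a minimal sketch of the same four-lane 32-bit addition written once with SSE2 intrinsics and once with NEON intrinsics. The helper name add4, the preprocessor checks, and the scalar fallback are assumptions made for this example; they are not taken from sse2neon.

```c
#include <stdint.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>

/* Native NEON version: out[i] = a[i] + b[i] for i in 0..3. */
static void add4(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    uint32x4_t va = vld1q_u32(a);        /* load 4 x uint32 */
    uint32x4_t vb = vld1q_u32(b);
    vst1q_u32(out, vaddq_u32(va, vb));   /* lane-wise add, then store */
}

#elif defined(__SSE2__)
#include <emmintrin.h>

/* x86 SSE2 version of the same operation. */
static void add4(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

#else

/* Scalar fallback so the sketch still builds on other targets. */
static void add4(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

#endif
```

With sse2neon, the SSE2 branch could be compiled on AArch64 unchanged by including sse2neon.h in place of emmintrin.h; for an operation like this, the result should be equivalent to the hand-written NEON branch.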

NEON and SSE2 are different instruction sets, and some of their instructions differ, e.g. the shuffles. NEON has a lot of horizontal pairwise operations, such as addition. x86 has _mm_movemask_epi8, which takes one bit per byte of the vector and puts it in an int; x86 CPUs can move data between the SIMD and integer domains fairly efficiently, which is useful for things like memcmp or strlen where you want to branch on SIMD compare results. ARM / AArch64 CPUs don't have an equivalent instruction.

Of course, for simple vertical operations there are 1:1 drop-in replacements, so there is no benefit to porting by hand. But where the sse2neon implementation of an x86 intrinsic takes multiple ARM intrinsics, it can be worth porting manually, especially if that code is inside a loop rather than cleanup that runs once.
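To make the movemask case concrete, here is a hedged sketch of one possible manual port: finding the first byte in a 16-byte block that equals a given value. The x86 branch uses the usual compare plus _mm_movemask_epi8 pattern. The AArch64 branch does not emulate movemask; it uses a narrowing shift (vshrn_n_u16) to build a 64-bit mask with 4 bits per input byte, which is one commonly used alternative. The function name first_match16 is hypothetical, and the sketch assumes a little-endian target and a GCC or Clang compiler for the __builtin_ctz family.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Index of the first byte equal to needle in p[0..15], or 16 if none. */
static int first_match16(const uint8_t *p, uint8_t needle)
{
    /* Each matching byte becomes 0xFF, each non-matching byte 0x00. */
    uint8x16_t eq = vceqq_u8(vld1q_u8(p), vdupq_n_u8(needle));
    /* Shift every 16-bit lane right by 4 and narrow to 8 bits:
       the result keeps one nibble (4 bits) per original byte. */
    uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
    /* 4 mask bits per byte, so divide the bit index by 4. */
    return mask ? (int)(__builtin_ctzll(mask) >> 2) : 16;
}

#elif defined(__SSE2__)
#include <emmintrin.h>

/* Same contract, using one mask bit per byte via movemask. */
static int first_match16(const uint8_t *p, uint8_t needle)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)needle));
    int mask = _mm_movemask_epi8(eq);
    return mask ? __builtin_ctz((unsigned)mask) : 16;
}

#else

/* Scalar fallback so the sketch still builds on other targets. */
static int first_match16(const uint8_t *p, uint8_t needle)
{
    for (int i = 0; i < 16; i++)
        if (p[i] == needle)
            return i;
    return 16;
}

#endif
```

Whether a hand-written version like this actually beats the sse2neon translation of _mm_movemask_epi8 depends on the compiler and the core, so it is worth benchmarking on the target (e.g. Graviton3) before committing to a manual port.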