vtbl2 intrinsics on ARM64 missing

I have some code that uses the vtbl2_u8 ARM Neon intrinsic function. When I compile with armv7 or armv7s architectures, this code compiles (and executes) correctly. However, when I try to compile targeting arm64, I get errors:

simd.h: error: call to unavailable function 'vtbl2_u8'

My Xcode version is 6.1, iPhone SDK 8.1. Looking at arm64_neon_internal.h, the definition for vtbl2_u8 has an __attribute__((unavailable)). There is a definition for vtbl2q_u8, but it takes different parameter types. Is there a direct replacement for the vtbl2 intrinsic for arm64?

Solution

As documented in the ARM NEON intrinsics reference ( http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf ), any compiler that implements the ARM C Language Extensions (ACLE) for AArch64 state in ARMv8-A is expected to provide vtbl2_u8. The same document suggests that vtbl2q_u8 is an Xcode extension rather than an intrinsic that ACLE compilers are expected to support.

So the answer to your question is that there should be no need for a replacement for vtbl2_u8, because it should be provided. That doesn't help with your real problem, though, which is how to use the instruction with a compiler that does not provide the intrinsic.

Looking at what you have available in Xcode, and what vtbl2_u8 is documented to map to, I think you should be able to emulate the expected behaviour with:

uint8x8_t vtbl2_u8 (uint8x8x2_t a, uint8x8_t b)
{
  /* Build the 128-bit vector mask from the two 64-bit halves.  */
  uint8x16_t new_mask = vcombine_u8 (a.val[0], a.val[1]);
  /* Use an Xcode specific intrinsic.  */
  return vtbl1q_u8 (new_mask, b);
}

I don't have an Xcode toolchain to test with, though, so you'll have to confirm that it does what you expect.
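One way to do that confirmation is to compare the NEON result against a plain C model. The sketch below is only a scalar reference, not NEON code: it models the ARMv7 vtbl2_u8 behaviour (two 8-byte table halves, out-of-range indices yield 0) and the "combine, then look up in one 16-byte table" behaviour that the emulation above relies on, and checks that they agree for every index value. The function names (tbl2_ref, combine_tbl1q_ref, check_tbl_models) are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of ARMv7 vtbl2_u8: the table is two separate 8-byte
   halves; any index >= 16 produces 0.  (Hypothetical helper name.)  */
void tbl2_ref(const uint8_t lo[8], const uint8_t hi[8],
              const uint8_t idx[8], uint8_t out[8])
{
    for (int i = 0; i < 8; i++) {
        uint8_t j = idx[i];
        out[i] = (j < 8) ? lo[j] : (j < 16) ? hi[j - 8] : 0;
    }
}

/* Scalar model of vcombine_u8 followed by a lookup in a single
   16-byte table; any index >= 16 produces 0.  (Hypothetical name.)  */
void combine_tbl1q_ref(const uint8_t lo[8], const uint8_t hi[8],
                       const uint8_t idx[8], uint8_t out[8])
{
    uint8_t table[16];
    memcpy(table, lo, 8);       /* low half, like a.val[0]  */
    memcpy(table + 8, hi, 8);   /* high half, like a.val[1] */
    for (int i = 0; i < 8; i++)
        out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
}

/* Asserts that both models agree for all 256 possible index bytes.  */
void check_tbl_models(void)
{
    uint8_t lo[8], hi[8];
    for (int i = 0; i < 8; i++) {
        lo[i] = (uint8_t)(0x10 + i);
        hi[i] = (uint8_t)(0x80 + i);
    }
    for (int base = 0; base < 256; base += 8) {
        uint8_t idx[8], a[8], b[8];
        for (int i = 0; i < 8; i++)
            idx[i] = (uint8_t)(base + i);
        tbl2_ref(lo, hi, idx, a);
        combine_tbl1q_ref(lo, hi, idx, b);
        assert(memcmp(a, b, 8) == 0);
    }
}
```

On a device, you could feed the same lo, hi and idx bytes through the emulated vtbl2_u8 and compare its output with tbl2_ref, paying particular attention to indices of 16 and above.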

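If this appears in performance-critical code, you may find that the vcombine_u8 is an unacceptable extra instruction. Fundamentally, a uint8x8x2_t lives in two consecutive registers, and the register layout differs between AArch32 and AArch64: on AArch32 the two halves can already form one 128-bit register (Q0 is D0:D1), whereas on AArch64 they are two separate registers. The AArch64 instruction that vtbl2_u8 maps to requires a 128-bit (16-byte) table in a single register, so the two halves have to be combined first.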

Rewriting the producer of the uint8x8x2_t data to produce a uint8x16_t instead is the only other workaround, and it is probably the one that will work best. Note that even compilers which provide the vtbl2_u8 intrinsic (trunk GCC and Clang at the time of writing) insert an instruction that performs the vcombine_u8, so you may still see extra move instructions behind the scenes.
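As a rough sketch of that second workaround, the table can be kept as one contiguous 16-byte block and loaded straight into a uint8x16_t, so no combine is needed. This assumes a compiler whose arm_neon.h provides the ACLE intrinsics vld1q_u8, vld1_u8, vst1_u8 and vqtbl1_u8; I can't confirm that Xcode 6.1 exposes vqtbl1_u8 under that name, in which case the Xcode-specific vtbl1q_u8 used above would take its place. The helper name lookup16 is hypothetical, and the scalar branch is only there so that the example builds on non-AArch64 targets.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: look up 8 index bytes in a 16-byte table that
   is stored contiguously in memory.  Indices >= 16 produce 0.  */
void lookup16(const uint8_t table[16], const uint8_t idx[8], uint8_t out[8])
{
#if defined(__aarch64__)
    /* Assumes an ACLE-conforming arm_neon.h.  The table is loaded
       directly as one 128-bit vector, so no vcombine_u8 is needed.  */
    uint8x16_t t = vld1q_u8(table);
    uint8x8_t  i = vld1_u8(idx);
    vst1_u8(out, vqtbl1_u8(t, i));
#else
    /* Scalar fallback with the same semantics, for other targets.  */
    for (int k = 0; k < 8; k++)
        out[k] = (idx[k] < 16) ? table[idx[k]] : 0;
#endif
}
```

In real code the table would ideally stay in a uint8x16_t across calls rather than being reloaded each time; the load here just keeps the example self-contained.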