I have some code that uses the vtbl2_u8 ARM Neon intrinsic function. When I compile with armv7 or armv7s architectures, this code compiles (and executes) correctly. However, when I try to compile targeting arm64, I get errors:
simd.h: error: call to unavailable function 'vtbl2_u8'
My Xcode version is 6.1, iPhone SDK 8.1. Looking at arm64_neon_internal.h, the definition for vtbl2_u8 has an __attribute__((unavailable)). There is a definition for vtbl2q_u8, but it takes different parameter types. Is there a direct replacement for the vtbl2 intrinsic for arm64?
As documented in the ARM NEON intrinsics reference ( http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf ), vtbl2_u8 is expected to be provided by any compiler that implements the ARM C Language Extensions (ACLE) for the AArch64 state in ARMv8-A. Note that the same document suggests that vtbl2q_u8 is an Xcode extension, rather than an intrinsic which ACLE compilers are expected to support.
The answer to your question, then, is that there should be no need for a replacement for vtbl2_u8, as it should be provided. However, that doesn't help you with your real problem, which is how to use the instruction with a compiler which does not provide it.
Looking at what you have available in Xcode, and what vtbl2_u8 is documented to map to, I think you should be able to emulate the expected behaviour with:
uint8x8_t vtbl2_u8 (uint8x8x2_t a, uint8x8_t b)
{
  /* Build the 128-bit vector mask from the two 64-bit halves. */
  uint8x16_t new_mask = vcombine_u8 (a.val[0], a.val[1]);
  /* Use an Xcode specific intrinsic. */
  return vtbl1q_u8 (new_mask, b);
}
I don't have an Xcode toolchain to test with, though, so you'll have to confirm that this does what you expect.
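One way to do that confirmation is to compare the NEON result against a plain scalar model of the lookup. The sketch below is only that model: it assumes the documented table-lookup behaviour (an index below 16 selects a byte from the 16-byte table formed by a.val[0] followed by a.val[1], and any larger index yields 0). The name ref_vtbl2_u8 is made up for illustration and is not part of any toolchain.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the documented vtbl2_u8 behaviour.
 * table: the 16 table bytes, a.val[0] followed by a.val[1].
 * idx:   the 8 lookup indices (the second argument of vtbl2_u8).
 * out:   receives table[idx[i]] for idx[i] < 16, and 0 otherwise.
 * ref_vtbl2_u8 is a hypothetical helper name, not a real intrinsic. */
void ref_vtbl2_u8(const uint8_t table[16], const uint8_t idx[8], uint8_t out[8])
{
    for (size_t i = 0; i < 8; ++i) {
        out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
    }
}
```

Storing the result of the emulated vtbl2_u8 with vst1_u8 and comparing it byte for byte with out, for a handful of index vectors that include out-of-range values, should be enough to show whether the two agree.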
If this is appearing in performance critical code, you may find that the vcombine_u8 is an unacceptable extra instruction. Fundamentally a uint8x8x2_t lives in two consecutive registers, which gives a different layout between AArch64 and AArch32 (where Q0 was D0:D1). On AArch64 the table lookup needs the whole 128-bit (16-byte) mask in a single register, so the two 64-bit halves have to be combined first.
Rewriting the producer of the uint8x8x2_t data to produce a uint8x16_t is the only other workaround for this, and is probably the one that will work best. Note that even in compilers which provide the vtbl2_u8 intrinsic (trunk GCC and Clang at time of writing), an instruction performing the vcombine_u8 is inserted, so you may still be seeing extra move instructions behind the scenes.