Performance of unaligned SIMD load/store on aarch64

An older answer indicates that aarch64 supports unaligned reads/writes and mentions a performance cost, but it's unclear whether that answer covers only scalar (ALU) accesses or SIMD (128-bit register) operations, too.

Relative to aligned 128-bit NEON loads and stores, how much slower (if at all) are unaligned 128-bit NEON loads and stores on aarch64?

Are there separate instructions for unaligned SIMD loads and stores (as is the case with SSE2) or are the known-aligned loads/stores the same instructions as potentially-unaligned loads/stores?

Solution

Section 4.6, "Load/Store Alignment", of the Cortex-A57 Software Optimization Guide says:

The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:

Load operations that cross a cache-line (64-byte) boundary

Store operations that cross a 16-byte boundary
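To make the two quoted cases concrete, here is a minimal C sketch that checks whether a given access would fall into either of them. The helper names (`crosses_boundary`, `load_may_be_penalized`, `store_may_be_penalized`) are mine, not from the guide, and the 64-byte and 16-byte figures are simply the ones quoted above for the Cortex-A57; other cores may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if an access of `size` bytes starting at `addr` touches two
   different `boundary`-sized blocks. Assumes `boundary` is a power of
   two and `size` is at least 1. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    uintptr_t mask  = ~(uintptr_t)(boundary - 1);
    uintptr_t first = addr & mask;              /* block of the first byte */
    uintptr_t last  = (addr + size - 1) & mask; /* block of the last byte  */
    return first != last;
}

/* Quoted A57 case 1: loads that cross a 64-byte cache-line boundary. */
static bool load_may_be_penalized(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, 64);
}

/* Quoted A57 case 2: stores that cross a 16-byte boundary. */
static bool store_may_be_penalized(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, 16);
}
```

By this check, any 128-bit store that is not 16-byte aligned falls into the second case, while a 128-bit load only falls into the first case when its 16 bytes straddle a cache line.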

So it may depend on the processor that you are using: out-of-order (Cortex-A57, A72, A75) or in-order (Cortex-A35, A53, A55). I didn't find an optimization guide for the in-order processors; however, they do have a hardware performance counter that you could use to check whether the number of unaligned accesses affects performance:

    0x0F  UNALIGNED_LDST_RETIRED  Unaligned load-store

You can read this counter with the perf tool.

There are no special instructions for unaligned accesses in AArch64.
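As an illustration, here is a minimal sketch in C using the NEON intrinsics `vld1q_u8` and `vst1q_u8` from `arm_neon.h`. The function name `copy16_any_alignment` is hypothetical, and the `memcpy` fallback is only there so the snippet also builds on non-AArch64 hosts. The same intrinsics are used whether or not the pointers happen to be 16-byte aligned; there is no separate "aligned" variant to choose, unlike `_mm_load_si128` versus `_mm_loadu_si128` in SSE2.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Copy 16 bytes between pointers of arbitrary alignment. */
static void copy16_any_alignment(uint8_t *dst, const uint8_t *src)
{
#if defined(__aarch64__)
    /* One 128-bit load and one 128-bit store; the same code path is
       used for aligned and unaligned addresses. */
    uint8x16_t v = vld1q_u8(src);
    vst1q_u8(dst, v);
#else
    /* Portable fallback so the example compiles elsewhere. */
    memcpy(dst, src, 16);
#endif
}
```

Calling it as `copy16_any_alignment(out + 3, in + 1)` on plain byte buffers is valid, which makes it easy to time aligned and deliberately misaligned variants of the same loop.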