x86-64 simd arm64

Does using SIMD have an initialisation cost?


Do any commonly used consumer devices have a power or frequency ramp-up period before the SIMD subsystem can start at all, or before it can run at full frequency? Is the stall measured in clock cycles or in microseconds?

Conversely, how many non-SIMD instructions can one typically execute before SIMD performance is lost, or is that condition detected by some other means?

I'm mostly interested in modern arm64 (Cortex-A53, A55, A75 and A77 implementations, and the Apple M1).

EDIT

The Intel case seems to be reasonably covered in "SIMD instructions lowering CPU frequency", which leads to further links stating a maximum period of 8.5 μs for a "hard transition", during which the execution units are halted (if I understood it correctly). It also contradicts my intuition: using AVX-512 instructions apparently requires the frequency to be ramped down.


Solution

  • This answer applies to PCs, not ARM64.

    Do any of the commonly used consumer devices have a power/frequency ramp-up period before the SIMD subsystem can either start at all or to work on full frequency?

    The answer is “no” for starting at all. SSE was designed as a replacement for the x87 FPU, and CPUs never power off the SIMD hardware as a whole because most programs occasionally use floating-point math.

    However, Intel CPUs do power off part of that hardware. The first time a program uses 32-byte or 64-byte vectors, those instructions run a lot slower until the CPU has transitioned to the proper power state.

    For Intel Sandy Bridge, Ivy Bridge and Haswell, that penalty applies to 32-byte vectors.

    For Intel Skylake, the penalty applies to 32-byte and 64-byte vectors; the warm-up takes 56000 clock cycles, or 14 μs.

    For Intel Ice Lake and Tiger Lake, the penalty applies only to 64-byte vectors; the warm-up takes about 50000 clock cycles.

    During that warm-up period, throughput is halved and instructions have extra latency. Note that the warm-up does not depend on the instruction set, only on the size of the vectors: AVX1, AVX2 and AVX512 instructions that operate on 16-byte vectors always run at full speed.

    how many non-SIMD instructions can one typically execute before the SIMD performance is lost

    Skylake CPUs revert to the idle state after 2.7 million clock cycles (675 μs) have been spent running only instructions with a SIMD width of ≤ 16 bytes.

    For more information, see the microarchitecture guide by Agner Fog.
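On the "clock cycles or microseconds" part of the question: the Skylake figures above are the same quantities expressed both ways, and the two agree at a core clock of about 4 GHz. That clock is inferred from the quoted numbers, not stated in the answer. Below is a minimal Python sketch of the conversion, plus a deliberately simplified model of when the wide-vector state would be lost. The function names and the single-timer model are assumptions made for illustration; real hardware behaviour is more involved, and the thresholds differ between microarchitectures.

```python
# Skylake figures as quoted in the answer above.
WARMUP_CYCLES = 56_000       # warm-up before 32/64-byte vectors run at full speed
REVERT_CYCLES = 2_700_000    # cycles without wide vectors before reverting to idle


def cycles_to_us(cycles, clock_ghz):
    """Convert a cycle count to microseconds (1 GHz = 1000 cycles per microsecond)."""
    return cycles / (clock_ghz * 1000.0)


def implied_clock_ghz(cycles, microseconds):
    """Clock frequency at which `cycles` cycles take `microseconds` microseconds."""
    return cycles / (microseconds * 1000.0)


def needs_warmup(last_wide_cycle, now_cycle, revert_cycles=REVERT_CYCLES):
    """Toy model: True if a wide-vector instruction issued now would pay the warm-up.

    last_wide_cycle is the cycle count of the most recent 32/64-byte vector
    instruction, or None if none has run yet. This assumes one simple timer,
    which is a simplification and not a description of the real hardware.
    """
    if last_wide_cycle is None:
        return True
    return now_cycle - last_wide_cycle >= revert_cycles


print(implied_clock_ghz(WARMUP_CYCLES, 14))   # 4.0 (GHz)
print(cycles_to_us(REVERT_CYCLES, 4.0))       # 675.0 (microseconds)
print(needs_warmup(None, 0))                  # True: first wide-vector use
print(needs_warmup(0, 1_000_000))             # False: still within the window
print(needs_warmup(0, 3_000_000))             # True: window has expired
```

Because the penalty is counted in core clock cycles, its wall-clock duration depends on the frequency the core is running at; for example, `cycles_to_us(WARMUP_CYCLES, 2.0)` gives 28 μs under the same model.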