Do any commonly used consumer devices have a power or frequency ramp-up period before the SIMD subsystem can start at all, or before it can run at full frequency? Is the stall measured in clock cycles or in microseconds?
Conversely, how many non-SIMD instructions can one typically execute before the SIMD performance is lost, or is such a condition detected by some other means?
I'm mostly interested in modern arm64 (Cortex-A53, A55, A75 and A77 implementations, and the M1).
EDIT
The Intel case seems to be reasonably well covered in "SIMD instructions lowering CPU frequency", which leads to further links stating a maximum period of 8.5 μs for a "hard transition", during which the execution units are halted (if I understood it correctly). It also contradicts my intuition: using AVX-512 instructions apparently requires the frequency to be ramped down.
This answer applies to PCs, not ARM64.
Do any commonly used consumer devices have a power or frequency ramp-up period before the SIMD subsystem can start at all, or before it can run at full frequency?
No for "start at all". SSE was designed as a replacement for the x87 FPU, and on x86-64 scalar floating-point math runs on the SSE units. CPUs never power off the SIMD hardware as a whole, because most programs occasionally use floating-point math.
However, Intel CPUs do power off some of that hardware. The first time a program uses 32-byte or 64-byte vectors, those instructions run a lot slower until the CPU has transitioned to the proper power state.
For Intel Sandy Bridge, Ivy Bridge, Haswell, that penalty applies to 32-byte vectors.
For Intel Skylake, that penalty applies to 32-byte and 64-byte vectors; the warm-up duration is 56,000 clock cycles, or 14 μs.
For Intel Ice Lake and Tiger Lake, the penalty applies only to 64-byte vectors; the warm-up duration is about 50,000 clock cycles.
During that warm-up period, throughput is halved and instructions have extra latency. Note that the warm-up is agnostic to the instruction set; it depends only on the size of the vectors. AVX1, AVX2 and AVX-512 instructions that operate on 16-byte vectors always run at full speed.
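On the "clock cycles or microseconds" part of the question: the two Skylake figures above describe the same interval, so they imply a particular clock frequency. A minimal Python sketch of that conversion follows. The function names are made up for illustration, and the 3 GHz clock in the last line is an assumed example, not a measurement.

```python
# Skylake warm-up figures quoted above (attributed to Agner Fog's guide).
WARMUP_CYCLES = 56_000
WARMUP_US = 14.0


def implied_clock_ghz(cycles, microseconds):
    """Clock frequency in GHz at which `cycles` take `microseconds`."""
    # 1 GHz is 1000 cycles per microsecond.
    return cycles / (microseconds * 1_000.0)


def cycles_to_us(cycles, clock_ghz):
    """Duration in microseconds of `cycles` at a clock of `clock_ghz`."""
    return cycles / (clock_ghz * 1_000.0)


# The quoted pair of numbers corresponds to a 4 GHz clock.
print(implied_clock_ghz(WARMUP_CYCLES, WARMUP_US))

# Assumed example: the same cycle count on a 3 GHz part takes longer in time.
print(cycles_to_us(WARMUP_CYCLES, 3.0))
```

If the penalty really is a fixed number of cycles, its wall-clock length scales inversely with the core clock; whether it is fixed in cycles or in time on a given CPU is something only a measurement can settle.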
how many non-SIMD instructions can one typically execute before the SIMD performance is lost
Skylake CPUs revert to the idle state after 2.7 million clock cycles (675 μs) have been spent running only instructions with a SIMD width of 16 bytes or less.
For more information, see the microarchitecture guide by Agner Fog.
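The question asks for a number of instructions, while the figure above is in clock cycles. The two are related through the instructions-per-cycle rate (IPC) of the narrow code, which depends on the workload. A rough Python sketch, assuming the Skylake figure above and hypothetical IPC values chosen only for illustration:

```python
# Skylake figure quoted above: cycles of <= 16-byte work before reverting.
REVERT_CYCLES = 2_700_000


def instructions_before_revert(ipc, revert_cycles=REVERT_CYCLES):
    """Approximate instruction count that fits in the revert window.

    `ipc` is the average instructions per cycle of the narrow code;
    it is an assumption supplied by the caller, not a CPU constant.
    """
    return int(ipc * revert_cycles)


def needs_warmup(gap_cycles, revert_cycles=REVERT_CYCLES):
    """True if a gap without wide vectors is long enough to revert."""
    return gap_cycles >= revert_cycles


print(instructions_before_revert(1.0))  # at an assumed IPC of 1
print(instructions_before_revert(2.0))  # at an assumed IPC of 2
print(needs_warmup(1_000_000))          # gap shorter than the window
print(needs_warmup(3_000_000))          # gap longer than the window
```

This is only a model of the stated threshold. It says nothing about ARM64 cores, and it ignores anything else (frequency changes, context switches) that may affect the power state.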