
When is __m128 in an xmm register?


Calling _mm_load_ps returns an __m128. In the Intel intrinsics guide it says:

Load 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

(Editor's note: Use _mm_loadu_ps for a maybe-unaligned load)
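(For reference, a minimal sketch of both load flavours; the function name and the caller-supplied array here are illustrative only:)

    #include <immintrin.h>

    // "data" is a hypothetical caller-supplied float array with at least 5 elements.
    __m128 load_both_ways(const float* data) {
        alignas(16) float tmp[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // guaranteed 16-byte aligned
        __m128 a = _mm_load_ps(tmp);        // alignment-required load
        __m128 b = _mm_loadu_ps(data + 1);  // safe for any alignment
        return _mm_add_ps(a, b);
    }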


Does this mean that the 4 float pack resides in the xmm registers as long as the __m128 is alive? And would this then mean that having more __m128 on the stack than there are xmm registers available would cause spilling?


Solution

  • Does this mean that the 4 float pack resides in the xmm registers as long as the __m128 is alive?

    No. Intrinsics are compiled by the compiler, and vector variables will be subject to register allocation just like any other variable.

    As you note in your second sentence - you can write code with more __m128 variables than you have registers - which would spill to the stack.

    The intrinsics API is designed to let you pretend you're writing in assembly, but load/store intrinsics really just communicate type and alignment information to the compiler.

    (alignof(__m128) == 16, so any spill/reload can use alignment-required instructions. And a reload may even be folded into the consuming instruction as a memory source operand instead of loading the value back into a register.)
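    As a sketch of what that looks like in practice (function and parameter names are made up): with optimization enabled, compilers typically keep the first loaded value in an XMM register and fold the second aligned load straight into the add as a memory operand.

    #include <immintrin.h>

    // Typical codegen is roughly:  movaps xmm0, [a]  /  addps xmm0, [b]
    __m128 add_arrays(const float* a, const float* b) {
        __m128 va = _mm_load_ps(a);             // a must be 16-byte aligned
        return _mm_add_ps(va, _mm_load_ps(b));  // b's load can fold into addps
    }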

    __m128 variables also need to be spilled across a non-inline function call, especially in calling conventions that have no call-preserved XMM registers (e.g. x86-64 System V). Windows x64 does have call-preserved XMM registers (xmm6-xmm15), but xmm0-xmm5 are volatile (call-clobbered), so functions still have a few XMM registers to play with without saving them.
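    A sketch of that situation (opaque() stands in for any external function the compiler can't inline or see into):

    #include <immintrin.h>

    void opaque();  // hypothetical non-inline external function

    float keep_across_call(const float* p) {
        __m128 v = _mm_loadu_ps(p);   // v can live in an XMM register here
        opaque();                     // x86-64 System V: every XMM register is
                                      // call-clobbered, so v must be spilled to
                                      // the stack and reloaded afterwards
        v = _mm_add_ps(v, v);
        return _mm_cvtss_f32(v);      // return the low element
    }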

  • So is it guaranteed that having more __m128 variables than there are registers available causes spilling, and that having fewer will always avoid spilling?

    Compilers try very hard to schedule instructions in an order which reduces spilling. For example, given a pointer foo to an array of ints, you might write some code like this:

    int A = foo[0];
    int B = foo[1];
    int C = foo[2];
    int D = A + B + C;
    

    You might think that this needs 4 registers because you created and assigned 4 variables, but it's highly likely that you end up with something which looks more like this at the machine level:

    int A = foo[0];
    int B = foo[1];
    int D = A + B;   // C's value isn't needed yet, so only two values are live
    A = foo[2];      // A's old value is dead, so its register can be reused
    D = D + A;
    

    i.e. the compiler has reordered this code to minimize the number of physical registers needed.
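    The same applies to __m128 variables. A vector version of the example above (names are illustrative) declares four vectors, but after that kind of reordering the compiler only needs two or three XMM registers for it:

    #include <immintrin.h>

    __m128 sum3(const float* foo) {
        __m128 A = _mm_loadu_ps(foo);
        __m128 B = _mm_loadu_ps(foo + 4);
        __m128 C = _mm_loadu_ps(foo + 8);              // load can be delayed until needed
        __m128 D = _mm_add_ps(_mm_add_ps(A, B), C);    // A and B die here, their
        return D;                                      // registers can be reused
    }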

    In reality it's hard to predict. Compilers aim to keep register pressure down because spilling is expensive, but they don't necessarily drive it to the absolute minimum, because they also want to issue loads early to help hide memory latency.
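    Spilling only becomes unavoidable when more values are simultaneously live than there are architectural registers. A contrived sketch (17 accumulators, chosen just to exceed the 16 XMM registers available without AVX-512):

    #include <immintrin.h>

    // All 17 accumulators are live across every iteration, so on x86-64 without
    // AVX-512 at least one of them has to live in memory, i.e. be spilled and
    // reloaded inside the loop, no matter how the compiler schedules it.
    __m128 many_accumulators(const float* p, int n) {
        __m128 acc[17];
        for (int j = 0; j < 17; ++j) acc[j] = _mm_setzero_ps();
        for (int i = 0; i + 17 * 4 <= n; i += 17 * 4)
            for (int j = 0; j < 17; ++j)
                acc[j] = _mm_add_ps(acc[j], _mm_loadu_ps(p + i + j * 4));
        __m128 total = acc[0];                       // combine the accumulators
        for (int j = 1; j < 17; ++j) total = _mm_add_ps(total, acc[j]);
        return total;
    }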

    In general it's recommended that you disassemble high-performance code paths to make sure the compiler is doing what you expected it to do.