Avoiding AVX-SSE (VEX) Transition Penalties

Our 64-bit application has a lot of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode.

I would like to implement a fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think that this is not practical: it would increase the size of the code and could make it run slower, because the CPU would need to decode larger instructions.

I just want to use two ymm registers (and possibly zmm - the affordable processors supporting zmm are promised to be available this year) for fast memory copy.

The question is: how can I use the ymm registers but avoid the transition penalties?

Will the penalty occur if I use only the ymm8-ymm15 registers (not ymm0-ymm7)? SSE originally had eight 128-bit registers (xmm0-xmm7), but in 64-bit mode xmm8-xmm15 are also available to non-VEX-prefixed instructions. However, I have reviewed our 64-bit application and it only uses xmm0-xmm7, since it also has a 32-bit version with almost the same code.

Does the penalty occur only when the CPU actually tries to use an xmm register that was previously used as a ymm register and has any of its upper 128 bits non-zero? Wouldn't it be better to just zero the ymm registers that I have used after the fast memory copy?

For example, suppose I have used a ymm register once to copy 32 bytes of memory. What is the fastest way to zero it? Is "vpxor ymm15, ymm15, ymm15" fast enough? (AFAIK, vpxor can be executed on any of the 3 ALU execution ports, p0/p1/p5, while vxorpd can only be executed on p5.) Wouldn't the time needed to zero it exceed the gain from using it to copy just 32 bytes of memory?

Solution

Another possibility is to use registers zmm16-zmm31. These registers have no non-VEX counterpart, so there is no state transition and no penalty for mixing zmm16-zmm31 with non-VEX SSE code. These 512-bit registers are only available in 64-bit mode and only on processors with AVX-512.
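To make the zmm16-zmm31 idea concrete, here is a minimal sketch in C with GCC/Clang extended inline assembly (AT&T operand order, so source comes first). The function names copy64_zmm16 and copy_blocks are hypothetical, not taken from any library. The sketch assumes a GCC- or Clang-compatible compiler on x86-64, checks for AVX-512F at run time with __builtin_cpu_supports, and falls back to plain memcpy otherwise. It is meant to illustrate the register choice only, not to be a tuned memcpy.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_ZMM16_SKETCH 1
/* Hypothetical helper: moves one 64-byte block through zmm16.
 * zmm16 has no legacy-SSE alias, so the VEX/SSE "dirty upper"
 * state discussed above is not involved.
 * The target attribute lets this one function use AVX-512F
 * without compiling the whole file with -mavx512f. */
__attribute__((target("avx512f")))
static void copy64_zmm16(void *dst, const void *src)
{
    __asm__ volatile(
        "vmovdqu64 (%[s]), %%zmm16\n\t"   /* unaligned 64-byte load  */
        "vmovdqu64 %%zmm16, (%[d])\n\t"   /* unaligned 64-byte store */
        :
        : [s] "r"(src), [d] "r"(dst)
        : "xmm16", "memory");             /* xmm16 names the clobbered register */
}
#endif

/* Hypothetical wrapper: copies n bytes from src to dst (no overlap).
 * Uses 64-byte zmm16 blocks when AVX-512F is usable, and memcpy
 * for the tail or when AVX-512F is not available. */
void copy_blocks(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

#ifdef HAVE_ZMM16_SKETCH
    if (__builtin_cpu_supports("avx512f")) {
        while (n >= 64) {
            copy64_zmm16(d, s);
            d += 64;
            s += 64;
            n -= 64;
        }
    }
#endif
    if (n > 0)
        memcpy(d, s, n);   /* remaining bytes, or everything on the fallback path */
}
```

Calling copy_blocks(dst, src, 200) would, on an AVX-512F machine, move three 64-byte blocks through zmm16 and leave the last 8 bytes to memcpy. Inline assembly is used here because intrinsics leave register allocation to the compiler, which may pick zmm0-zmm15.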