When can a register that an AVX instruction uses as a source be reused after the instruction starts executing?

For example: I want to use the `vgatherdps` instruction, which consumes two ymm registers, one of which holds the displacement indices. I noticed that `vgatherdps` takes a long time to gather when the data has poor locality.

Will the index register be held for the whole execution of the instruction, or can I reuse it in a following instruction without stalling the pipeline?
All x86 CPUs with AVX do out-of-order execution with register renaming to hide Write-After-Write and Write-After-Read hazards. See:

- Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (the part about hazards and register renaming near the top of my answer)
- Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs, which similarly explains that this is a non-problem
- How many CPU cycles are needed for each assembly instruction? - dependency chains are what matter for performance; after register renaming, only RAW (read-after-write) true dependencies matter
You never have to worry about a write-only access to a register stalling because a slow instruction is still reading or writing the previous value. (Out-of-order exec has its limits, and the number of physical register-file entries is one of them, but that's a separate factor from WAR / WAW hazards.)

The whole point of register renaming is to make new (independent) uses of the same register perform like they're using a different register, allowing the CPU to exploit instruction-level parallelism.
For example, `vmovdqa ymm2, [rdi]` doesn't care about previous instructions reading or writing ymm2 (or its xmm2 low half); `vmovdqa`'s destination is always write-only.
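To make that concrete, here's a minimal sketch (NASM syntax; the registers, the pointers, and the slow `vdivps` are hypothetical placeholders) of back-to-back reuse of ymm2 that renaming lets the CPU overlap:

```asm
vdivps  ymm2, ymm3, ymm4   ; slow dep chain writes ymm2
vmovups [rsi], ymm2        ; ...and a store reads it
vmovdqa ymm2, [rdi]        ; write-only: renamed to a fresh physical register,
                           ;  so it doesn't wait for the divide or the store
vpaddd  ymm2, ymm2, ymm6   ; new independent dep chain, overlaps with vdivps
```

Out-of-order exec can run the load and the `vpaddd` while the divide is still in flight; only the store has to wait for the divide's result.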
Since you mention gathers: `vgatherdps` itself is not write-only on its destination; it merges according to the mask vector. So if you gather into the same register repeatedly in a loop (say ymm0), you might want to `vpxor xmm0, xmm0, xmm0` first to break the dependency.
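Here's a minimal sketch of such a loop (NASM syntax; the pointer registers, trip count, and index layout are hypothetical, gathering 8 unmasked floats per iteration):

```asm
.loop:
    vmovdqu    ymm5, [rsi]               ; this iteration's dword indices
    vpcmpeqd   ymm1, ymm1, ymm1          ; all-ones mask; needed every iteration
                                         ;  because vgatherdps zeroes it as it runs
    vpxor      xmm0, xmm0, xmm0          ; zero ymm0: breaks the merge dependency
    vgatherdps ymm0, [rdi+ymm5*4], ymm1  ; gather 8 floats
    vmovups    [rdx], ymm0               ; consume / store the result
    add        rsi, 32
    add        rdx, 32
    dec        ecx
    jnz        .loop
```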
But you may not need to break it; on Intel CPUs, the actual loads of the gather elements can start even if the read-write destination register isn't "ready" yet as an input. https://uops.info/ measured the latency from operand 1 to operand 1 on Skylake as 1 cycle. (At least when the mask is all-ones; that could possibly be special-cased for the non-faulting case.)
So `vgatherdps ymm0, [rdi+ymm5*4], ymm1` can write ymm0 in the cycle after ymm0 becomes ready (if ymm5 and ymm1, and the pointed-to memory, were ready 22 cycles earlier). (Gather throughput is worse than that; they measure it by using a chain of instructions like 10x `vshufpd ymm0, ymm0, ymm0, 0`, as you can see in Experiments 2 and 3 in that link.)
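For reference, here's a hypothetical reconstruction of the idea behind those experiments (NASM syntax, not uops.info's actual test code): pad the loop-carried chain with a known amount of extra latency so the dep chain, not gather throughput, is the bottleneck, then subtract it back out.

```asm
.loop:
    vgatherdps ymm0, [r14+ymm2*4], ymm1  ; merge dest ymm0 is also an input (1 -> 1)
    vpcmpeqd   ymm1, ymm1, ymm1          ; re-create the all-ones mask
%rep 10
    vshufpd    ymm0, ymm0, ymm0, 0       ; 10 x 1c of known latency on ymm0
%endrep
    dec        ecx
    jnz        .loop
; If the loop runs at ~11 cycles per iteration, the gather's 1 -> 1 latency is
; ~1 cycle, even though back-to-back independent gathers are throughput-limited.
```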
However, things aren't so great on Zen 3, for example: `vgatherdps ymm` there has a latency from operand 1 -> 1 of 8 cycles. But that's still a lot shorter than the 28-cycle latency from index vector ready to destination vector ready (operand 2 -> 1).
(For normal gathers with the mask vector set to all-ones, you'd use `vpcmpeqd ymm1, ymm1, ymm1`. It's recognized as independent of the previous value, like an xor-zeroing idiom, so it does count as write-only even though the instruction looks like it would actually read and compare. That means you're already breaking the dep chain involving the mask vector. Interestingly, on Skylake there's 0-cycle latency from the mask input to the output if you intentionally avoid breaking that dependency; see the 3 -> 1 section on the uops.info Skylake latency page. Presumably gathers handle the mask like vpxor-zeroing, only doing something different if there's a page fault (or other fault) on an element.)