assembly optimization sse micro-optimization sse2

Latency for 'pcmpeqb' - memory vs xmm register

I have these two options:

Option 1:

    loop:
        ...
        movdqu   xmm0, [rax]
        pcmpeqb  xmm0, [.zero_table]
        ...
        ...

    align 16
    .zero_table:
        DQ 0, 0

Option 2:

    pxor xmm1, xmm1
    loop:
        ...
        movdqu   xmm0, [rax]
        pcmpeqb  xmm0, xmm1
        ...
        ...

Since this is inside a loop, and I think memory operands have a higher latency cost, I'm asking: which option is better and has the lower latency cost?

Solution

The 2nd option is pretty obviously better: fewer unfused-domain uops in the loop. So out-of-order exec can run ahead and not need as many physical registers or load buffers (or whatever exactly holds those load results until the ALU uop reads them). You almost always want to hoist constants out of loops. It's worth the 1 extra uop and small L1i / uop-cache footprint of the extra instruction.
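
To make the hoisting concrete, here is a minimal sketch in C with SSE2 intrinsics rather than hand-written asm. The question elides what the loop actually does, so the zero-byte search, the function name `first_zero_byte`, and the requirement that `len` be a multiple of 16 are all assumptions for illustration. The comments name the instruction each intrinsic normally maps to, but the compiler is free to pick different code.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: index of the first zero byte in buf, or len if none.
   Assumes len is a multiple of 16; buf itself may be unaligned. */
static size_t first_zero_byte(const unsigned char *buf, size_t len)
{
    assert(len % 16 == 0);

    /* Zero constant created once, outside the loop (pxor xmm1, xmm1). */
    const __m128i zero = _mm_setzero_si128();

    for (size_t i = 0; i < len; i += 16) {
        /* Unaligned 16-byte load (movdqu xmm0, [rax]). */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));

        /* Register-register compare (pcmpeqb xmm0, xmm1), then pmovmskb. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));

        if (mask != 0) {
            /* Position of the lowest set bit = offset of the first zero byte. */
            size_t bit = 0;
            while ((mask & 1u) == 0) {
                mask >>= 1;
                bit++;
            }
            return i + bit;
        }
    }
    return len;
}
```

The only per-iteration memory access here is the load of the data itself; the zero vector stays in a register for the whole loop, which is what option 2 does by hand.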

(Nehalem and earlier Intel (P6-family) have register-read stalls if you read too many "cold" registers in one issue group of instructions, but that's only 10-year-old Intel CPUs, not AMD and not more recent Intel.)

pcmpeqb xmm, [mem] is 1 fused-domain uop (with that addressing mode) for the ROB, but takes two RS entries (just like a separate load and then pcmpeqb reg,reg). Of course the constant load has no input dependencies so can execute right away, but obviously costs cache read and load throughput resources.

The only question would be if this wasn't inside a loop.

A micro-fused ALU + load still only has the regular ALU uop latency from its register input to its register output. Out-of-order exec can do the load as early as it wants because the address has no dependencies. https://uops.info/ has detailed data on this.

But if rax (the pointer) might not be ready right away, then yes load-use latency becomes part of the critical path. (Address-generation takes time.)

BTW, the first option is bad; zero XMM registers with xorps or pxor xmm0,xmm0, not by loading a constant.

    xorps    xmm0, xmm0    ; as cheap as a NOP on Sandybridge-family, or one ALU uop on Zen
    pcmpeqb  xmm0, [rax]   ; requires alignment unless you can use vpcmpeqb

Outside a loop I guess you could possibly consider using all-zeros as a memory source operand if you were sure that the front-end was always a bottleneck and that your constant would very rarely cache-miss. Then you could keep it down to 2 instructions total even with an unaligned [rax]. But that costs data-cache footprint on something you could have generated with a 3-byte or 4-byte instruction.

But if you did have some other constant that took more than 1 or 2 instructions to create on the fly, I can't think of any real reason why it would be better to load the constant first or to deref the register first. Both RIP-relative and [register] addressing modes can stay micro-fused in the back-end on Sandybridge-family. Of course, without AVX the memory operand for pcmpeqb has to be 16-byte aligned, so this may force your hand if you want to save front-end bandwidth by folding one load into a memory source operand for the ALU op.

    movdqu  xmm0, [rax]
    pcmpeqb xmm0, [rel some_constant]
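
The same pattern can be sketched in C with SSE2 intrinsics, assuming a hypothetical 16-byte constant of spaces (the contents of `some_constant` and the helper name `match_mask` are made up for illustration). The data pointer gets an unaligned load, while the constant is declared 16-byte aligned so an aligned load is valid and the compiler is allowed to fold it into a memory source operand. Whether it actually does so is up to the compiler.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical constant: 16 spaces, 16-byte aligned so it is a legal
   memory operand for non-AVX pcmpeqb. */
static const _Alignas(16) unsigned char some_constant[16] = {
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
    ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '
};

/* Returns a 16-bit mask: bit i is set if byte i at p equals byte i of
   some_constant. p may be unaligned. */
static int match_mask(const void *p)
{
    assert((uintptr_t)some_constant % 16 == 0);

    /* Unaligned load of the data (movdqu xmm0, [rax]). */
    __m128i data = _mm_loadu_si128((const __m128i *)p);

    /* Aligned load of the constant; a candidate for folding into
       pcmpeqb xmm0, [rel some_constant]. */
    __m128i k = _mm_load_si128((const __m128i *)some_constant);

    return _mm_movemask_epi8(_mm_cmpeq_epi8(data, k));
}
```

Swapping the roles (aligned data pointer, constant loaded into a register first) would be equally valid when the data is known to be 16-byte aligned.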