Is uops.info wrong about vinserti128?

According to uops.info, the reciprocal throughput of vinserti128 is 0.5 if the xmm argument comes from memory, and 1 if the xmm argument is a register. What's the underlying reason behind this? Is it a mistake?

I don't understand why ports 0 and 1 are only an option if the xmm argument is not a register. Is there something in hardware that can explain this?

Solution

Uops.info automated testing doesn't have typos

If there's anything surprising, it's either a real effect or an artifact of exactly how they tested. You can click on any number to see the detailed test results including the instruction sequences used to test throughput and ports. In this case, yes, the memory-source version is different, and the test results make sense.

As chtz commented, the register-source version works like a shuffle, but the memory-source version is probably decoding as a broadcast-load + blend.

We know that Intel load ports can broadcast to 256-bit for free since Haswell (4, 8, or 16-byte chunks). With a memory source, vbroadcastss/sd and vpbroadcastd/q are single-uop, and so are vbroadcasti128 / vbroadcastf128. Even vmovddup ymm, [mem] is a single load-port uop on Intel (port 2 or 3), with no ALU uop required for duplicating the low half of each 16-byte lane. (Filter on vbroadcast ymm in https://uops.info/)

For the blend part, Haswell and later could reuse the vpblendd execution unit by constructing a control operand for it from the vinserti128 immediate. That's a single uop for any vector ALU port (p015).
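To make the "broadcast-load + blend" idea concrete, here is a minimal Python sketch of the architectural semantics only. It models a YMM register as a list of eight dwords and checks that vinserti128 from memory gives the same result as vbroadcasti128 followed by vpblendd. The helper names and the immediate-to-blend-control mapping (`blend_control`) are my own illustration of what the hardware might do internally, not anything documented by Intel.

```python
# A YMM register is modeled as a list of eight 32-bit elements (index 0 = lowest dword).

def vinserti128(src, m128, imm8):
    """Architectural result of vinserti128 ymm, ymm, xmm/m128, imm8."""
    out = list(src)
    lane = imm8 & 1                      # only bit 0 of the immediate selects the lane
    out[4 * lane:4 * lane + 4] = m128    # overwrite that 128-bit lane
    return out

def vbroadcasti128(m128):
    """vbroadcasti128 ymm, m128: the same 16 bytes in both lanes."""
    return list(m128) + list(m128)

def vpblendd(a, b, imm8):
    """vpblendd ymm, a, b, imm8: bit i set -> dword i from b, else from a."""
    return [b[i] if (imm8 >> i) & 1 else a[i] for i in range(8)]

def blend_control(insert_imm8):
    # Hypothetical mapping: take all four dwords of the selected lane from
    # the broadcast value, and keep the other lane from the original source.
    return 0xF0 if insert_imm8 & 1 else 0x0F

def vinserti128_as_broadcast_blend(src, m128, imm8):
    return vpblendd(src, vbroadcasti128(m128), blend_control(imm8))

src = [10, 11, 12, 13, 14, 15, 16, 17]
mem = [90, 91, 92, 93]
for imm in (0, 1):
    assert vinserti128(src, mem, imm) == vinserti128_as_broadcast_blend(src, mem, imm)
```

This says nothing about ports or timing. It only shows that the decomposition is semantically valid, which is why a load-port broadcast plus a p015 blend is a plausible way to decode the memory-source form.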

To avoid bypass latency between the integer and FP domains, vinsertf128 could similarly use a vblendps or vblendpd execution unit.

Sandy/Ivy Bridge could only broadcast for free to 128-bit XMM; it needed a port5 shuffle for vbroadcastss/sd/f128 ymm, mem (and didn't have register-source broadcasts; those were new in AVX2). But it ran vinsertf128 ymm, ymm, mem as 1*p05 + 1*p23, vs. p5 for register-source vinsertf128 or vperm2f128. (vinserti128 is an AVX2 instruction, so it doesn't exist on those CPUs.)

SnB/IvB CPUs ran vblendps/pd on p05, so presumably vinsertf128 from memory is using the FP blend units on those ports. (It wasn't until Haswell that port 1 could also run vblendps.)

So presumably Sandy/Ivy Bridge had some special support for forwarding a 128-bit load result to either the low or high lane of an execution unit that could do the merging. (Perhaps with the blend units designed to accept that?)

If it could forward to both lanes at once, I'd have assumed it could do that in general, and wouldn't have needed a port 5 uop on top of the load uop for vbroadcastf128 ymm, mem. Unless it matters that, for the insert, the intermediate broadcast result never needs to be architecturally visible: it's never written to a physical YMM register, where it could persist indefinitely if the register isn't overwritten for a long time. Combined with a blend control input generated from the immediate, the blend unit's input can be a bit funky; only its output needs to be fully normal and able to become part of persistent state.