arm simd sve

ARM SVE: svld1(mask, ptr) vs svldff1(svptrue<>, ptr)


In ARM SVE there are masked load instructions (svld1), and there are also first-faulting loads, svldff1(svptrue<>).

Questions:

  • Does it make sense to do svld1 with a mask as opposed to svldff1?
  • The behaviour of the mask in svldff1 seems confusing. Is there a practical reason to pass svldff1 a mask other than svptrue?
  • Is there any performance difference between svld1 and svldff1?

Solution

  • Both ldff1 and ld1 can be used to load a vector register. According to my informal tests on an AWS Graviton processor, there is no performance difference: both instructions (ldff1 and ld1) seem to have roughly the same performance characteristics. However, ldff1 reads and writes the first-fault register (FFR). This implies that you cannot have more than one ldff1 in flight at any one time within an 'FFR group', since they are order-sensitive and depend crucially on the FFR.
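For contrast with the FFR dependency described above, here is a minimal sketch of a masked svld1 loop over an array of known length. It never touches the FFR, because the svwhilelt predicate already keeps the load inside the buffer. The function name sum_i32 is hypothetical, and the scalar branch is only an assumption added so the block compiles on machines without SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: sum n 32-bit integers using predicated svld1 loads. */
int64_t sum_i32(const int32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    int64_t total = 0;
#if defined(__ARM_FEATURE_SVE)
    for (uint64_t i = 0; i < n; i += svcntw()) {
        /* Lanes i .. n-1 are active; lanes past the end are inactive. */
        svbool_t pg = svwhilelt_b32_u64(i, (uint64_t)n);
        /* Inactive lanes are not read from memory and come back as zero. */
        svint32_t v = svld1_s32(pg, data + i);
        /* Horizontal add per iteration: simple, not the fastest form. */
        total += svaddv_s32(pg, v);
    }
#else
    /* Scalar fallback (assumption: only here for non-SVE builds). */
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
#endif
    return total;
}
```

When the trip count is known up front like this, the mask alone is enough and no svsetffr/svrdffr bookkeeping is required.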

    Furthermore, the ldff1 instruction is meant to be used along with the rdffr instruction, which generates a mask indicating which loads were successful. Using the rdffr instruction will obviously add some cost. I am assuming that rdffr needs to run after the ldff1 it reports on, thus increasing the latency by at least a cycle. Of course, you then have to do something with the mask that rdffr produces...
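The ldff1 / rdffr pairing described above can be sketched with the ACLE intrinsics svsetffr, svldff1 and svrdffr. The example below is a strlen-style scan, a typical case where the length is not known in advance, so a plain svld1 over a full vector could fault past the end of the string. The function name strlen_ldff1 is hypothetical, and the scalar branch is an assumption added only so the block compiles without SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: string length via first-faulting loads. */
size_t strlen_ldff1(const char *s) {
    assert(s != NULL);
#if defined(__ARM_FEATURE_SVE)
    const uint8_t *p = (const uint8_t *)s;
    const svbool_t all = svptrue_b8();
    size_t i = 0;
    for (;;) {
        svsetffr();                               /* reset FFR to all-true */
        svuint8_t v = svldff1_u8(all, p + i);     /* first-faulting load */
        svbool_t ok = svrdffr();                  /* lanes that really loaded */
        svbool_t is_nul = svcmpeq_n_u8(ok, v, 0); /* test valid lanes only */
        if (svptest_any(ok, is_nul)) {
            /* Lanes strictly before the first NUL byte. */
            svbool_t before = svbrkb_b_z(ok, is_nul);
            return i + svcntp_b8(ok, before);
        }
        /* Advance by the number of lanes that were actually loaded. */
        i += svcntp_b8(all, ok);
    }
#else
    /* Scalar fallback (assumption: only here for non-SVE builds). */
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
#endif
}
```

Each iteration pays for one svsetffr and one svrdffr on top of the load, which is the extra FFR traffic the answer refers to.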

    Obviously, there is bound to be some small overhead tied to the FFR (clearing, setting, accessing).

    On "Is there a practical reason to pass svldff1 a mask other than svptrue": the documentation states that the leading inactive elements (up to the fault) are set to zero.
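One plausible use for a non-svptrue predicate with svldff1, offered here as an illustration and not something the answer above claims, is a bounded scan such as strnlen: the predicate caps the lanes at a maximum length, while the FFR still guards against faults before that limit. The function name strnlen_ldff1 is hypothetical, and the scalar branch is an assumption added only so the block compiles without SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: bounded string length with a predicated svldff1. */
size_t strnlen_ldff1(const char *s, size_t maxlen) {
    assert(s != NULL || maxlen == 0);
#if defined(__ARM_FEATURE_SVE)
    const uint8_t *p = (const uint8_t *)s;
    size_t i = 0;
    while (i < maxlen) {
        /* Only lanes below maxlen are active. */
        svbool_t pg = svwhilelt_b8_u64((uint64_t)i, (uint64_t)maxlen);
        svsetffr();
        svuint8_t v = svldff1_u8(pg, p + i);      /* inactive lanes read as 0 */
        svbool_t ok = svrdffr_z(pg);              /* FFR restricted to pg */
        svbool_t is_nul = svcmpeq_n_u8(ok, v, 0);
        if (svptest_any(ok, is_nul)) {
            svbool_t before = svbrkb_b_z(ok, is_nul);
            return i + svcntp_b8(ok, before);
        }
        i += svcntp_b8(ok, ok);                   /* lanes actually consumed */
    }
    return i;
#else
    /* Scalar fallback (assumption: only here for non-SVE builds). */
    size_t n = 0;
    while (n < maxlen && s[n] != '\0') {
        n++;
    }
    return n;
#endif
}
```

Note that the NUL test is predicated on the lanes that were actually loaded, so the zeros placed in inactive lanes are never mistaken for a terminator.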