
Neon 64bit aarch64: confusion about ld4r


I'm confused by the new ld4r instruction in aarch64.

The following code (loading the same four 32-bit values into each of v20-v23):

ld1 { v20.4s }, [out1]
ld1 { v21.4s }, [out1]
ld1 { v22.4s }, [out1]
ld1 { v23.4s }, [out1]

seems to be equivalent to the following code:

ld1 { v20.4s }, [out1]
mov v21.16b, v20.16b
mov v22.16b, v20.16b
mov v23.16b, v20.16b

but it doesn't seem to be equivalent to the following line:

ld4r { v20.4s, v21.4s, v22.4s, v23.4s }, [out1]

Am I misreading the ld4r instruction? Isn't it supposed to replicate over 4 lanes?


Solution

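  • ld4r loads a single 4-element structure (here, four consecutive 32-bit values) and replicates each element across all lanes of its own register. With a, b, c, d in memory, the result is v20 = {a, a, a, a}, v21 = {b, b, b, b}, v22 = {c, c, c, c}, v23 = {d, d, d, d}. It does not copy the whole 4-lane vector into each register, which is what the ld1 and ld1/mov sequences do, so the two are not equivalent.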
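
To make the difference concrete, here is a minimal scalar sketch in C that models the lane layout of both instructions for the .4s arrangement. It is not NEON code: the type `vreg4s` and the functions `model_ld1_4s` and `model_ld4r_4s` are hypothetical names invented for this illustration, and the sketch only mirrors which memory element ends up in which lane.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit vector register viewed as
   four 32-bit lanes (the .4s arrangement). */
typedef struct { uint32_t lane[4]; } vreg4s;

/* Models: ld1 { vD.4s }, [mem]
   Lanes 0..3 receive mem[0..3] unchanged. */
static vreg4s model_ld1_4s(const uint32_t *mem) {
    vreg4s v;
    memcpy(v.lane, mem, sizeof v.lane);
    return v;
}

/* Models: ld4r { vA.4s, vB.4s, vC.4s, vD.4s }, [mem]
   Reads one 4-element structure mem[0..3]; element r is broadcast
   to every lane of the r-th destination register. */
static void model_ld4r_4s(const uint32_t *mem, vreg4s out[4]) {
    for (int r = 0; r < 4; r++) {
        for (int l = 0; l < 4; l++) {
            out[r].lane[l] = mem[r];
        }
    }
}
```

With memory holding {10, 20, 30, 40}, the ld1 model gives every register {10, 20, 30, 40}, while the ld4r model gives {10, 10, 10, 10}, {20, 20, 20, 20}, {30, 30, 30, 30} and {40, 40, 40, 40}. In real C code, the intrinsic corresponding to this ld4r form is vld4q_dup_u32 from arm_neon.h.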