I'm confused by the new ld4r instruction in AArch64.
The following code (loading the same four 32-bit values into each of v20-v23):
ld1 { v20.4s }, [out1]
ld1 { v21.4s }, [out1]
ld1 { v22.4s }, [out1]
ld1 { v23.4s }, [out1]
seems to be equivalent to the following code:
ld1 { v20.4s }, [out1]
mov v21.16b, v20.16b
mov v22.16b, v20.16b
mov v23.16b, v20.16b
but it doesn't seem to be equivalent to the following line:
ld4r { v20.4s, v21.4s, v22.4s, v23.4s }, [out1]
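Am I misreading the ld4r instruction? Isn't it supposed to replicate a 4-lane vector into all four registers?
It seems that ld4r loads only a single 4-element structure (four consecutive 32-bit values here) and replicates each element across every lane of its own register: the first element fills all lanes of v20, the second fills all lanes of v21, and so on. It does not copy lane 0-3 of memory into lane 0-3 of each register.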
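
To make the difference concrete, here is a rough scalar model in C of what I believe the two variants do. This is only a sketch of my understanding, not real NEON code: the type vec4s and the functions model_ld1_4s and model_ld4r_4s are names I made up for illustration, and the memory contents are arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one 128-bit register viewed as .4s
   (four 32-bit lanes). Not a real NEON type. */
typedef struct { uint32_t lane[4]; } vec4s;

/* Model of: ld1 { vN.4s }, [mem]
   Copies four consecutive 32-bit values into lanes 0..3 of one register. */
vec4s model_ld1_4s(const uint32_t *mem) {
    vec4s v;
    memcpy(v.lane, mem, sizeof v.lane);
    return v;
}

/* Model of: ld4r { vA.4s, vB.4s, vC.4s, vD.4s }, [mem]
   Reads ONE 4-element structure (four 32-bit values) and broadcasts
   element i to every lane of register i. */
void model_ld4r_4s(const uint32_t *mem, vec4s out[4]) {
    for (int reg = 0; reg < 4; ++reg)
        for (int l = 0; l < 4; ++l)
            out[reg].lane[l] = mem[reg];
}

/* Small self-check with arbitrary example data. */
void check_example(void) {
    const uint32_t mem[4] = { 10, 20, 30, 40 };

    /* Four ld1 loads from the same address: every register holds the
       same lanes 10, 20, 30, 40. */
    vec4s a = model_ld1_4s(mem);
    assert(a.lane[0] == 10 && a.lane[1] == 20);
    assert(a.lane[2] == 30 && a.lane[3] == 40);

    /* ld4r: register i holds element i in all four lanes. */
    vec4s r[4];
    model_ld4r_4s(mem, r);
    assert(r[0].lane[0] == 10 && r[0].lane[3] == 10);
    assert(r[1].lane[0] == 20 && r[1].lane[3] == 20);
    assert(r[3].lane[0] == 40 && r[3].lane[3] == 40);
}
```

So with memory holding a, b, c, d, the four ld1 loads (or ld1 plus three mov) leave each of v20-v23 as a, b, c, d, while ld4r leaves v20 as a, a, a, a, v21 as b, b, b, b, and so on. Both read the same 16 bytes, but the register contents differ unless a, b, c and d happen to be equal.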