NEON 64-bit AArch64: confusion about ld4r

I'm confused by the new ld4r instruction in AArch64.

The following code loads the same four 32-bit values into each of v20 through v23:

ld1 { v20.4s }, [out1]
ld1 { v21.4s }, [out1]
ld1 { v22.4s }, [out1]
ld1 { v23.4s }, [out1]

seems to be equivalent to the following code:

ld1 { v20.4s }, [out1]
mov v21.16b, v20.16b
mov v22.16b, v20.16b
mov v23.16b, v20.16b

but it doesn't seem to be equivalent to the following line:

ld4r { v20.4s, v21.4s, v22.4s, v23.4s }, [out1]

Am I misreading the ld4r instruction? Isn't it supposed to replicate over 4 lanes?

Solution

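ld4r loads only a single 4-element structure (here, four consecutive 32-bit values) and replicates each element across all lanes of its own register: the first element fills every lane of v20, the second fills every lane of v21, and so on. It does not copy the same four-lane vector into each of the four registers, so it is not equivalent to the four ld1 loads above.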
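To make the difference concrete, here is a minimal scalar sketch in portable C that models the two loads. It is not NEON code and does not run the real instructions. The names vec4s, model_ld1_4s and model_ld4r_4s are hypothetical helpers invented for this illustration, and the model assumes the .4s arrangement (four 32-bit lanes per register) used in the question.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one 128-bit register seen as four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4s;

/* Models: ld1 { vN.4s }, [src]
   Four consecutive words from memory fill lanes 0..3 of one register. */
static vec4s model_ld1_4s(const uint32_t *src) {
    vec4s v;
    memcpy(v.lane, src, sizeof v.lane);
    return v;
}

/* Models: ld4r { vA.4s, vB.4s, vC.4s, vD.4s }, [src]
   Only four words are read. Word i is broadcast to every lane of out[i]. */
static void model_ld4r_4s(const uint32_t *src, vec4s out[4]) {
    for (int reg = 0; reg < 4; ++reg) {
        for (int l = 0; l < 4; ++l) {
            out[reg].lane[l] = src[reg];
        }
    }
}
```

With memory holding the words a, b, c, d, model_ld1_4s gives a register containing {a, b, c, d}, which is what each of v20 through v23 holds after the four ld1 loads. model_ld4r_4s instead gives {a, a, a, a}, {b, b, b, b}, {c, c, c, c} and {d, d, d, d}, which matches the behaviour described above.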
