How can I load global data into NEON registers more efficiently in Go's assembler?

As an example, here is a global named p256one defined in arm64 assembly code:

DATA p256one<>+0x00(SB)/8, $0x0000000000000001
DATA p256one<>+0x08(SB)/8, $0xffffffff00000000
DATA p256one<>+0x10(SB)/8, $0xffffffffffffffff
DATA p256one<>+0x18(SB)/8, $0x00000000fffffffe

GLOBL p256one<>(SB), 8, $32

I need to load p256one<>(SB) into the V0 and V1 registers. Currently I use the following method:

    LDP p256one<>+0x00(SB), (R0, R1)
    LDP p256one<>+0x10(SB), (R2, R3)
    VMOV R0, V0.D[0]
    VMOV R1, V0.D[1]
    VMOV R2, V1.D[0]
    VMOV R3, V1.D[1]

That takes six instructions in total. Data at an address held in a register can be loaded with a single instruction:

    VLD1 (R0), [V0.B16, V1.B16]

But it seems global data can't be loaded the same way, directly from the symbol.
So, is there a more efficient way to load global data into NEON registers in Go assembly code?


  • Try loading the address of the global into a register first, then load from that address:

        MOVD $p256one<>(SB), R0
        VLD1 (R0), [V0.B16, V1.B16]
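
    For context, here is a minimal sketch of how this might look inside a complete routine. The function name loadP256One and its out parameter are hypothetical, made up for illustration. The sketch assumes a matching Go declaration func loadP256One(out *[4]uint64) in the same package, and that the p256one DATA/GLOBL definitions from the question are in the same .s file:

        #include "textflag.h"

        // func loadP256One(out *[4]uint64)   (hypothetical helper, not from the question)
        TEXT ·loadP256One(SB), NOSPLIT, $0-8
            MOVD out+0(FP), R1               // R1 = destination pointer
            MOVD $p256one<>(SB), R0          // R0 = address of the global
            VLD1 (R0), [V0.B16, V1.B16]      // V0 = bytes 0-15, V1 = bytes 16-31
            VST1 [V0.B16, V1.B16], (R1)      // write both registers to *out so Go code can inspect them
            RET

    On a little-endian arm64 target this should leave V0 and V1 with the same contents as the LDP/VMOV sequence in the question, using one MOVD and one VLD1 in place of six instructions.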