Tags: assembly, simd, arm64, neon, apple-silicon

Accumulate vector using Neon and print to stdout (assembly)


I'm finding it surprisingly difficult to find good, complete examples of assembly running on Apple Silicon, specifically for SIMD-type operations, rather than incomplete, overly-generic snippets.

For my own curiosity, I want to write an example on an M2 machine that...

  1. Takes a series of numbers (built into the assembly file, to begin with)
  2. Accumulates them into one result (using SIMD instructions)
  3. Outputs the result to stdout

I have the following source code, in a file named test.s...

.global _start
.align 2

_start:
    ;;; Load numbers into x0
    ldr x0, numbers

    ;;; Load elements from array in x0 into dedicated Neon register
    ld1 { v0.4s }, [x0]

    ;;; Accumulate elements in vector using dedicated Neon instruction
    addv s0, v0.4s 

    ;;; Prepare formatted string
    adrp    x0, format@page
    add x0, x0, format@pageoff

    ;;; Add result to the stack for printing
    str s0, [sp, #-16]!

    ;;; Print string
    bl _printf
    mov x16, #1
    svc 0

numbers: .word 1, 2, 3, 4
format: .asciz "Answer: %u.\n"

..., assembled and linked using the following commands...

as -g -arch arm64 -o test.o test.s
ld -o test test.o -lSystem -syslibroot `xcrun -sdk macosx --show-sdk-path` -e _start -arch arm64

I'd have expected the answer to be 10 when I run the programme, but I get anything but.

What is it I'm not doing correctly?


Solution

  • ldr x0, numbers loads from the address labeled numbers into x0 (which only works because numbers happens to be at a sufficiently nearby address to the instruction, in the same section). So x0 ends up holding the data stored there, not the address of numbers: the first two words, 1 and 2, read as a single little-endian 64-bit value, 0x0000000200000001. The subsequent ld1 then treats that value as an address, and the memory access will likely crash.

    You should put the address of numbers into x0 with an adrp/add sequence just like you do with format further down.

    Also, st1 should be ld1, as you already mentioned.

    Changing these lines to

        adrp    x0, numbers@page
        add x0, x0, numbers@pageoff
        ld1 { v0.4s }, [x0]
    

    makes the program print the correct value 10 for me.