I'm finding it surprisingly difficult to find good, complete examples of assembly running on Apple Silicon, specifically for SIMD-type operations, rather than incomplete, overly-generic snippets.
For my own curiosity, I want to write an example on an M2 machine that...
I have the following source code, in a file named test.s
...
.global _start
.align 2
_start:
;;; Load numbers into x0
ldr x0, numbers
;;; Load elements from array in x0 into dedicated Neon register
ld1 { v0.4s }, [x0]
;;; Accumulate elements in vector using dedicated Neon instruction
addv s0, v0.4s
;;; Prepare formatted string
adrp x0, format@page
add x0, x0, format@pageoff
;;; Add result to the stack for printing
str s0, [sp, #-16]!
;;; Print string
bl _printf
mov x16, #1
svc 0
numbers: .word 1, 2, 3, 4
format: .asciz "Answer: %u.\n"
..., assembled and linked using the following commands...
as -g -arch arm64 -o test.o test.s
ld -o test test.o -lSystem -syslibroot `xcrun -sdk macosx --show-sdk-path` -e _start -arch arm64
I'd have expected the answer to be 10 when I run the programme, but I get anything but.
What is it I'm not doing correctly?
ldr x0, numbers is a PC-relative literal load: it loads the 8 bytes stored at the address labeled numbers into x0 (which only works because numbers happens to be at a sufficiently nearby address to the instruction, in the same section). So the value in x0 will not be the address of numbers, but rather the data stored there. You'll end up with x0 containing the value 0x0000000200000001 (the first two words, 1 and 2, read as one little-endian 64-bit value), and the subsequent memory access through x0 will likely crash.
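To see where that value comes from, here is a minimal C sketch of what the literal load does, assuming a little-endian target (as Apple Silicon is). The helper name load_doubleword is hypothetical and not part of the original program. It copies the first 8 bytes of the array into a 64-bit integer, which is what ends up in x0; the array's address is a different number entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same data as the `numbers` label in test.s. */
const uint32_t numbers[4] = {1, 2, 3, 4};

/* Hypothetical helper modelling `ldr x0, numbers`: read the 8 bytes
   stored at `p` as one 64-bit value, the way a 64-bit load would. */
uint64_t load_doubleword(const void *p)
{
    uint64_t value;
    assert(p != NULL);
    memcpy(&value, p, sizeof value); /* byte copy, no aliasing issues */
    return value;
}
```

On a little-endian machine, load_doubleword(numbers) yields 0x0000000200000001: the word 1 lands in the low half and the word 2 in the high half.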
You should put the address of numbers into x0 with an adrp/add sequence, just like you do with format further down.
Also, st1 should be ld1, as you already mentioned.
Changing these lines to
adrp x0, numbers@page
add x0, x0, numbers@pageoff
ld1 { v0.4s }, [x0]
makes the program print the correct value 10 for me.
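As a cross-check on the expected result, the following C sketch models what addv s0, v0.4s computes: the sum of the four 32-bit lanes, wrapping modulo 2^32. The function name addv_4s is hypothetical, and this is plain scalar C rather than the NEON instruction or intrinsic itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of `addv s0, v0.4s`: add the four 32-bit
   lanes together; unsigned arithmetic wraps modulo 2^32. */
uint32_t addv_4s(const uint32_t lanes[4])
{
    uint32_t sum = 0;
    assert(lanes != NULL);
    for (int i = 0; i < 4; i++) {
        sum += lanes[i];
    }
    return sum;
}
```

For the array in the question (1, 2, 3, 4), addv_4s returns 10, which matches the output of the fixed program.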