// http://infocenter.arm.com/help/topic/com.arm.doc.dai0425/DAI0425_migrating_an_application_from_ARMv5_to_ARMv7_AR.pdf
// p. 4-21
.macro mul_col_f32 res_q, col0_d, col1_d
vmul.f32 \res_q, q8, \col0_d[0] @ multiply col element 0 by matrix col 0
vmla.f32 \res_q, q9, \col0_d[1] @ multiply-acc col element 1 by matrix col 1
vmla.f32 \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2
vmla.f32 \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3
.endm
// http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100748_0606_00_en/lmi1470147220260.html
// http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0203j/Cacjfjei.html
.globl mat44mulneon
.p2align 2 // what's this ?
.type mat44mulneon,%function
mat44mulneon:
.fnstart // not recognized by eclipse syntax coloring?
// ---------
vld1.32 {d16-d19}, [r1]! @ load first eight elements of matrix 0
vld1.32 {d20-d23}, [r1]! @ load second eight elements of matrix 0
vld1.32 {d0-d3}, [r2]! @ load first eight elements of matrix 1.
vld1.32 {d4-d7}, [r2]! @ load second eight elements of matrix 1.
mul_col_f32 q12, d0, d1 @ matrix 0 * matrix 1 col 0
mul_col_f32 q13, d2, d3 @ matrix 0 * matrix 1 col 1
mul_col_f32 q14, d4, d5 @ matrix 0 * matrix 1 col 2
mul_col_f32 q15, d6, d7 @ matrix 0 * matrix 1 col 3
vst1.32 {d24-d27}, [r0]! @ store first eight elements of result.
vst1.32 {d28-d31}, [r0]! @ store second eight elements of result.
// ---------
bx lr // Return by branching to the address in the link register.
.fnend
The code above, which I found on an ARM site (see the links in the comments), works on my ARM Cortex-A9 machine, i.e. an ARMv7 one.
I am now trying to get this to run on an ARMv8 / aarch64 CPU. I have found this slide: porting to ARM64
At the very end, it shows matrix multiplication code. But it uses loops, and I guess (correct me if I'm not seeing this right) that the code I posted would be faster if ported to the new ARMv8 mnemonics. The linked document also shows some v7 -> v8 changes, so I changed e.g. vmul.f32 to fmul and so forth. However, the register names given in its example do not match the ones in the code posted above. Since I am not exactly fluent in any ARM asm, I don't know what the equivalents are here. E.g. when I build my project, I get an error like:
operand 1 must be a SIMD vector register list -- `st1 {d24-d27},[r0]
I'm not sure that would be the only problem, though, so I'd rather ask: what changes need to be made to the code for it to run on the aarch64 machine?
Here's a rough AArch64 version of that routine:
.macro mul_col_f32 res, col
fmul \res, v16.4s, \col[0] // multiply col element 0 by matrix col 0
fmla \res, v17.4s, \col[1] // multiply-acc col element 1 by matrix col 1
fmla \res, v18.4s, \col[2] // multiply-acc col element 2 by matrix col 2
fmla \res, v19.4s, \col[3] // multiply-acc col element 3 by matrix col 3
.endm
.globl mat44mulneon
mat44mulneon:
ld1 {v16.4s, v17.4s, v18.4s, v19.4s}, [x1] // load all 16 elements of matrix 0
ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x2] // load all 16 elements of matrix 1
mul_col_f32 v24.4s, v0.s // matrix 0 * matrix 1 col 0
mul_col_f32 v25.4s, v1.s // matrix 0 * matrix 1 col 1
mul_col_f32 v26.4s, v2.s // matrix 0 * matrix 1 col 2
mul_col_f32 v27.4s, v3.s // matrix 0 * matrix 1 col 3
st1 {v24.4s, v25.4s, v26.4s, v27.4s}, [x0] // store all 16 elements of the result
ret // return to the address in the link register (x30)
A few non-comprehensive notes about the conversion, in addition to general things mentioned in the linked presentation:
- A single ld1 or st1 instruction can load or store up to 64 bytes (four 128-bit registers), compared to 32 bytes with vld1 in AArch32. This avoids the need for incrementing the r0/r1/r2 or x0/x1/x2 pointers.
- I left out .fnstart, .fnend and .type; they can be re-added in the same places as in the original version if wanted.
- @ isn't a comment character any longer.
- The col parameter to mul_col_f32 is in the form v0.s, rather than v0.4s. When selecting a specific lane, after concatenating with the [0] suffix in the macro, one should omit the number of lanes, e.g. to pick the first lane of the v0.4s register, it should be written as v0.s[0]. The GNU assembler does allow v0.4s[0], but other assemblers (both the Clang/LLVM built-in assembler and Microsoft's armasm64) only allow the former syntax.
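To check that the ported routine still computes the same thing, it may help to compare it against a plain scalar version. The sketch below is not from the ARM document; mat44mul_ref is a hypothetical name, and the argument order (result, matrix 0, matrix 1) is assumed from the x0/x1/x2 usage in the assembly. It treats both matrices as column-major, which is what mul_col_f32 does: each result column is the sum of the matrix 0 columns scaled by the four elements of the corresponding matrix 1 column.

```c
/* Scalar reference for the 4x4 column-major product res = m0 * m1.
 * Hypothetical helper for testing; argument order is assumed to mirror
 * the assembly routine (x0 = result, x1 = matrix 0, x2 = matrix 1). */
void mat44mul_ref(float *res, const float *m0, const float *m1)
{
    for (int col = 0; col < 4; col++) {          /* one mul_col_f32 per column */
        for (int row = 0; row < 4; row++) {      /* one SIMD lane per row */
            float acc = 0.0f;
            for (int k = 0; k < 4; k++) {
                /* column k of m0, scaled by element k of column col of m1 */
                acc += m0[k * 4 + row] * m1[col * 4 + k];
            }
            res[col * 4 + row] = acc;
        }
    }
}
```

Since fmla is a fused multiply-add, the NEON result can differ from this version in the last bits, so a comparison with a small tolerance is probably safer than an exact one.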