I have an ARM NEON function that takes a C struct as its argument. The struct contains two float* pointers and a fixed-size float[] array. I can access the elements pointed to by the float* members in my assembly function, but when I try to access the elements of the array, my program crashes.
Here is the C side of my application:
typedef struct{
float* f1;
float* f2;
float f3[4];
}P_STRUCT;
main.c file:
extern void myNeonFunc(P_STRUCT* p, float* res);
P_STRUCT p;
// memory allocation for f1,f2 and fill array f3 here.
// memory allocation for res
myNeonFunc(&p, res);
And here is my .S file:
.text
.set P_STRUCT_F1, 0 @ float* f1
.set P_STRUCT_F2, 4 @ float* f2
.set P_STRUCT_F3, 8 @ float f3[4]
.globl myNeonFunc
@ void myNeonFunc(P_STRUCT* p --> r0, float* res --> r1)
.balign 64 @ align the function to 64 bytes
myNeonFunc:
@save callee-save registers here
ldr r8, [r0,P_STRUCT_F1] @ r8 <- f1
add r8, r8, #8 @ r8 points to f1[2] (2*4 = 8)
ldr r9, [r0,P_STRUCT_F2] @ r9 <- f2
add r9, r9, #4 @ r9 points to f2[1] (1*4 = 4)
ldr r10, [r0,P_STRUCT_F3] @ r10 <- f3
add r10, r10, #4 @ r10 points to f3[1] (1*4 = 4)
vld1.f32 {d4}, [r8]! @ d4 <- f1[2], f1[3]
vld1.f32 {d6}, [r9]! @ d6 <- f2[1], f2[2]
vst1.32 {d4}, [r1]! @ store f1[2], f1[3] to res
vst1.32 {d6}, [r1]! @ store f2[1], f2[2] to res
// everything is OK up to here
// this line probably causes the segfault
vld1.f32 {d8}, [r10]! @ d8 <- two floats at r10 (intended: f3[1], f3[2])
//
vst1.32 {d8}, [r1]! @ store f3[1], f3[2] to res
// epilog part here ...
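As a cross-check, here is a minimal plain-C sketch of what the assembly above is meant to compute. It assumes that each vld1.f32/vst1.32 on a single D register moves two consecutive floats, that f1 has at least 4 elements, f2 at least 3, and that res has room for 6 floats. The name myNeonFunc_ref is hypothetical and not part of the original code. Note how c is computed differently from a and b.

```c
#include <assert.h>

typedef struct {
    float* f1;
    float* f2;
    float  f3[4];
} P_STRUCT;

/* Hypothetical plain-C reference for myNeonFunc (not the original code).
 * Each vld1.f32 {dN} / vst1.32 {dN} pair copies two floats (64 bits). */
void myNeonFunc_ref(const P_STRUCT* p, float* res)
{
    assert(p && res);

    const float* a = p->f1 + 2;  /* r8:  pointer value stored in *p, + 8 bytes */
    const float* b = p->f2 + 1;  /* r9:  pointer value stored in *p, + 4 bytes */
    const float* c = p->f3 + 1;  /* r10: address inside *p itself,   + 4 bytes */

    res[0] = a[0]; res[1] = a[1];  /* vst1.32 {d4}, [r1]! */
    res[2] = b[0]; res[3] = b[1];  /* vst1.32 {d6}, [r1]! */
    res[4] = c[0]; res[5] = c[1];  /* vst1.32 {d8}, [r1]! */
}
```

With f1 = {0, 1, 2, 3}, f2 = {10, 11, 12, 13} and f3 = {20, 21, 22, 23}, res should end up as {2, 3, 11, 12, 21, 22}, which gives you something to compare the NEON output against.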
I suspect the problem is that r10 does not point to the address of the f3 array.
So my question is: why does accessing the fixed-size array cause a problem here while accessing the pointer elements works, and what is the solution?
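A pointer is not the same thing as an array. f1 and f2 are 4-byte pointers in the struct: the struct stores their addresses, so loading from the struct gives you those addresses. f3 is a 16-byte array stored inline in the struct, occupying offsets 8 through 23. The struct as a whole is 24 bytes long (on 32-bit ARM).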
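One way to keep the hard-coded .set offsets honest is to check the layout from C at compile time. This is a minimal C11 sketch, assuming GCC or Clang (which define __arm__ only for 32-bit ARM); the P_STRUCT_* enum simply mirrors the .set values from the .S file. The portable assertions express the layout in terms of sizeof(float*), so on common ABIs they also hold on 64-bit targets, where the offsets become 0, 8 and 16 and the struct is 32 bytes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

typedef struct {
    float* f1;
    float* f2;
    float  f3[4];
} P_STRUCT;

/* Mirrors the .set constants in the .S file (32-bit ARM values). */
enum {
    P_STRUCT_F1 = 0,
    P_STRUCT_F2 = 4,
    P_STRUCT_F3 = 8
};

/* Layout on common ABIs: two pointers, then the 16-byte array stored inline. */
static_assert(offsetof(P_STRUCT, f1) == 0, "f1 must come first");
static_assert(offsetof(P_STRUCT, f2) == sizeof(float*), "f2 follows f1");
static_assert(offsetof(P_STRUCT, f3) == 2 * sizeof(float*), "f3 follows f2");
static_assert(sizeof(((P_STRUCT*)0)->f3) == 4 * sizeof(float), "f3 is 16 bytes");
static_assert(sizeof(P_STRUCT) == 2 * sizeof(float*) + 4 * sizeof(float),
              "no padding expected");

#if defined(__arm__)  /* 32-bit ARM: the hard-coded offsets must match */
static_assert(offsetof(P_STRUCT, f1) == P_STRUCT_F1, "P_STRUCT_F1 is wrong");
static_assert(offsetof(P_STRUCT, f2) == P_STRUCT_F2, "P_STRUCT_F2 is wrong");
static_assert(offsetof(P_STRUCT, f3) == P_STRUCT_F3, "P_STRUCT_F3 is wrong");
static_assert(sizeof(P_STRUCT) == 24, "P_STRUCT should be 24 bytes");
#endif
```

Because these are compile-time checks, placing them in main.c (or any file that sees P_STRUCT) turns an offset mismatch into a build error rather than a runtime crash.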
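What you are loading into r10 is not the address of f3 but the value of its first element, f3[0]. The ldr reads 4 bytes from r0 + P_STRUCT_F3; for f1 and f2 that is exactly what you want, because those 4 bytes are a pointer, but for f3 they are float data. Using that bit pattern as an address in vld1 is what crashes. If you want r10 to hold &f3[0], compute r0 + P_STRUCT_F3 instead of loading from it:
add r10, r0, #P_STRUCT_F3 @ r10 <- &f3[0]
Your existing add r10, r10, #4 then makes r10 point to f3[1].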
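To see why the ldr works for f1 and f2 but not for f3, here is a small C model of the two instructions. It is a sketch, assuming IEEE-754 floats and 32-bit words; load_word and member_address are hypothetical helpers standing in for ldr r10, [r0, #P_STRUCT_F3] and add r10, r0, #P_STRUCT_F3 respectively. With f3[0] = 1.0f, the "address" produced by the ldr is 0x3F800000, the bit pattern of 1.0f, which does not point to anything in your program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    float* f1;
    float* f2;
    float  f3[4];
} P_STRUCT;

/* Models "ldr rX, [r0, #off]": read the 32-bit word stored at p + off. */
uint32_t load_word(const P_STRUCT* p, size_t off)
{
    uint32_t w;
    memcpy(&w, (const unsigned char*)p + off, sizeof w);
    return w;
}

/* Models "add rX, r0, #off": compute the address p + off, read nothing. */
const float* member_address(const P_STRUCT* p, size_t off)
{
    return (const float*)((const unsigned char*)p + off);
}
```

On 32-bit ARM, load_word(&p, P_STRUCT_F1) returns the f1 pointer itself, which is why the first two loads in the question behave; for f3 only member_address gives a usable pointer.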