Having r1, r3 and r4 of type uint32x4_t loaded into NEON registers, I have the following code:
r3 = veorq_u32(r0,r3);
r4 = r1;
r1 = vandq_u32(r1,r3);
r4 = veorq_u32(r4,r2);
r1 = veorq_u32(r1,r0);
I was wondering whether GCC actually translates r4 = r1 into a vmov instruction. Looking at the disassembled code, I wasn't surprised that it didn't. (Moreover, I can't figure out what the generated assembly code actually does.)
Skimming through ARM's NEON intrinsics reference I couldn't find any simple vector->vector assignment intrinsic.
What's the easiest way to achieve this? I'm not sure what the inline assembly would look like, since I don't know which registers r1 and r4 were assigned to by vld1q_u32. I don't need an actual swap, just an assignment.
C has a concept of an abstract machine. Assignments and other operations are described in terms of this abstract machine. The assignment r4 = r1; says to assign r4 the value of r1 in the abstract machine.
When the compiler generates instructions for a program, it generally does not exactly mimic everything that occurs in the abstract machine. It translates the operations that occur in the abstract machine into processor instructions that get the same results. The compiler will skip things like move instructions if it can figure out that it can get the same results without them.
In particular, the compiler might not keep r1 in the same place every time. It might load it from memory into some register R7 the first time you need it. But then it might implement your statement r1 = vandq_u32(r1,r3); by putting the result in R8 while keeping the original value of r1 in R7. Then, when you later have r4 = veorq_u32(r4,r2);, the compiler can use the value in R7, because it still contains the value that r4 would have (from the r4 = r1; statement) in the abstract machine.
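To make this concrete, here is a minimal sketch that writes the sequence from the question twice: once with the explicit copy, and once the way a compiler is free to treat it, reading the old value of r1 directly. The function names with_copy and without_copy, the array layout, and the scalar stand-ins used when NEON is not available are my own assumptions for illustration; only the intrinsics come from the question.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Hypothetical scalar stand-ins (not part of any real header) so that
   the sketch also builds on machines without NEON. */
typedef struct { uint32_t lane[4]; } uint32x4_t;
static inline uint32x4_t vld1q_u32(const uint32_t *p) {
    uint32x4_t v;
    for (int i = 0; i < 4; i++) v.lane[i] = p[i];
    return v;
}
static inline void vst1q_u32(uint32_t *p, uint32x4_t v) {
    for (int i = 0; i < 4; i++) p[i] = v.lane[i];
}
static inline uint32x4_t veorq_u32(uint32x4_t a, uint32x4_t b) {
    for (int i = 0; i < 4; i++) a.lane[i] ^= b.lane[i];
    return a;
}
static inline uint32x4_t vandq_u32(uint32x4_t a, uint32x4_t b) {
    for (int i = 0; i < 4; i++) a.lane[i] &= b.lane[i];
    return a;
}
#endif

/* in holds r0, r1, r2, r3 (4 lanes each); out receives r1, r3, r4. */
void with_copy(const uint32_t in[16], uint32_t out[12]) {
    uint32x4_t r0 = vld1q_u32(in),     r1 = vld1q_u32(in + 4);
    uint32x4_t r2 = vld1q_u32(in + 8), r3 = vld1q_u32(in + 12);
    uint32x4_t r4;

    r3 = veorq_u32(r0, r3);
    r4 = r1;                 /* plain C assignment, as in the question */
    r1 = vandq_u32(r1, r3);
    r4 = veorq_u32(r4, r2);
    r1 = veorq_u32(r1, r0);

    vst1q_u32(out, r1);
    vst1q_u32(out + 4, r3);
    vst1q_u32(out + 8, r4);
}

/* Same results without any copy: the old r1 is simply kept alive and
   the AND result goes to a different variable (think: a different
   register). */
void without_copy(const uint32_t in[16], uint32_t out[12]) {
    uint32x4_t r0 = vld1q_u32(in),     r1 = vld1q_u32(in + 4);
    uint32x4_t r2 = vld1q_u32(in + 8), r3 = vld1q_u32(in + 12);

    uint32x4_t new_r3 = veorq_u32(r0, r3);
    uint32x4_t new_r1 = veorq_u32(vandq_u32(r1, new_r3), r0);
    uint32x4_t new_r4 = veorq_u32(r1, r2);   /* uses the old r1 */

    vst1q_u32(out, new_r1);
    vst1q_u32(out + 4, new_r3);
    vst1q_u32(out + 8, new_r4);
}
```

Both functions store identical values for any input, which is why an optimizing compiler has no obligation to emit a move for r4 = r1. Whether a particular GCC version actually emits one depends on its register allocation, so the disassembly is the only way to know for a given build.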
Even if you explicitly wrote a vmov intrinsic, the compiler might not issue an instruction for it, as long as it issues instructions that get the same result in the end.