I'm playing around with loop unrolling with the following code on an ARM Cortex-A53 processor running in AArch64 state:
void do_something(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
    for (int i = 0; i < array_size; i++)
    {
        a[i] = a[i] + b[i];
        c[i] = a[i] * 2;
    }
}
With the -O1 flag, I got the following assembly:
.L3:
ldrh w3, [x0, x4]
ldrh w5, [x1, x4]
add w3, w3, w5
and w3, w3, 65535
strh w3, [x0, x4]
ubfiz w3, w3, 1, 15
strh w3, [x2, x4]
.LVL2:
add x4, x4, 2
.LVL3:
cmp x4, x6
bne .L3
which finished in 162 ms (the sizes of a, b, and c are large). For simplicity I left out some prologue and epilogue code before the loop, but it is just stack setup and the like.
Then I unrolled the loop, which resulted in code like the following:
void add1_opt1(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
    for (int i = 0; i < array_size/4; i+=4)
    {
        a[i] = a[i] + b[i];
        c[i] = a[i] * 2;
        a[i+1] = a[i+1] + b[i+1];
        c[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] + b[i+2];
        c[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] + b[i+3];
        c[i+3] = a[i+3] * 2;
    }
}
which gives assembly like the following (still with -O1, since with -O0 the compiler was doing something rather silly):
.L7:
ldrh w1, [x0]
ldrh w5, [x3]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0]
ubfiz w1, w1, 1, 15
strh w1, [x2]
ldrh w1, [x0, 2]
ldrh w5, [x3, 2]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0, 2]
ubfiz w1, w1, 1, 15
strh w1, [x2, 2]
ldrh w1, [x0, 4]
ldrh w5, [x3, 4]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0, 4]
ubfiz w1, w1, 1, 15
strh w1, [x2, 4]
ldrh w1, [x0, 6]
ldrh w5, [x3, 6]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0, 6]
ubfiz w1, w1, 1, 15
strh w1, [x2, 6]
.LVL8:
add x4, x4, 4
.LVL9:
add x0, x0, 8
add x3, x3, 8
add x2, x2, 8
cmp x4, x6
bcc .L7
which is almost like copying and pasting the first assembly four times. My question is: why did this piece of code take only 28 ms to run, roughly a 5x speedup? With a simple loop condition like this, I assumed branch prediction should do a pretty good job in both versions, right? And in the second assembly, the stores were also interleaved. So I cannot see how such code could get that much of a speedup.
The problem is here:

for (int i = 0; i < array_size/4; i+=4)

Looping until array_size/4 will do a quarter of the work. It should have been:

for (int i = 0; i < array_size; i+=4)

Then you should see a more explainable speedup of a few percent.
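For reference, here is a sketch of a corrected unrolled version that does the full amount of work and also handles sizes that are not a multiple of 4 with a scalar tail loop (the function name add1_fixed is mine, not from the original code):

```c
#include <stdint.h>
#include <stddef.h>

void add1_fixed(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
    size_t i = 0;
    /* Unrolled body: four elements per iteration, up to the last full group. */
    for (; i + 4 <= array_size; i += 4)
    {
        a[i]   = a[i]   + b[i];   c[i]   = a[i]   * 2;
        a[i+1] = a[i+1] + b[i+1]; c[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] + b[i+2]; c[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] + b[i+3]; c[i+3] = a[i+3] * 2;
    }
    /* Tail loop: the remaining 0-3 elements. */
    for (; i < array_size; i++)
    {
        a[i] = a[i] + b[i];
        c[i] = a[i] * 2;
    }
}
```

Note the condition i + 4 <= array_size rather than i < array_size/4: the former stops at the last full group of four, the latter stops after a quarter of the array.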