I'm playing around with loop unrolling with the following code on an ARM Cortex-A53 processor running in AArch64 state:
void do_something(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
    for (int i = 0; i < array_size; i++)
    {
        a[i] = a[i] + b[i];
        c[i] = a[i] * 2;
    }
}
With the -O1 flag, I got the following assembly:
.L3:
ldrh w3, [x0, x4]
ldrh w5, [x1, x4]
add w3, w3, w5
and w3, w3, 65535
strh w3, [x0, x4]
ubfiz w3, w3, 1, 15
strh w3, [x2, x4]
.LVL2:
add x4, x4, 2
.LVL3:
cmp x4, x6
bne .L3
which finished in 162 ms (the sizes of a, b, and c are large). For simplicity I left out some prologue and epilogue code before the loop, but it is just stack setup and the like.
Then I unrolled the loop, which resulted in code like the following:
void add1_opt1(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
    for (int i = 0; i < array_size/4; i+=4)
    {
        a[i] = a[i] + b[i];
        c[i] = a[i] * 2;
        a[i+1] = a[i+1] + b[i+1];
        c[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] + b[i+2];
        c[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] + b[i+3];
        c[i+3] = a[i+3] * 2;
    }
}
which gives assembly like the following (still with -O1, since with -O0 the compiler was doing something rather silly):
.L7:
ldrh w1, [x0]
ldrh w5, [x3]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0]
ubfiz w1, w1, 1, 15
strh w1, [x2]
ldrh w1, [x0, 2]
ldrh w5, [x3, 2]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0, 2]
ubfiz w1, w1, 1, 15
strh w1, [x2, 2]
ldrh w1, [x0, 4]
ldrh w5, [x3, 4]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0, 4]
ubfiz w1, w1, 1, 15
strh w1, [x2, 4]
ldrh w1, [x0, 6]
ldrh w5, [x3, 6]
add w1, w1, w5
and w1, w1, 65535
strh w1, [x0, 6]
ubfiz w1, w1, 1, 15
strh w1, [x2, 6]
.LVL8:
add x4, x4, 4
.LVL9:
add x0, x0, 8
add x3, x3, 8
add x2, x2, 8
cmp x4, x6
bcc .L7
which is almost like copying and pasting the first assembly four times. My question is: why did this piece of code take only 28 ms to run, roughly a 5x speedup? With a simple loop condition like this, I assumed branch prediction should do a pretty good job in both versions, right? And in the second assembly, the stores were also interleaved. So I cannot see how such code could get that much of a speedup.
The problem is here:

for (int i = 0; i < array_size/4; i+=4)

Looping until array_size/4 will do a quarter of the work. It should have been:

for (int i = 0; i < array_size; i+=4)

Then you should see a more explainable speedup of a few percent.
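For reference, here is a sketch of a corrected unrolled version that does the full amount of work and also handles sizes that are not a multiple of 4 with a scalar tail loop (the function name add1_fixed is mine, not from the original code):

```c
#include <stdint.h>
#include <stddef.h>

void add1_fixed(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
    size_t i = 0;
    /* Unrolled body: four elements per iteration, up to the last full group. */
    for (; i + 4 <= array_size; i += 4)
    {
        a[i]   = a[i]   + b[i];   c[i]   = a[i]   * 2;
        a[i+1] = a[i+1] + b[i+1]; c[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] + b[i+2]; c[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] + b[i+3]; c[i+3] = a[i+3] * 2;
    }
    /* Tail loop: the remaining 0-3 elements. */
    for (; i < array_size; i++)
    {
        a[i] = a[i] + b[i];
        c[i] = a[i] * 2;
    }
}
```

Note the condition i + 4 <= array_size rather than i < array_size/4: the former stops at the last full group of four, the latter stops after a quarter of the array.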