Why does this code break when -O2 or higher is enabled?

I tried to fit an implementation of NSA's SPECK into an 8-bit PIC microcontroller. The free version of the manufacturer's compiler (based on Clang) won't enable optimizations, so I ran out of memory. I tried the "trial" version, which enables -O2, -O3 and -Os (optimize for size). With -Os, it managed to fit my code in the 2K program memory space.

Here's the code:

#include <stdint.h>
#include <string.h>

#define ROR(x, r) ((x >> r) | (x << (32 - r)))
#define ROL(x, r) ((x << r) | (x >> (32 - r)))
#define R(x, y, k) (x = ROR(x, 8), x += y, x ^= k, y = ROL(y, 3), y ^= x)
#define ROUNDS 27

void encrypt_block(uint32_t ct[2],
        uint32_t const pt[2],
        uint32_t const K[4]) {
    uint32_t x = pt[0], y = pt[1];
    uint32_t a = K[0], b = K[1], c = K[2], d = K[3];

    R(y, x, a);
    for (int i = 0; i < ROUNDS - 3; i += 3) {
        R(b, a, i);
        R(y, x, a);
        R(c, a, i + 1);
        R(y, x, a);
        R(d, a, i + 2);
        R(y, x, a);
    }
    R(b, a, ROUNDS - 3);
    R(y, x, a);
    R(c, a, ROUNDS - 2);
    R(y, x, a);

    ct[0] = x;
    ct[1] = y;
}

Unfortunately, when I debug it line by line and compare against the test vectors in the implementation guide (page 32, "15 SPECK64/128 Test Vectors"), the results differ from the expected ones.

Here's a way to call this function:

uint32_t out[2];
uint32_t in[] = { 0x7475432d, 0x3b726574 };
uint32_t key[] = { 0x3020100, 0xb0a0908, 0x13121110, 0x1b1a1918 };

encrypt_block(out, in, key);

assert(out[0] == 0x454e028b);
assert(out[1] == 0x8c6fa548);

The expected value for "out", according to the guide, should be 0x454e028b, 0x8c6fa548. The result I'm getting with -O2 is 0x8FA3FED7 0x53D8CEA8. With -O1, I get 0x454e028b, 0x8c6fa548, which is the correct result.

Step Debugging

The implementation guide includes all the intermediate key schedule and other values, so I stepped through the code line by line, comparing the results to the guide.

The expected results for "x" are: 03020100, 131d0309, bbd80d53, 0d334df3. When I step through the code, the first three values match, but on reaching the 4th result, 0d334df3, the debugger window shows 0d334df0 instead. On the next round, the debugger shows 7FA43578 instead of the expected 7fa43565, and the divergence only gets worse with every iteration.

This only happens when -O2 or greater is enabled. With no optimizations, or with -O1, the code works as expected.
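
To get reference values without the target debugger, the key schedule can be reproduced on a PC with an ordinary host compiler. The following is a minimal sketch, not part of the original code: the names `expand_key` and `print_round_keys` are hypothetical, and it assumes the listed values are the round keys, i.e. the successive values of `a` in `encrypt_block`.

```c
#include <stdint.h>
#include <stdio.h>

#define ROUNDS 27

/* 32-bit rotations; valid for 0 < r < 32. */
static uint32_t ror32(uint32_t x, unsigned r) { return (x >> r) | (x << (32 - r)); }
static uint32_t rol32(uint32_t x, unsigned r) { return (x << r) | (x >> (32 - r)); }

/* Hypothetical helper: fills rk[0..ROUNDS-1] with the round keys,
   i.e. the value of `a` before each key-schedule step. */
void expand_key(uint32_t rk[ROUNDS], uint32_t const K[4]) {
    uint32_t a = K[0];
    uint32_t l[3] = { K[1], K[2], K[3] };   /* b, c, d, used in rotation */

    for (uint32_t i = 0; i < ROUNDS; i++) {
        uint32_t *p = &l[i % 3];
        rk[i] = a;
        /* Same steps as R(b, a, i) in the question's code. */
        *p = ror32(*p, 8);
        *p += a;
        *p ^= i;
        a = rol32(a, 3);
        a ^= *p;
    }
}

/* Hypothetical helper: prints one round key per line for comparison
   with the guide or with the debugger's watch window. */
void print_round_keys(uint32_t const K[4]) {
    uint32_t rk[ROUNDS];
    expand_key(rk, K);
    for (int i = 0; i < ROUNDS; i++)
        printf("rk[%2d] = %08lx\n", i, (unsigned long)rk[i]);
}
```

Calling `print_round_keys(key)` with the key from the example above should print 03020100, 131d0309, bbd80d53, 0d334df3 and 7fa43565 as the first five lines, so the first mismatch on the target can be pinned to a specific round.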

Solution

It was a bug in the compiler.

I posted the question in the manufacturer's forum. Other people have indeed reproduced the issue, which happens when compiling for certain parts. Other parts are unaffected.

As a workaround, I changed the macros into real functions, and split the operation into two lines:

uint32_t ROL(uint32_t x, uint8_t r) {
    uint32_t intermedio;
    intermedio = x << r;
    intermedio |= x >> (32 - r);
    return intermedio;
}

This gives the correct result.
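
For reference, here is a sketch of how the whole routine might look with both rotations and the round written as functions. The names `rol32`, `ror32` and `speck_round` are illustrative and not from the original post. This form computes the same result as the macro version on a correct compiler, but whether it avoids the bug on every affected part has not been verified here.

```c
#include <stdint.h>

#define ROUNDS 27

/* Rotate left, split into two statements as in the workaround.
   Valid for 0 < r < 32. */
static uint32_t rol32(uint32_t x, uint8_t r) {
    uint32_t t;
    t = x << r;
    t |= x >> (32 - r);
    return t;
}

/* Rotate right, written the same way. Valid for 0 < r < 32. */
static uint32_t ror32(uint32_t x, uint8_t r) {
    uint32_t t;
    t = x >> r;
    t |= x << (32 - r);
    return t;
}

/* One SPECK round: the same five steps as the R(x, y, k) macro. */
static void speck_round(uint32_t *x, uint32_t *y, uint32_t k) {
    *x = ror32(*x, 8);
    *x += *y;
    *x ^= k;
    *y = rol32(*y, 3);
    *y ^= *x;
}

void encrypt_block(uint32_t ct[2],
        uint32_t const pt[2],
        uint32_t const K[4]) {
    uint32_t x = pt[0], y = pt[1];
    uint32_t a = K[0], b = K[1], c = K[2], d = K[3];
    uint32_t i;

    /* Key schedule and encryption rounds are interleaved, as before. */
    speck_round(&y, &x, a);
    for (i = 0; i < ROUNDS - 3; i += 3) {
        speck_round(&b, &a, i);
        speck_round(&y, &x, a);
        speck_round(&c, &a, i + 1);
        speck_round(&y, &x, a);
        speck_round(&d, &a, i + 2);
        speck_round(&y, &x, a);
    }
    speck_round(&b, &a, ROUNDS - 3);
    speck_round(&y, &x, a);
    speck_round(&c, &a, ROUNDS - 2);
    speck_round(&y, &x, a);

    ct[0] = x;
    ct[1] = y;
}
```

Function calls cost more program memory than macros on a small part, so it is worth checking that this still fits in the 2K space with -Os.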