c++c++11 optimization compiler-optimization strict-aliasing

Using this pointer causes strange deoptimization in hot loop

I recently came across a strange deoptimization (or rather missed optimization opportunity).

Consider this function for efficient unpacking of arrays of 3-bit integers to 8-bit integers. It unpacks 16 ints in each loop iteration:

void unpack3bit(uint8_t* target, char* source, int size) {
   while(size > 0){
      uint64_t t = *reinterpret_cast<uint64_t*>(source);
      target[0] = t & 0x7;
      target[1] = (t >> 3) & 0x7;
      target[2] = (t >> 6) & 0x7;
      target[3] = (t >> 9) & 0x7;
      target[4] = (t >> 12) & 0x7;
      target[5] = (t >> 15) & 0x7;
      target[6] = (t >> 18) & 0x7;
      target[7] = (t >> 21) & 0x7;
      target[8] = (t >> 24) & 0x7;
      target[9] = (t >> 27) & 0x7;
      target[10] = (t >> 30) & 0x7;
      target[11] = (t >> 33) & 0x7;
      target[12] = (t >> 36) & 0x7;
      target[13] = (t >> 39) & 0x7;
      target[14] = (t >> 42) & 0x7;
      target[15] = (t >> 45) & 0x7;
      source+=6;
      size-=6;
      target+=16;
   }
}

Here is the generated assembly for parts of the code:

 ...
 367:   48 89 c1                mov    rcx,rax
 36a:   48 c1 e9 09             shr    rcx,0x9
 36e:   83 e1 07                and    ecx,0x7
 371:   48 89 4f 18             mov    QWORD PTR [rdi+0x18],rcx
 375:   48 89 c1                mov    rcx,rax
 378:   48 c1 e9 0c             shr    rcx,0xc
 37c:   83 e1 07                and    ecx,0x7
 37f:   48 89 4f 20             mov    QWORD PTR [rdi+0x20],rcx
 383:   48 89 c1                mov    rcx,rax
 386:   48 c1 e9 0f             shr    rcx,0xf
 38a:   83 e1 07                and    ecx,0x7
 38d:   48 89 4f 28             mov    QWORD PTR [rdi+0x28],rcx
 391:   48 89 c1                mov    rcx,rax
 394:   48 c1 e9 12             shr    rcx,0x12
 398:   83 e1 07                and    ecx,0x7
 39b:   48 89 4f 30             mov    QWORD PTR [rdi+0x30],rcx
 ...

It looks quite efficent. Simply a shift right followed by an and, and then a store to the target buffer. But now, look what happens when I change the function to a method in a struct:

struct T{
   uint8_t* target;
   char* source;
   void unpack3bit( int size);
};

void T::unpack3bit(int size) {
        while(size > 0){
           uint64_t t = *reinterpret_cast<uint64_t*>(source);
           target[0] = t & 0x7;
           target[1] = (t >> 3) & 0x7;
           target[2] = (t >> 6) & 0x7;
           target[3] = (t >> 9) & 0x7;
           target[4] = (t >> 12) & 0x7;
           target[5] = (t >> 15) & 0x7;
           target[6] = (t >> 18) & 0x7;
           target[7] = (t >> 21) & 0x7;
           target[8] = (t >> 24) & 0x7;
           target[9] = (t >> 27) & 0x7;
           target[10] = (t >> 30) & 0x7;
           target[11] = (t >> 33) & 0x7;
           target[12] = (t >> 36) & 0x7;
           target[13] = (t >> 39) & 0x7;
           target[14] = (t >> 42) & 0x7;
           target[15] = (t >> 45) & 0x7;
           source+=6;
           size-=6;
           target+=16;
        }
}

I thought the generated assembly should be quite the same, but it isn't. Here is a part of it:

...
 2b3:   48 c1 e9 15             shr    rcx,0x15
 2b7:   83 e1 07                and    ecx,0x7
 2ba:   88 4a 07                mov    BYTE PTR [rdx+0x7],cl
 2bd:   48 89 c1                mov    rcx,rax
 2c0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2c3:   48 c1 e9 18             shr    rcx,0x18
 2c7:   83 e1 07                and    ecx,0x7
 2ca:   88 4a 08                mov    BYTE PTR [rdx+0x8],cl
 2cd:   48 89 c1                mov    rcx,rax
 2d0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2d3:   48 c1 e9 1b             shr    rcx,0x1b
 2d7:   83 e1 07                and    ecx,0x7
 2da:   88 4a 09                mov    BYTE PTR [rdx+0x9],cl
 2dd:   48 89 c1                mov    rcx,rax
 2e0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2e3:   48 c1 e9 1e             shr    rcx,0x1e
 2e7:   83 e1 07                and    ecx,0x7
 2ea:   88 4a 0a                mov    BYTE PTR [rdx+0xa],cl
 2ed:   48 89 c1                mov    rcx,rax
 2f0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 ...

As you see, we introduced an additional redundant load from memory before each shift (mov rdx,QWORD PTR [rdi]). It seems like the target pointer (which is now a member instead of a local variable) has to be always reloaded before storing into it. This slows down the code considerably (around 15% in my measurements).

First I thought maybe the C++ memory model enforces that a member pointer may not be stored in a register but has to be reloaded, but this seemed like an awkward choice, as it would make a lot of viable optimizations impossible. So I was very surprised that the compiler did not store target in a register here.

I tried caching the member pointer myself into a local variable:

void T::unpack3bit(int size) {
    while(size > 0){
       uint64_t t = *reinterpret_cast<uint64_t*>(source);
       uint8_t* target = this->target; // << ptr cached in local variable
       target[0] = t & 0x7;
       target[1] = (t >> 3) & 0x7;
       target[2] = (t >> 6) & 0x7;
       target[3] = (t >> 9) & 0x7;
       target[4] = (t >> 12) & 0x7;
       target[5] = (t >> 15) & 0x7;
       target[6] = (t >> 18) & 0x7;
       target[7] = (t >> 21) & 0x7;
       target[8] = (t >> 24) & 0x7;
       target[9] = (t >> 27) & 0x7;
       target[10] = (t >> 30) & 0x7;
       target[11] = (t >> 33) & 0x7;
       target[12] = (t >> 36) & 0x7;
       target[13] = (t >> 39) & 0x7;
       target[14] = (t >> 42) & 0x7;
       target[15] = (t >> 45) & 0x7;
       source+=6;
       size-=6;
       this->target+=16;
    }
}

This code also yields the "good" assembler without additional stores. So my guess is: The compiler is not allowed to hoist the load of a member pointer of a struct, so such a "hot pointer" should always be stored in a local variable.

So, why is the compiler unable to optimize out these loads?
Is it the C++ memory model that forbids this? Or is it simply a shortcoming of my compiler?
Is my guess correct or what is the exact reason why the optimization can't be performed?

The compiler in use was g++ 4.8.2-19ubuntu1 with -O3 optimization. I also tried clang++ 3.4-1ubuntu3 with similar results: Clang is even able to vectorize the method with the local target pointer. However, using the this->target pointer yields the same result: An extra load of the pointer before each store.

I checked the assembler of some similar methods and the result is the same: It seems that a member of this always has to be reloaded before a store, even if such a load could simply be hoisted outside the loop. I will have to rewrite a lot of code to get rid of these additional stores, mainly by caching the pointer myself into a local variable that is declared above the hot code. But I always thought fiddling with such details as caching a pointer in a local variable would surely qualify for premature optimization in these days where compilers have gotten so clever. But it seems I am wrong here. Caching a member pointer in a hot loop seems to be a necessary manual optimization technique.

Solution

Pointer aliasing seems to be the problem, ironically between this and this->target. The compiler is taking into account the rather obscene possibility that you initialized:

this->target = &this

In that case, writing to this->target[0] would alter the contents of this (and thus, this->target).

The memory aliasing problem is not restricted to the above. In principle, any use of this->target[XX] given an (in)appropriate value of XX might point to this.

I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.