Search code examples
c++ccastingtype-punning

strict aliasing and memory alignment


I have performance critical code and there is a huge function that allocates like 40 arrays of different size on the stack at the beginning of the function. Most of these arrays have to have certain alignment (because these arrays are accessed somewhere else down the chain using cpu instructions that require memory alignment (for Intel and arm CPUs).

Since some versions of gcc simply fail to align stack variables properly (notably for arm code), or even sometimes it says that maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.

So, for each array I need to do something like that to get it aligned properly:

short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));

This way, history is now aligned on 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of code is really cpu intensive and I simply cannot do the same alignment technique for each of the arrays (this alignment mess confuses the optimizer and different register allocation slows down the function big time, for better explanation see explanation at the end of the question).

So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:

struct tmp
{
   short history[HIST_SIZE];
   short history2[2*HIST_SIZE];
   ...
   int energy[320];
   ...
};


char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));

Something like that. Maybe not the most elegant, but it produced really good result and manual inspection of generated assembly proves that generated code is more or less adequate and acceptable. Build system was updated to use newer GCC and suddenly we started to have some artifacts in generated data (e.g. output from validation test suite is not bit exact anymore even in pure C build with disabled asm code). It took long time to debug the issue and it appeared to be related to aliasing rules and newer versions of GCC.

So, how can I get it done? Please, don't waste time trying to explain that it's not standard, not portable, undefined etc (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC as well to fix the issue, but not refactoring the code)... basically, all I want is to apply some black magic spell so that newer GCC produces the functionally same code for this type of code without disabling optimizations?

Edit:

  • I used this code on multiple OSes/compilers, but started to have issues when I switched to newer NDK which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d)
  • I mention 32 byte alignment. If it hurts your eyes, substitute it with any other number that you like, for example 666 if it helps. There is absolutely no point to even mention that most architectures don't need that alignment. If I align 8KB of local arrays on stack, I loose 15 bytes for 16 byte alignment and I loose 31 for 32 byte alignment. I hope it's clear what I mean.
  • I say that there are like 40 arrays on the stack in performance critical code. I probably also need to say that it's a third party old code that has been working well and I don't want to mess with it. No need to say if it's good or bad, no point for that.
  • This code/function has well tested and defined behavior. We have exact numbers of the requirements of that code e.g. it allocates Xkb or RAM, uses Y kb of static tables, and consumes up to Z kb of stack space and it cannot change, since the code won't be changed.
  • By saying that "alignment mess confuses the optimizer" I mean that if I try to align each array separately code optimizer allocates extra registers for the alignment code and performance critical parts of code suddenly don't have enough registers and start trashing to stack instead which results in a slowdown of the code. This behavior was observed on ARM CPUs (I'm not worried about intel at all by the way).
  • By artifacts I meant that the output becomes non-bitexact, there is some noise added. Either because of this type aliasing issue or there is some bug in the compiler that results eventually in wrong output from the function.

    In short, the point of the question... how can I allocate random amount of stack space (using char arrays or alloca, and then align pointer to that stack space and reinterpret this chunk of memory as some structure that has some well defined layout that guarantees alignment of certain variables as long as the structure itself is aligned properly. I'm trying to cast the memory using all kinds of approaches, I move the big stack allocation to a separate function, still I get bad output and stack corruption, I'm really starting to think more and more that this huge function hits some kind of bug in gcc. It's quite strange, that by doing this cast I can't get this thing done no matter what I try. By the way, I disabled all optimizations that require any alignment, it's pure C-style code now, still I get bad results (non-bitexact output and occasional stack corruptions crashes). The simple fix that fixes it all, I write instead of:

    char buf[sizeof(tmp) + 32];
    tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
    

    this code:

    tmp buf;
    tmp * X = &buf;
    

    then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.

    Interesting observation:
    I mentioned that this approach works well and produces expected output:

    tmp buf;
    tmp * X = &buf;
    

    In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:

    struct tmp * to_struct_tmp(void * buffer32)
    {
        return (struct tmp *)buffer32;
    }
    

    Initially, I thought that if I cast alloc'ed memory using to_struct_tmp it will trick gcc to produce results that I expected to get, yet, it still produces invalid output. If I try to modify working code this way:

    tmp buf;
    tmp * X = to_struct_tmp(&buf);
    

    then i get the same bad result! WOW, what else can I say? Perhaps, based on strict-aliasing rule gcc assumes that tmp * X isn't related to tmp buf and removed tmp buf as unused variable right after return from to_struct_tmp? Or does something strange that produces unexpected result. I also tried to inspect generated assembly, however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so, somehow that aliasing rule affects code generation big time.

    Conclusion:
    After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.


  • Solution

  • Just disable alias-based optimization and call it a day

    If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.

    Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)