I have performance critical code and there is a huge function that allocates like 40 arrays of different size on the stack at the beginning of the function. Most of these arrays have to have certain alignment (because these arrays are accessed somewhere else down the chain using cpu instructions that require memory alignment (for Intel and arm CPUs).
Since some versions of gcc simply fail to align stack variables properly (notably for arm code), or even sometimes it says that maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.
So, for each array I need to do something like that to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history
is now aligned on 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of code is really cpu intensive and I simply cannot do the same alignment technique for each of the arrays (this alignment mess confuses the optimizer and different register allocation slows down the function big time, for better explanation see explanation at the end of the question).
So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:
struct tmp
{
short history[HIST_SIZE];
short history2[2*HIST_SIZE];
...
int energy[320];
...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced really good result and manual inspection of generated assembly proves that generated code is more or less adequate and acceptable. Build system was updated to use newer GCC and suddenly we started to have some artifacts in generated data (e.g. output from validation test suite is not bit exact anymore even in pure C build with disabled asm code). It took long time to debug the issue and it appeared to be related to aliasing rules and newer versions of GCC.
So, how can I get it done? Please, don't waste time trying to explain that it's not standard, not portable, undefined etc (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC as well to fix the issue, but not refactoring the code)... basically, all I want is to apply some black magic spell so that newer GCC produces the functionally same code for this type of code without disabling optimizations?
Edit:
In short, the point of the question... how can I allocate random amount of stack space (using char arrays or alloca
, and then align pointer to that stack space and reinterpret this chunk of memory as some structure that has some well defined layout that guarantees alignment of certain variables as long as the structure itself is aligned properly. I'm trying to cast the memory using all kinds of approaches, I move the big stack allocation to a separate function, still I get bad output and stack corruption, I'm really starting to think more and more that this huge function hits some kind of bug in gcc. It's quite strange, that by doing this cast I can't get this thing done no matter what I try. By the way, I disabled all optimizations that require any alignment, it's pure C-style code now, still I get bad results (non-bitexact output and occasional stack corruptions crashes). The simple fix that fixes it all, I write instead of:
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
this code:
tmp buf;
tmp * X = &buf;
then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.
Interesting observation:
I mentioned that this approach works well and produces expected output:
tmp buf;
tmp * X = &buf;
In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
return (struct tmp *)buffer32;
}
Initially, I thought that if I cast alloc'ed memory using to_struct_tmp it will trick gcc to produce results that I expected to get, yet, it still produces invalid output. If I try to modify working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then i get the same bad result! WOW, what else can I say? Perhaps, based on strict-aliasing rule gcc assumes that tmp * X
isn't related to tmp buf
and removed tmp buf
as unused variable right after return from to_struct_tmp? Or does something strange that produces unexpected result. I also tried to inspect generated assembly, however, changing tmp * X = &buf;
to tmp * X = to_struct_tmp(&buf);
produces extremely different code for the function, so, somehow that aliasing rule affects code generation big time.
Conclusion:
After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp
struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.
If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing
will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.
Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing
would solve the problem, not cost much, and they all must have had other fish to fry.)