arrays performance accelerate-framework vdsp

Why does order of array declaration affect performance so much?

I have been tuning a frequency analysis function that uses the Accelerate framework, and the absolute system time has consistently been 225ms per iteration. Then last night I changed the order in which two of the arrays were declared, and suddenly it went down to 202ms. A 10% improvement from just changing the declaration order seems insane. Can someone explain to me why the compiler (which is set to optimize) is not already finding this solution?

Additional info: Before the loop there is some setup of the arrays used in the loop consisting of converting them from integer to float arrays (for Accelerate) and then taking sin and cos of the time array (16 lines long). All of the float arrays (8 arrays x 1000 elements) are declared first in the function (after a sanity check of the parameters). They are always declared the same size (by a constant), because otherwise performance suffered for little shrinkage of the footprint. I tested making them globals, but I think the compiler already figured that out as there is no performance change. The loop is 25 lines long.

---Additions---

Yes, "-Os" is the flag (the default in Xcode anyway: Fastest, Smallest).

(The code below is from memory. Don't try to compile it, because I didn't put in things like the stride (which is 1), etc. However, all of the Accelerate calls are there.)

passed parameters: inttimearray, intamparray, length, scale1, scale2, amp

float trigarray1[maxsize];
float trigarray2[maxsize];
float trigarray3[maxsize];
float trigarray4[maxsize];
float trigarray5[maxsize];
float temparray[maxsize];
float amparray[maxsize];    //these two make the most change
float timearray[maxsize];    //these two make the most change

vDSP_vfltu32(inttimearray,timearray,length); //convert to float array
vDSP_vflt16(intamparray,amparray,length);    //convert to float array

vDSP_vsmul(timearray,scale1,temparray,length);    //scale time and store in temp
vvcosf(temparray,trigarray3,length);     //cos of temparray
vvsinf(temparray,trigarray4,length);     //sin of temparray
vDSP_vneg(trigarray4,trigarray5,length); //negative of trigarray4

vDSP_vsmul(timearray,scale2,temparray,length); //scale time and store in temp
vvcosf(temparray,trigarray1,length);           //cos of temparray
vvsinf(temparray,trigarray2,length);           //sin of temparray

float ysum;
vDSP_sve(amparray,&ysum,length);    //sum of amparray (result is written through a pointer)

float csum, ssum, ccsum, sssum, cssum, ycsum, yssum;

for (i = 0; i<max; i++) {

    vDSP_sve(trigarray1,&csum,length);    //sum of trigarray1
    vDSP_sve(trigarray2,&ssum,length);    //sum of trigarray2

    vDSP_svesq(trigarray1,&ccsum,length); //sum of trigarray1^2
    vDSP_svesq(trigarray2,&sssum,length); //sum of trigarray2^2

    vDSP_vmul(trigarray1,trigarray2,temparray,length); //temp = trig1*trig2
    vDSP_sve(temparray,&cssum,length);                 //sum of temp array
    // 2 more sets of the above 2 lines, for the 2 remaining sums

    amp[i] = (arithmetic of sums);

    //trig identity to increase the sin/cos by a delta frequency
    //vmma is a*b+c*d=result
    vDSP_vmma (trigarray1,trigarray3,trigarray2,trigarray4,temparray,length);
    vDSP_vmma (trigarray2,trigarray3,trigarray1,trigarray5,trigarray2,length);
    memcpy(trigarray1,temparray,length*sizeof(float));
}

---Current Solution---

I've made some changes as follows:

The arrays are all declared aligned and zeroed out (I'll explain next), and maxsize is now a multiple of 16:

__attribute__ ((aligned (16))) float timearray[maxsize] = {0};

I've zeroed out all of the arrays because now, when the length is less than maxsize, I round the length up to the next multiple of 16 so that all of the looped functions operate on widths divisible by 16, without affecting the sums.

The benefits are:

Slight performance boost
The speed is nearly constant regardless of order of array declaration (which is now done right before they are needed, instead of all in a big block)
The speed is also nearly constant for any 16-wide length (i.e. 241 to 256, or 225 to 240...), whereas before, if the length went from 256 to 255, the function would take a 3+% performance hit.
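The rounding scheme described above could look roughly like the following. This is only a minimal sketch in plain C, not the actual Accelerate code: `MAXSIZE`, `round_up_16`, `sum_f`, and `padded_sum` are hypothetical names, and `sum_f` merely stands in for a call such as vDSP_sve.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXSIZE 1008 /* assumed value: a multiple of 16, as in the post */

/* Round n up to the next multiple of 16 (n itself if it already is one). */
static size_t round_up_16(size_t n) {
    return (n + 15u) & ~(size_t)15u;
}

/* Plain-C stand-in for a vDSP_sve-style sum over n elements. */
static float sum_f(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Copy `length` samples into a zeroed, 16-byte-aligned buffer and sum over
 * the padded length. The zero padding leaves the sum unchanged. */
static float padded_sum(const float *src, size_t length) {
    __attribute__((aligned(16))) float buf[MAXSIZE] = {0};
    size_t padded = round_up_16(length);
    assert(padded <= MAXSIZE);
    memcpy(buf, src, length * sizeof(float));
    return sum_f(buf, padded);
}
```

With this scheme a length of 241 and a length of 256 both run over 256 elements, which matches the nearly constant timing within each 16-wide band.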

In the future (possibly with this code, as the analysis requirements are still in flux), I realize I'll need to pay more attention to stack usage and to the alignment and chunk size of the vectors. Unfortunately, for this code, I can't make these arrays static or global, as this function can be called by more than one object at a time.

Solution

The first thing I would suspect is alignment. You may want to experiment with:

__attribute__ ((aligned (16))) float ...[maxsize];

Or make sure that maxsize is a multiple of 16. That could definitely cause a 10% hit if in one configuration you're aligned and in another you're not. Vector operations can be extremely sensitive to this.
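One way to check whether alignment is actually in play is to test the array addresses directly. The sketch below assumes GCC or Clang, which spell the attribute `aligned`; `MAXSIZE`, `is_aligned_16`, and `aligned_local_is_aligned` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAXSIZE 1008 /* assumed value: a multiple of 16 */

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Declares a local array with the attribute and reports whether its
 * address really is 16-byte aligned. */
static int aligned_local_is_aligned(void) {
    __attribute__((aligned(16))) float timearray[MAXSIZE];
    timearray[0] = 0.0f; /* touch the array so it is not optimized away */
    return is_aligned_16(timearray);
}
```

Logging `is_aligned_16` for each array in the fast and the slow declaration order would show whether the two configurations really differ in alignment.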

The next major issue you may have is a huge stack (assuming maxsize is fairly large). ARM can deal with numbers less than 4k much more efficiently than it can deal with numbers larger than 4k (because it can only deal with 12-bit immediate values). So depending on how the compiler has optimized it, pushing amparray way down on the stack could lead to more complicated math to access it.

When small twiddly things lead to big performance changes, I always recommend pulling up the assembly (Product>Generate Output>Assembly) and seeing what changed in the compiler output. I also highly recommend Whirlwind Tour of ARM Assembly to get you started understanding what you're looking at. (Make sure you set the output to "For Archiving" so you see the optimized result.)

You should also do a few more things:

Try rewriting this routine as simple C instead of using Accelerate. Yes, I know Accelerate is always faster, except it's not. All those function calls are quite expensive, and in my experience the compiler can often vectorize simple multiplication and addition better than Accelerate can. This is particularly true if your stride is 1, your vectors are not enormous, and you're on a 1-2 core device like an iPad. The moment you have code that handles a stride (if you don't need a stride), it's more complicated (slower) than the code you would have written by hand. In my experience, Accelerate does seem to be very good at ramps and transcendentals (cosines of big tables for example), but not nearly so good at simple vector and matrix math.
If this code really matters to you, I've found that hand-writing the assembly can definitely out-pace the compiler. I'm not even that good at ARM assembler, and I've been able to beat the compiler by 2x on simple matrix math (and the compiler crushed Accelerate). I'm particularly talking about your loop here that seems to be doing just adds and multiplies. Handwriting the assembly is a pain of course, and you then have to maintain a C version for the assembler, but when it really matters it's really fast.
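As a rough illustration of the "simple C" suggestion, the body of the question's loop might be written as two stride-1 passes like the ones below. This is a sketch under assumptions, not a drop-in replacement: `TrigSums`, `trig_sums`, and `advance_trig` are hypothetical names, and the definitions of `ycsum` and `yssum` (amplitude times cos and amplitude times sin) are guessed from the variable names, since the question elides those lines. Whether it beats the Accelerate version would have to be measured on the target device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the sums named in the question. */
typedef struct {
    float csum, ssum, ccsum, sssum, cssum, ycsum, yssum;
} TrigSums;

/* One pass over the data computes every sum, instead of one library call
 * (and one trip through memory) per sum. */
static TrigSums trig_sums(const float *y, const float *c, const float *s,
                          size_t n) {
    TrigSums r = {0};
    assert(y && c && s);
    for (size_t i = 0; i < n; i++) {
        r.csum  += c[i];
        r.ssum  += s[i];
        r.ccsum += c[i] * c[i];
        r.sssum += s[i] * s[i];
        r.cssum += c[i] * s[i];
        r.ycsum += y[i] * c[i]; /* assumed definition */
        r.yssum += y[i] * s[i]; /* assumed definition */
    }
    return r;
}

/* Step the cos/sin tables in place using the same products as the two
 * vDSP_vmma calls in the question (dc = trigarray3, ds = trigarray4).
 * No temporary array or memcpy is needed. */
static void advance_trig(float *c, float *s, const float *dc, const float *ds,
                         size_t n) {
    for (size_t i = 0; i < n; i++) {
        float c_new = c[i] * dc[i] + s[i] * ds[i];
        s[i] = s[i] * dc[i] - c[i] * ds[i];
        c[i] = c_new;
    }
}
```

The point of fusing the sums is that each element is loaded once per iteration rather than once per call, which is the kind of saving the function-call overhead remark above is about.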