How to reduce the access time in a vector

I'm trying to speed up an algorithm, so I profiled my application with Instruments on iOS. According to the results, almost 75% of the time is spent storing the calculations in a vector.

Does anyone know a better way to store the data without consuming so much CPU time? I suppose it is related to cache access or something like that. The hot lines are marked with comments; each one stores a short in an array of shorts.

short XY[32*32*2];
Mat _XY(bh, bw, CV_16SC2, XY), matA;
Mat dpart(dst, Rect(x, y, bw, bh));

for( y1 = 0; y1 < bh; y1++ )
{
    short* xy = XY + y1*bw*2;
    int X0    = M[0]*x + M[1]*(y + y1) + M[2];
    int Y0    = M[3]*x + M[4]*(y + y1) + M[5];
    float W0  = M[6]*x + M[7]*(y + y1) + M[8];

    M2[2] = X0;
    M2[3] = Y0;

    for(x1=0; x1<bw; x1++)
    {

        float W      = W0 + M[6]*x1;
        W            = 1./W;
        float x12[2] = {x1*W,W};


        matvec2_c(M2,x12,M3);
        short aux    = (M3[0]);
        int aux2     = x1*2;
        xy[aux2]     = aux;          // 60% CPU time
        xy[x1*2+1]   = (M3[1]);      // 11% CPU time
    }
    // ...
}

void matvec2_c(float m[4], float v[2], float d[2])
{
    d[0] = m[0]*v[0] + m[2]*v[1];
    d[1] = m[1]*v[0] + m[3]*v[1];
}

Solution

This was the best I could do:

short XY[32*32*2];
int XYI[32*32*2];
Mat _XY(bh, bw, CV_16SC2, XY), matA;
Mat _XYI(bh, bw, CV_32S, XYI);
Mat dpart(dst, Rect(x, y, bw, bh));

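for( y1 = 0; y1 < bh; y1++ )
{
    // Note: with this stride the rows of XYI overlap (2*bw ints are written
    // per row); that is harmless only because each row is copied out below
    // before the next one is computed.
    int*   xyi = XYI + y1*bw;
    short* xy  = XY + y1*bw*2;

    int X0   = M[0]*x + M[1]*(y + y1) + M[2];
    int Y0   = M[3]*x + M[4]*(y + y1) + M[5];
    float W0 = M[6]*x + M[7]*(y + y1) + M[8];

    M2[2] = X0;
    M2[3] = Y0;

    // First pass: compute into the int buffer
    for( x1 = 0; x1 < bw; x1++ )
    {
        float W      = W0 + M[6]*x1;
        W            = 1./W;
        float x12[2] = {x1*W, W};

        matvec2_c(M2, x12, M3);

        xyi[x1*2]   = (M3[0]);       // 9% CPU time
        xyi[x1*2+1] = (M3[1]);       // 6% CPU time
    }

    // Second pass: narrow int to short
    for( x1 = 0; x1 < bw; x1++ )
    {
        xy[x1*2]   = xyi[x1*2];      // 4% CPU time
        xy[x1*2+1] = xyi[x1*2+1];    // 3% CPU time
    }
    // ...
}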

I just split the part where the code stores the result into two steps: first it writes into an int array, then it copies that into the short array. I suppose the improvement is related to the way the CPU accesses the cache, or maybe to the conversion between the different formats (float to int to short). The algorithm's run time decreased from 93 ms to 78 ms.
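
One thing that may be worth checking: a sampling profiler can charge the cost of slow earlier instructions (here the division and the float-to-integer conversion) to the store that has to wait for their result, so the store itself is not necessarily what is slow. Below is a rough sketch of the same per-row computation done in a single loop with plain scalars and an explicit cast, without the intermediate int buffer. The helper name fill_row and its parameter list are made up for illustration, and it assumes M2[0] == M[0] and M2[1] == M[3], which the code above does not show. I have not measured whether it is faster on iOS.

```cpp
#include <cassert>

// Hypothetical helper, not from the original post: fills one row of the
// interleaved (x, y) map directly, with no intermediate int buffer.
// M is the 3x3 transform in row-major order. Assumes M2[0] == M[0] and
// M2[1] == M[3], which the post does not show.
void fill_row(const float M[9], int x, int y, int y1, int bw, short* xy)
{
    assert(xy != nullptr && bw >= 0);

    // Row-constant terms, truncated to int as in the original code.
    const int   X0 = static_cast<int>(M[0]*x + M[1]*(y + y1) + M[2]);
    const int   Y0 = static_cast<int>(M[3]*x + M[4]*(y + y1) + M[5]);
    const float W0 = M[6]*x + M[7]*(y + y1) + M[8];

    for (int x1 = 0; x1 < bw; ++x1)
    {
        // One division per pixel; this is a likely candidate for the real cost.
        const float W = 1.0f / (W0 + M[6]*x1);

        // Algebraically the same as matvec2_c, kept in scalars.
        const float fx = (M[0]*x1 + X0) * W;
        const float fy = (M[3]*x1 + Y0) * W;

        // Explicit float -> short conversion. Values outside the range of
        // short are not handled here (the original code does not clamp either).
        xy[2*x1]     = static_cast<short>(fx);
        xy[2*x1 + 1] = static_cast<short>(fy);
    }
}
```

Inside the outer loop it would be called as fill_row(M, x, y, y1, bw, XY + y1*bw*2). If the time just moves to another line after a change like this, that would suggest the arithmetic, not the memory access, is the bottleneck.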