c traversal memory-bandwidth

I'm not seeing a performance boost when using a memory-bandwidth-optimised loop order


I was shown an example of a loop that should be slower than the one that follows it:

for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

Compared with this one:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

I then wrote a tool to test this with a number of different index counts, but I am not seeing much of a performance advantage from the second version, and I'm afraid my code has something to do with it...

The loop that should be slower, as it appears in my code:

    for (i = 0; i < val; i++){
        column_sum[i] = 0.0;
        for (j = 0; j < val; j++){
            int index = i * (int)val + j;
            column_sum[i] += p[index];
        }
    }

The code that should be "significantly" faster:

    for (i = 0; i < val; i++) {
        column_sum[i] = 0.0;
    }
    for (j = 0; j < val; j++) {
        for (i = 0; i < val; i++) {
            int index = j * (int)val + i;
            column_sum[i] += p[index];
        }
    }

Data comparison: [image of the timing results, not available]


Solution

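  • I had confused the index values in the loops. Both loops should use int index = j * (int)val + i; with i * (int)val + j, the inner loop of the "slower" version read consecutive elements of p, so it was just as cache-friendly as the faster one.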

    Slower loop:

        for (i = 0; i < val; i++) {
            column_sum[i] = 0.0;
            for (j = 0; j < val; j++){
                int index = j * (int)val + i;
                column_sum[i] += p[index];
            }
        }
    

    Faster loop:

        for (i = 0; i < val; i++) {
            column_sum[i] = 0.0;
        }
        for (j = 0; j < val; j++) {
            for (i = 0; i < val; i++) {
                int index = j * (int)val + i;
                column_sum[i] += p[index];
            }
        }