Implementing caching technique in closest pair algorithm

I am trying to optimize the closest pair brute force algorithm and compare it with the non cached program but I am stuck.

The main problem, is that I get worse performance when I cache the calculations with the for loops its almost cached time = 2 x non-cached time. Its like nothing happens if i change the size of the block... I use a struct point array P for x,y cordinates

Here is the non-cached code:

void compare_points_BF(int *N, point *P){
    int i, j, p1, p2;
    float dx, dy, distance=0, min_dist=inf();
    long calc = 0;

    for (i=0; i<(*N-1) ; i++){
        for (j=i+1; j<*N; j++){
            dx = P[i].x - P[j].x;
            dy = P[i].y - P[j].y;
            //calculate distance of current points
            distance = (dx * dx) + (dy * dy);
            calc++;
            if (distance < min_dist){
                min_dist = distance;
                p1 = i;
                p2 = j;
            }
        }
    }
    printf("%ld calculations\t", calc);
}

And here is the cached version:

    void compare_points_BF(int *N, int *B, point *P){
        int i, j, ib, jb, p1, p2, num_blocks = (*N + (*B-1)) / (*B);
        float dist=0, min_dist=inf();
        long calc=0;
    
        //break array data in N/B blocks
        for (i = 0; i < num_blocks; i++){
            for (j = i; j < num_blocks; j++){
                for (jb = j * (*B); jb < min((j+1) * (*B), *N); jb++){
                    //avoid double comparisons that occur when i block = j block
                    for (i == j ? (ib = jb + 1) : (ib = i*(*B)); ib < min((i+1) * (*B), *N); ib++){
                        calc++;
                        //calculate distance of current points
                        if((dist = (P[ib].x - P[jb].x) * (P[ib].x - P[jb].x) +
                                (P[ib].y - P[jb].y) * (P[ib].y - P[jb].y)) < min_dist){
                            min_dist = dist;
                            p1 = ib;
                            p2 = jb;
                        }
                    }
                }
            }
        }
        printf("%ld calculations\t", calc);
}

For example the output of non-cached program is:

N = 8192     Run time: 0,080 sec
N = 16384    Run time: 0,330 sec
N = 32768    Run time: 1,280 sec
N = 65.536   Run time: 5,290 sec
N = 131.072  Run time: 21,290sec
N = 262.144  Run time: 81,880sec
N = 524.288  Run time: 327,460 sec

But with the cached example I get:

33550336 calculations   Block_size = 128    N = 8192    Run time: 0.402 sec
33550336 calculations   Block_size = 256    N = 8192    Run time: 0.383 sec
33550336 calculations   Block_size = 512    N = 8192    Run time: 0.384 sec
33550336 calculations   Block_size = 1024   N = 8192    Run time: 0.381 sec
33550336 calculations   Block_size = 2048   N = 8192    Run time: 0.398 sec
33550336 calculations   Block_size = 4096   N = 8192    Run time: 0.400 sec
33550336 calculations   Block_size = 8192   N = 8192    Run time: 0.401 sec
33550336 calculations   Block_size = 16384  N = 8192    Run time: 0.383 sec

134209536 calculations  Block_size = 128    N = 16384   Run time: 1.579 sec
134209536 calculations  Block_size = 256    N = 16384   Run time: 1.610 sec
134209536 calculations  Block_size = 512    N = 16384   Run time: 1.630 sec
134209536 calculations  Block_size = 1024   N = 16384   Run time: 1.530 sec
134209536 calculations  Block_size = 2048   N = 16384   Run time: 1.537 sec
134209536 calculations  Block_size = 4096   N = 16384   Run time: 1.562 sec
134209536 calculations  Block_size = 8192   N = 16384   Run time: 1.520 sec
134209536 calculations  Block_size = 16384  N = 16384   Run time: 1.626 sec

536854528 calculations  Block_size = 128    N = 32768   Run time: 6.170 sec
536854528 calculations  Block_size = 256    N = 32768   Run time: 6.207 sec
536854528 calculations  Block_size = 512    N = 32768   Run time: 6.219 sec
536854528 calculations  Block_size = 1024   N = 32768   Run time: 6.131 sec
536854528 calculations  Block_size = 2048   N = 32768   Run time: 6.077 sec
536854528 calculations  Block_size = 4096   N = 32768   Run time: 6.216 sec
536854528 calculations  Block_size = 8192   N = 32768   Run time: 6.130 sec
536854528 calculations  Block_size = 16384  N = 32768   Run time: 6.181 sec

I have checked over and over the code and it seems to be correct. What am I missing here? Does the compiler optimize the code to implement better cache usage than what I am trying to achieve? Thanks in advance!

Solution

Been a long time but just to answer-close the question.

float compare_points_BF(register int N, register int B, point *P, register point *p1, *p2;){

register int i, j, ib, jb, iin, jjn, num_blocks = (N + (B-1)) / B;
register float distance=0, min_dist=FLT_MAX, regx, regy;

    //break array data in N/B blocks
    for (i = 0; i < num_blocks; i++){
        for (j = i; j < num_blocks; j++){
          iin = ( ((i+1)*B) < N ? ((i+1)*B) : N);
          jjn = (((j+1)*B) < N ? ((j+1)*B) : N);
          //reads the moving frame block to compare with the i block
          for (jb = j * B; jb < jjn; jb++){
            //avoid float comparisons that occur when i block = j block
            //Registers Allocated
            regx = P[jb].x;
            regy = P[jb].y;
            for (i==j ? (ib=jb+1):(ib=i*B); ib < iin; ib++){
               //calculate distance of current points
               if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                        (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                min_dist = distance;
                p1 = &P[ib];
                p2 = &P[jb];
               }
             }
           }
         }
    }
    return sqrt(min_dist);
}

This use of cache can speed up to 10%. Here are some results.

Block
Size    Number of elements
        8192    16384   32768   65536   131072  262144  524288  1048576
128     0,079   0,310   1,260   4,960   19,740  78,990  315,661 1.260,862
256     0,079   0,310   1,250   4,940   19,830  78,820  315,410 1.258,402
512     0,080   0,320   1,260   4,920   19,640  78,480  313,851 1.253,141
1024    0,080   0,320   1,250   4,870   19,430  77,540  310,120 1.237,772
2048    0,079   0,310   1,240   4,850   19,340  77,061  308,211 1.229,892
4096    0,079   0,300   1,210   4,890   19,670  78,300  313,310 1.250,572
8192    0,078   0,310   1,210   4,870   19,510  78,110  312,770 1.249,091
16384           0,300   1,200   4,860   19,420  77,870  312,151 1.246,192
32768                   1,190   4,780   19,310  77,460  310,970 1.242,102
65536                           4,760   19,230  77,660  312,191 1.249,872
131072                                  18,972  76,850  310,470 1.246,261
262144                                          76,400  307,521 1.239,402