I am trying to optimize the closest pair brute force algorithm and compare it with the non cached program but I am stuck.
The main problem, is that I get worse performance when I cache the calculations with the for loops its almost cached time = 2 x non-cached time
. Its like nothing happens if i change the size of the block... I use a struct point array P for x,y cordinates
Here is the non-cached code:
void compare_points_BF(int *N, point *P){
int i, j, p1, p2;
float dx, dy, distance=0, min_dist=inf();
long calc = 0;
for (i=0; i<(*N-1) ; i++){
for (j=i+1; j<*N; j++){
dx = P[i].x - P[j].x;
dy = P[i].y - P[j].y;
//calculate distance of current points
distance = (dx * dx) + (dy * dy);
calc++;
if (distance < min_dist){
min_dist = distance;
p1 = i;
p2 = j;
}
}
}
printf("%ld calculations\t", calc);
}
And here is the cached version:
void compare_points_BF(int *N, int *B, point *P){
int i, j, ib, jb, p1, p2, num_blocks = (*N + (*B-1)) / (*B);
float dist=0, min_dist=inf();
long calc=0;
//break array data in N/B blocks
for (i = 0; i < num_blocks; i++){
for (j = i; j < num_blocks; j++){
for (jb = j * (*B); jb < min((j+1) * (*B), *N); jb++){
//avoid double comparisons that occur when i block = j block
for (i == j ? (ib = jb + 1) : (ib = i*(*B)); ib < min((i+1) * (*B), *N); ib++){
calc++;
//calculate distance of current points
if((dist = (P[ib].x - P[jb].x) * (P[ib].x - P[jb].x) +
(P[ib].y - P[jb].y) * (P[ib].y - P[jb].y)) < min_dist){
min_dist = dist;
p1 = ib;
p2 = jb;
}
}
}
}
}
printf("%ld calculations\t", calc);
}
For example the output of non-cached program is:
N = 8192 Run time: 0,080 sec
N = 16384 Run time: 0,330 sec
N = 32768 Run time: 1,280 sec
N = 65.536 Run time: 5,290 sec
N = 131.072 Run time: 21,290sec
N = 262.144 Run time: 81,880sec
N = 524.288 Run time: 327,460 sec
But with the cached example I get:
33550336 calculations Block_size = 128 N = 8192 Run time: 0.402 sec
33550336 calculations Block_size = 256 N = 8192 Run time: 0.383 sec
33550336 calculations Block_size = 512 N = 8192 Run time: 0.384 sec
33550336 calculations Block_size = 1024 N = 8192 Run time: 0.381 sec
33550336 calculations Block_size = 2048 N = 8192 Run time: 0.398 sec
33550336 calculations Block_size = 4096 N = 8192 Run time: 0.400 sec
33550336 calculations Block_size = 8192 N = 8192 Run time: 0.401 sec
33550336 calculations Block_size = 16384 N = 8192 Run time: 0.383 sec
134209536 calculations Block_size = 128 N = 16384 Run time: 1.579 sec
134209536 calculations Block_size = 256 N = 16384 Run time: 1.610 sec
134209536 calculations Block_size = 512 N = 16384 Run time: 1.630 sec
134209536 calculations Block_size = 1024 N = 16384 Run time: 1.530 sec
134209536 calculations Block_size = 2048 N = 16384 Run time: 1.537 sec
134209536 calculations Block_size = 4096 N = 16384 Run time: 1.562 sec
134209536 calculations Block_size = 8192 N = 16384 Run time: 1.520 sec
134209536 calculations Block_size = 16384 N = 16384 Run time: 1.626 sec
536854528 calculations Block_size = 128 N = 32768 Run time: 6.170 sec
536854528 calculations Block_size = 256 N = 32768 Run time: 6.207 sec
536854528 calculations Block_size = 512 N = 32768 Run time: 6.219 sec
536854528 calculations Block_size = 1024 N = 32768 Run time: 6.131 sec
536854528 calculations Block_size = 2048 N = 32768 Run time: 6.077 sec
536854528 calculations Block_size = 4096 N = 32768 Run time: 6.216 sec
536854528 calculations Block_size = 8192 N = 32768 Run time: 6.130 sec
536854528 calculations Block_size = 16384 N = 32768 Run time: 6.181 sec
I have checked over and over the code and it seems to be correct. What am I missing here? Does the compiler optimize the code to implement better cache usage than what I am trying to achieve? Thanks in advance!
Been a long time but just to answer-close the question.
float compare_points_BF(register int N, register int B, point *P, register point *p1, *p2;){
register int i, j, ib, jb, iin, jjn, num_blocks = (N + (B-1)) / B;
register float distance=0, min_dist=FLT_MAX, regx, regy;
//break array data in N/B blocks
for (i = 0; i < num_blocks; i++){
for (j = i; j < num_blocks; j++){
iin = ( ((i+1)*B) < N ? ((i+1)*B) : N);
jjn = (((j+1)*B) < N ? ((j+1)*B) : N);
//reads the moving frame block to compare with the i block
for (jb = j * B; jb < jjn; jb++){
//avoid float comparisons that occur when i block = j block
//Registers Allocated
regx = P[jb].x;
regy = P[jb].y;
for (i==j ? (ib=jb+1):(ib=i*B); ib < iin; ib++){
//calculate distance of current points
if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
(P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
min_dist = distance;
p1 = &P[ib];
p2 = &P[jb];
}
}
}
}
}
return sqrt(min_dist);
}
This use of cache can speed up to 10%. Here are some results.
Block
Size Number of elements
8192 16384 32768 65536 131072 262144 524288 1048576
128 0,079 0,310 1,260 4,960 19,740 78,990 315,661 1.260,862
256 0,079 0,310 1,250 4,940 19,830 78,820 315,410 1.258,402
512 0,080 0,320 1,260 4,920 19,640 78,480 313,851 1.253,141
1024 0,080 0,320 1,250 4,870 19,430 77,540 310,120 1.237,772
2048 0,079 0,310 1,240 4,850 19,340 77,061 308,211 1.229,892
4096 0,079 0,300 1,210 4,890 19,670 78,300 313,310 1.250,572
8192 0,078 0,310 1,210 4,870 19,510 78,110 312,770 1.249,091
16384 0,300 1,200 4,860 19,420 77,870 312,151 1.246,192
32768 1,190 4,780 19,310 77,460 310,970 1.242,102
65536 4,760 19,230 77,660 312,191 1.249,872
131072 18,972 76,850 310,470 1.246,261
262144 76,400 307,521 1.239,402