I am trying to find a way to cache my array of elements effectively in an application that uses the closest pair algorithm (brute force for the time being). The paper The Cache Performance and Optimizations of Blocked Algorithms says:
Blocking is a general optimization technique for increasing the effectiveness of a memory hierarchy. By reusing data in the faster level of the hierarchy, it cuts down the average access latency. It also reduces the number of references made to slower levels of the hierarchy. Blocking is thus superior to optimizations such as prefetching, which hides the latency but does not reduce the memory bandwidth requirement. This reduction is especially important for multiprocessors since memory bandwidth is often the bottleneck of the system. Blocking has been shown to be useful for many algorithms in linear algebra.
The paper gives matrix multiplication code and a blocked version of it that reduces cache misses:
for kk = 1 to N by B
    for jj = 1 to N by B
        for i = 1 to N
            for k = kk to min(kk + B - 1, N)
                r = X[i,k]  // register allocated
                for j = jj to min(jj + B - 1, N)
                    Z[i,j] += r * Y[k,j]
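As I understand it, in C++ that loop nest would look roughly like this (the flat row-major std::vector layout and 0-based indexing are my own choices, not the paper's):

    #include <algorithm>
    #include <vector>

    // Blocked matrix multiply Z += X * Y, following the paper's loop nest.
    // N is the matrix dimension and B the blocking factor; row-major
    // storage in flat vectors is my assumption.
    void blockedMatmul(const std::vector<double>& X,
                       const std::vector<double>& Y,
                       std::vector<double>& Z,
                       int N, int B)
    {
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = 0; i < N; ++i)
                    for (int k = kk; k < std::min(kk + B, N); ++k) {
                        double r = X[i * N + k];  // the "register allocated" r
                        for (int j = jj; j < std::min(jj + B, N); ++j)
                            Z[i * N + j] += r * Y[k * N + j];
                    }
    }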
Here, B is the blocking factor, but how can we determine it? Is there a generic way to find the specific limit that a CPU cache can handle? Probably not all CPUs have the same cache. My other question is how to apply this blocking technique to my own code.
The closest pair algorithm (brute force) is:
minDist = infinity
for i = 1 to length(P) - 1
    for j = i + 1 to length(P)
        let p = P[i], q = P[j]
        if dist(p, q) < minDist:
            minDist = dist(p, q)
            closestPair = (p, q)
return closestPair
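For reference, this is how I would write the brute force in C++ (the Point type and the squared-distance helper are my own; squared distance avoids a sqrt per pair without changing which pair wins):

    #include <limits>
    #include <utility>
    #include <vector>

    struct Point { float x, y; };

    // Squared distance: enough for comparisons, avoids a sqrt per pair.
    inline float dist2(const Point& p, const Point& q)
    {
        float dx = p.x - q.x, dy = p.y - q.y;
        return dx * dx + dy * dy;
    }

    // Brute-force closest pair: compare every pair (i, j) with i < j.
    std::pair<int, int> closestPairBrute(const std::vector<Point>& P)
    {
        int n = static_cast<int>(P.size());
        float minDist = std::numeric_limits<float>::max();
        std::pair<int, int> best(-1, -1);
        for (int i = 0; i < n - 1; ++i)
            for (int j = i + 1; j < n; ++j) {
                float d = dist2(P[i], P[j]);
                if (d < minDist) {
                    minDist = d;
                    best = std::make_pair(i, j);
                }
            }
        return best;
    }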
To sum up:
1. Is there a generic way to determine the blocking factor B for a given CPU cache?
2. How can I apply blocking to the brute-force closest pair algorithm?
Thanks in advance!
First question:
There is no simple method of determining B without actually testing it on the machine you intend to optimize for. Having said that, with some experimenting you can probably find some "good for most systems" numbers; I worked quite a bit on this sort of thing some 12-15 years back, and I found that blocks of around 8-16KB worked quite well. The effect is quite dramatic between "just run through all the memory as it comes" and "work through blocks". If you start with really small blocks, you see great improvement as you make them bigger; then the return drops off, until B is so large that you are back where you started, throwing out good cache content to fetch something else that will itself be evicted before you ever use it again.
I'm pretty sure that if you run your code over a selection of sizes of B and plot the time taken, the graph will look like a "bathtub" (or an upside-down bathtub if you plot the number of items processed per unit of time). Just pick some point in the flat part of the bathtub. Do try it on a few different machines, though, to make sure you are in the flat bit on all (or at least most) of them.
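To make that concrete, here is the sort of sweep I mean. It reuses the Point type from your question and the closestPairBlocked routine I sketch further down in this answer; the candidate values of B are arbitrary:

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <utility>
    #include <vector>

    struct Point { float x, y; };  // same layout as in the question

    // Sketched later in this answer; link against that definition.
    std::pair<int, int> closestPairBlocked(const std::vector<Point>& P, int B);

    int main()
    {
        // Random test data; 20000 points keeps the O(n^2) runtime manageable.
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> coord(0.0f, 1000.0f);
        std::vector<Point> P(20000);
        for (Point& p : P)
            p = {coord(rng), coord(rng)};

        // Sweep candidate blocking factors; plotting time against B is what
        // should produce the "bathtub" curve described above.
        for (int B : {64, 128, 256, 512, 1024, 2048, 4096, 8192}) {
            auto t0 = std::chrono::steady_clock::now();
            closestPairBlocked(P, B);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("B = %5d: %.3f s\n", B,
                        std::chrono::duration<double>(t1 - t0).count());
        }
        return 0;
    }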
For your second question, something like this:
minDist = infinity
for i = 1 to length(P) - 1 by B
    for j = i + 1 to length(P) by B
        for ib = i to i + B - 1
            for jb = max(j, ib + 1) to j + B - 1
                let p = P[ib], q = P[jb]
                if dist(p, q) < minDist:
                    minDist = dist(p, q)
                    closestPair = (p, q)
return closestPair
If length(P) is not a multiple of B, there's a bit of extra work to deal with the last few elements: instead of i + B - 1 in the ib loop you need min(length(P) - 1, i + B - 1), and similarly min(length(P), j + B - 1) in the jb loop. The max(j, ib + 1) in the jb loop matters when the first j block overlaps the i block; without it, jb could run at or below ib, and comparing an element with itself would set minDist to zero.
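In C++ it might look something like this. It reuses the Point and dist2 helpers from the brute-force sketch in your question (names of my choosing), with the boundary handling folded in; treat it as a sketch rather than tuned code:

    #include <algorithm>
    #include <limits>
    #include <utility>
    #include <vector>

    struct Point { float x, y; };

    inline float dist2(const Point& p, const Point& q)
    {
        float dx = p.x - q.x, dy = p.y - q.y;
        return dx * dx + dy * dy;
    }

    // Blocked brute-force closest pair. The min() bounds handle a point
    // count that is not a multiple of B; starting jb at max(j, ib + 1)
    // keeps every comparison strictly "first index < second index".
    std::pair<int, int> closestPairBlocked(const std::vector<Point>& P, int B)
    {
        int n = static_cast<int>(P.size());
        float minDist = std::numeric_limits<float>::max();
        std::pair<int, int> best(-1, -1);
        for (int i = 0; i < n - 1; i += B)
            for (int j = i + 1; j < n; j += B)
                for (int ib = i; ib < std::min(i + B, n - 1); ++ib)
                    for (int jb = std::max(j, ib + 1); jb < std::min(j + B, n); ++jb) {
                        float d = dist2(P[ib], P[jb]);
                        if (d < minDist) {
                            minDist = d;
                            best = std::make_pair(ib, jb);
                        }
                    }
        return best;
    }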
Edit:
The cache itself decides what data it keeps, and there is very little you can do to alter what happens there. What you can change is which blocks of data you work over.
The key to "blocking" is to keep data that is being worked on in the (L1) cache.
Let's say the whole block of data is 100000 elements of 4 bytes each, so about 400KB. That won't fit in the L1 cache of any modern processor, since L1 is at most 64KB, often 32KB. So as we loop over the items with i, the j loop will "throw out" all the good L1 cache content by loading the later parts of the array. And of course when the j loop starts over the next time, none of the data currently in the cache is useful, because it's all high indices of the array.
If we instead work through a little bit of the array at a time, in blocks, we work through a B-by-B square of comparisons in each pass, where B elements don't use more space than can fit in the cache. That way the jb loop doesn't throw out the data for the ib loop (or vice versa), and each of the inner loops runs significantly quicker. I've seen more than 3 times faster execution, and that was on code that was already supposed to be "good".
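If you want a starting point for B instead of a blind sweep, you can ask the operating system for the L1 data cache size. This sketch is glibc-only (_SC_LEVEL1_DCACHE_SIZE is a glibc extension, and sysconf may report 0 or -1 if the value is unknown), and the headroom factor is a guess of mine, not a measured constant:

    #include <cstdio>
    #include <unistd.h>

    int main()
    {
        // glibc extension; other platforms need their own query
        // (e.g. sysctl on macOS, GetLogicalProcessorInformation on Windows).
        long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        if (l1 <= 0) {
            std::printf("L1 data cache size unknown; fall back to sweeping B\n");
            return 0;
        }
        // The ib and jb loops together touch about 2 * B elements of 4 bytes
        // (the figures from the example above); dividing by an extra factor
        // of 4 leaves room in the cache for everything else.
        long B = l1 / (2 * 4 * 4);
        std::printf("L1 = %ld bytes, suggested starting point B = %ld\n", l1, B);
        return 0;
    }

Whatever number this gives you, still verify it with the timing sweep: the point is only to land you somewhere in the flat part of the bathtub on the first try.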