I am trying to speed up a single program by using prefetches. The purpose of my program is just for test. Here is what it does:
- It uses two int buffers of the same size
- It reads one-by-one all the values of the first buffer
- It reads the value at the index in the second buffer
- It sums all the values taken from the second buffer
- It does all the previous steps for bigger and bigger
- At the end, I print the number of voluntary and involuntary CPU
In the very first time, values in the first buffers contains the values of its index (cf. function createIndexBuffer
in the code just below) .
It will be more clear in the code of my program:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 100000)
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = i;
}
return (indexBuffer);
}
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
unsigned long long sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
unsigned int index = indexBuffer[i];
sum += valueBuffer[index];
}
return (sum);
}
unsigned int computeTimeInMicroSeconds()
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer();
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
unsigned long long sum = computeSum(indexBuffer, valueBuffer);
gettimeofday(&endTime, NULL);
printf("Sum = %llu\n", sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
int main()
{
printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds();
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
If I launch it, I get the following output:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439813150288855829
Time: 201172 micro-seconds = 0.201 seconds
Quick and fast!!! According to my knowledge (I may be wrong), one of the reason for having such a fast program is that, as I access my two buffers sequentially, data can be prefetched in the CPU cache.
We can make it more complex in order that data is (almost) prefeched in CPU cache. For example, we can just change the createIndexBuffer
function in:
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
Let's try the program once again:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439835307963131237
Time: 3730387 micro-seconds = 3.730 seconds
More than 18 times slower!!!
We now arrive to my problem. Given the new createIndexBuffer
function, I would like to speed up computeSum
function using prefetch
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
unsigned long long sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &indexBuffer[i + 1], 0, 0);
unsigned int index = indexBuffer[i];
sum += valueBuffer[index];
}
return (sum);
}
of course I also have to change my createIndexBuffer
in order it allocates a buffer having one more element
I relaunch my program: not better! As prefetch may be slower than one "for" loop iteration, I may prefetch not one element before but two elements before
__builtin_prefetch((char *) &indexBuffer[i + 2], 0, 0);
not better! two loops iterations? not better? Three? **I tried it until 50 (!!!) but I cannot enhance the performance of my function computeSum
.
Can I would like help to understand why Thank you very much for your help
I believe that above code is automatically optimized by CPU without any further space for manual optimization.
1. Main problem is that indexBuffer
is sequentially accessed. Hardware prefetcher senses it and prefetches further values automatically, without need to call prefetch manually. So, during iteration #i, values indexBuffer[i+1]
, indexBuffer[i+2]
,... are already in cache. (By the way, there is no need to add artificial element to the end of array: memory access errors are silently ignored by prefetch instructions).
What you really need to do is to prefetch valueBuffer
instead:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);
2. But adding above line of code won't help either in such simple scenario. Cost of accessing memory is hundreds of cycles, while add instruction is ~1 cycle. Your code already spends 99% of time in memory accesses. Adding manual prefetch will make it this one cycle faster and no better.
Manual prefetch would really work well if your math were much more heavy (try it), like using an expression with large number of non-optimized out divisions (20-30 cycles each) or calling some math function (log, sin).
3. But even this doesn't guarantee to help. Dependency between loop iterations is very weak, it is only via sum
variable. This allows CPU to execute instructions speculatively: it may start fetching valueBuffer[i+1]
concurrently while still executing math for valueBuffer[i]
.