I am performing some execution time benchmarks for my implementation of quicksort. Out of 100 successive measurements on exactly the same input data, the first call to quicksort takes roughly 10 times longer than all subsequent calls. Is this a consequence of the operating system getting ready to execute the program, or is there some other explanation? Moreover, is it reasonable to discard the first measurement when computing an average runtime?
The bar chart below shows execution time (milliseconds) versus method call number. Each time the method is called it processes exactly the same data.
To produce this particular graph, the main method calls quicksort_timer::time_fpi_quicksort(5, 100), whose implementation is shown below.
static void time_fpi_quicksort(int size, int runs)
{
    std::vector<int> vector(size);
    for (int i = 0; i < runs; i++)
    {
        // Regenerate the exact same contents on every iteration (constant seed)
        vector = utilities::getRandomIntVectorWithConstantSeed(size);
        Timer timer; // RAII timer: measures from here to the end of the loop body
        quicksort(vector, ver::FixedPivotInsertion);
    }
}
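Timer is an RAII stopwatch. Its implementation is not shown here, but a minimal sketch of such a class (an assumption, not necessarily the exact one used) could look like this:

#include <chrono>
#include <iostream>

// Hypothetical RAII stopwatch: records the start time on construction
// and prints the elapsed milliseconds on destruction. The actual Timer
// used for the measurements above may differ.
class Timer
{
public:
    Timer() : start_(std::chrono::steady_clock::now()) {}
    ~Timer()
    {
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> elapsed = stop - start_;
        std::cout << elapsed.count() << " ms\n";
    }
private:
    std::chrono::steady_clock::time_point start_;
};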
The getRandomIntVectorWithConstantSeed function is implemented as follows:
std::vector<int> getRandomIntVectorWithConstantSeed(int size)
{
    std::vector<int> vector(size);
    srand(6475307); // constant seed: rand() is reset to the same sequence on every call
    for (int i = 0; i < size; i++)
        vector[i] = rand();
    return vector;
}
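Because the seed is constant, every call returns an identical array; for example:

#include <cassert>

void check_constant_seed()
{
    // Same seed, same rand() sequence: element-for-element equal vectors
    std::vector<int> a = getRandomIntVectorWithConstantSeed(5);
    std::vector<int> b = getRandomIntVectorWithConstantSeed(5);
    assert(a == b);
}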
CPU and Compilation
CPU: Broadwell 2.7 GHz Intel Core i5 (5257U)
Compiler Version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Compiler Options: -std=c++17 -O2 -march=native
Yes, it could be a page fault on the page holding the code for the sort function (and the timing code itself). The 10x could also include ramp-up to max turbo clock speed.
Caching is not plausible, though: you're writing the (tiny) array outside the timed region, unless the compiler somehow reordered the init with the constructor of your Timer. Memory allocation being much slower the first time would easily explain it: the first call to new (to construct the std::vector) may have to make a system call to get a new page from the OS, while later calls just grab already-hot-in-cache memory from the allocator's free list.
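One way to test the allocation theory is to hoist the allocation out of the loop and reuse a single buffer, so no timed run pays for a cold allocator path. A sketch, reusing the helper names from the question:

static void time_fpi_quicksort_reuse(int size, int runs)
{
    // Allocate once, outside every timed region.
    const std::vector<int> input = utilities::getRandomIntVectorWithConstantSeed(size);
    std::vector<int> scratch = input; // working copy, also allocated once

    for (int i = 0; i < runs; i++)
    {
        scratch = input; // restores the contents without reallocating
        Timer timer;
        quicksort(scratch, ver::FixedPivotInsertion);
    }
}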
Training the branch predictors could also be a big factor, but you'd expect it to take more than one run for the TAGE branch predictors in a modern Intel CPU, or the perceptron predictors in a modern AMD, to learn the full pattern of all the branching. But maybe they get close after the first run.
Note that you produce the same random array every time, by calling srand() with a constant seed on every call. To test whether branch prediction is the explanation, remove the srand call so you get a different array on every run, and see if the time stays much higher.
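Concretely, the experiment only needs the seeding line dropped, so rand() keeps advancing through its global sequence across calls (renamed here since the seed is gone):

std::vector<int> getRandomIntVector(int size)
{
    std::vector<int> vector(size);
    // No srand(): rand() continues its sequence, so each call returns a
    // different array and the predictors can't simply replay the branch
    // pattern learned on the first run.
    for (int i = 0; i < size; i++)
        vector[i] = rand();
    return vector;
}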
What CPU, compiler version / options, etc. are you using?