I was reading Knuth's The Art of Computer Programming and noticed that he lists the DIV instruction as taking six times as long as ADD in his MIX assembly language.
To test whether this still holds on modern architectures, I wrote the following code snippet:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    clock_t start;
    unsigned int ia = 0, ib = 0, ic = 0;
    int i;
    float fa = 0.0, fb = 0.0, fc = 0.0;
    int sample_size = 100000;

    if (argc > 1)
        sample_size = atoi(argv[1]);

#define TEST(OP) \
    start = clock(); \
    for (i = 0; i < sample_size; ++i) \
        ic += (ia++) OP ((ib--) + 1); \
    printf("%d,", (int)(clock() - start))

    TEST(+);
    TEST(*);
    TEST(/);
    TEST(%);
    TEST(>>);
    TEST(<<);
    TEST(&);
    TEST(|);
    TEST(^);
#undef TEST

    // TEST must be redefined for floating point types
#define TEST(OP) \
    start = clock(); \
    for (i = 0; i < sample_size; ++i) \
        fc += (fa += 0.5) OP ((fb -= 0.5) + 1); \
    printf("%d,", (int)(clock() - start))

    TEST(+);
    TEST(*);
    TEST(/);
#undef TEST

    printf("\n");
    return ic + fc; // to prevent optimization!
}
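A side note on the "prevent optimization" comment: returning ic+fc only tells the compiler that the final values are used. A stronger alternative is an empty inline-asm barrier in each loop body; the sketch below assumes GCC or Clang (the extension is non-portable), and the KEEP macro name is just for illustration:

/* GCC/Clang extension: pretends to read and modify x, so the compiler must keep computing it */
#define KEEP(x) __asm__ volatile("" : "+r"(x))

    for (i = 0; i < sample_size; ++i) {
        ic += (ia++) + ((ib--) + 1);
        KEEP(ic);  /* the loop survives even if ic were never used afterwards */
    }

With a barrier like this inside each TEST loop, the measurement no longer relies on the return-value trick, at the cost of portability.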
I then generated 4000 test samples (each containing a sample size of 100000 operations of each type) using this command line:
for i in {1..4000}; do ./test >> output.csv; done
Finally, I opened the results in Excel and graphed the averages. What I found was rather surprising.
The actual averages (clock ticks, in the order the tests run) were:

int:   +  463.36475   *  437.38475   /  806.59725   %  821.70975   >>  419.56525   <<  417.85725   &  426.35975   |  425.9445   ^  423.792
float: +  549.91975   *  544.11825   /  543.11425
Overall this is what I expected (division and modulo are slow, as are the floating-point operations).
My question is: why do both integer and floating-point multiplication execute faster than their addition counterparts? The difference is small, but it is consistent across numerous tests. In TAOCP Knuth lists ADD as taking 2 units of time while MUL takes 10. Has something changed in CPU architecture since then?
To estimate the execution time, look at the instructions in the assembly listing the compiler produces, then look up those instructions in the processor's documentation, noting whether the operation is carried out by the FPU or directly by the integer unit.
Then add up the documented execution time of each instruction.
However, if the CPU is pipelined or multi-threaded, the sequence can take much less time than that sum suggests.
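To see the pipelining effect concretely, one rough experiment is to compare a loop in which every multiply depends on the previous result (so each instruction's full latency is exposed) with a loop that keeps several independent chains in flight (so the pipeline can overlap them). The sketch below is my own illustration, not part of the original test: it assumes GCC or Clang at -O2 on a modern out-of-order CPU, uses clock() as in the question, and the constant N and variable names are arbitrary.

#include <stdio.h>
#include <time.h>

#define N 100000000

int main(void)
{
    /* volatile seed keeps the compiler from evaluating the loops at compile time */
    volatile unsigned int seed = 3;
    unsigned int dep = seed;
    unsigned int a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    clock_t t;
    int i;

    /* One dependent chain: each multiply must wait for the previous result (latency-bound). */
    t = clock();
    for (i = 0; i < N; ++i)
        dep = dep * 3 + 1;
    printf("1 chain:  %ld ticks\n", (long)(clock() - t));

    /* Four independent chains: four times the multiplies, but the CPU can overlap
       them in its pipeline, so this often takes far less than four times as long. */
    t = clock();
    for (i = 0; i < N; ++i) {
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
        d = d * 3 + 1;
    }
    printf("4 chains: %ld ticks\n", (long)(clock() - t));

    return (int)(dep + a + b + c + d); /* use the results so the loops are not optimized away */
}

On a typical out-of-order core the second loop does four times the work in not much more time, which is exactly the effect that a simple sum of per-instruction times misses.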