In the code pasted below, I am running two functions f1
and f2
which perform the same exact job. Take a number T
and integer array arr
(that has been initialized to 0
everywhere) and then increment T
times each element of arr
Thus by the end of both f1
and f2
, the input 0,0...0
should become T,T...T
.
What I don't understand is why f1
runs so much slower than f2
(about 1.76 times slower when T
is, say, 1 billion). Here is my output
➜ Desktop gcc timing-difference.c && ./a.out 1000000000 16
-----> Running f1 <------- Time taken: 15.511 seconds
-----> Running f2 <------- Time taken: 8.887 seconds%
Here T
is supplied to the program as argv[1]
and the array length as argv[2]
. The timing-difference.c
file is pasted below at the end of this post.
Basically, f1
is directly incrementing each arr[i]
, whereas f2
is using a temporary variable tmp
for the increment, and then assigning it to arr[i] when done.
#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>
#include<time.h>
/*Directly increment arr[i] for each i*/
void f1(int T, int* arr, int arrlength){
printf("\n-----> Running f1 <-------\n");
clock_t end, start;
double cpu_time_used;
// For each element of arr, increment it T times.
start = clock();
for(int i = 0 ; i<arrlength ;++i){
for (int j=0 ; j<T ; ++j){
arr[i] += 1;
}
}
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("Time taken: %.3f seconds", cpu_time_used);
}
/*increment arr[i] using temporary variable tmp*/
void f2(int T, int* arr, int arrlength){
printf("\n-----> Running f2 <-------\n");
clock_t end, start;
double cpu_time_used;
// For each element of arr, increment it T times.
start = clock();
for(int i = 0 ; i<arrlength ;++i){
int tmp=arr[i];
for (int j=0 ; j<T ; ++j){
tmp += 1;
}
arr[i] = tmp;
}
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("Time taken: %.3f seconds", cpu_time_used);
}
void print_arr(int* arr, int arrlength){
for(int i = 0 ; i<arrlength ; ++i){
printf("%d ", arr[i]);
}
printf("\n\n");
}
/*Zero initialize array*/
void initialize_array(int* arr, int arrlength){
for (int i=0 ; i<arrlength ; ++i){
arr[i] = 0;
}
}
int main(int argc, char** argv){
int T = atoi(argv[1]);
int arrlength = atoi(argv[2]);
int arr[arrlength];
initialize_array(arr, arrlength);
f1(T,arr,arrlength); // --> why does this run slower than f2?
//print_arr(arr, arrlength);
printf("\n\n");
initialize_array(arr, arrlength);
f2(T,arr,arrlength);
//print_arr(arr, arrlength);
return 0;
}
EDIT: I get the same time measurements when I call f2
first and f1
later or even if I run it multiple times.
With -O2 enabled I get the timings as 0.000 on both. I was curious about the default settings gcc was using for compiling and why there was such a big difference in performance. The answer as wohlstad suggests must of course be in the assembly, but unfortunately, I cannot read x86 assembly well at all :-( for an informed understanding
It turns out there's quite a bit going on at the assembly level in f1
for this instruction:
arr[i] += 1;
This compiles to:
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-48]
add rax, rdx
mov edx, DWORD PTR [rax]
mov eax, DWORD PTR [rbp-4]
cdqe
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-48]
add rax, rcx
add edx, 1
mov DWORD PTR [rax], edx
Which is run on each iteration of the inner loop.
While this line in f2
:
tmp += 1;
Complies down to this:
add DWORD PTR [rbp-8], 1
And the assignment back:
arr[i] = tmp;
Compiles to this:
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-64]
add rdx, rax
mov eax, DWORD PTR [rbp-8]
mov DWORD PTR [rdx], eax
Which only runs in the outer loop. This explains the difference in runtime.