Understanding the difference in timing of two functions that increment each element of an integer array

In the code pasted below, I am running two functions f1 and f2 which perform the same exact job. Take a number T and integer array arr (that has been initialized to 0 everywhere) and then increment T times each element of arr Thus by the end of both f1 and f2, the input 0,0...0 should become T,T...T.

What I don't understand is why f1 runs so much slower than f2 (about 1.76 times slower when T is, say, 1 billion). Here is my output

➜ Desktop gcc timing-difference.c && ./a.out 1000000000 16

-----> Running f1 <------- Time taken: 15.511 seconds

-----> Running f2 <------- Time taken: 8.887 seconds%

Here T is supplied to the program as argv[1] and the array length as argv[2]. The timing-difference.c file is pasted below at the end of this post.

Basically, f1 is directly incrementing each arr[i], whereas f2 is using a temporary variable tmp for the increment, and then assigning it to arr[i] when done.

#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>
#include<time.h>

/*Directly increment arr[i] for each i*/
void f1(int T, int* arr, int arrlength){

  printf("\n-----> Running f1 <-------\n");
  clock_t end, start;
  double cpu_time_used;

  // For each element of arr, increment it T times.
  start = clock();
  for(int i = 0 ;  i<arrlength ;++i){
    for (int j=0 ; j<T ; ++j){
      arr[i] += 1;
    }
  }
  end = clock();
  cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC; 

  printf("Time taken: %.3f seconds", cpu_time_used);

}

/*increment arr[i] using temporary variable tmp*/
void f2(int T, int* arr, int arrlength){

  printf("\n-----> Running f2 <-------\n");
  clock_t end, start;
  double cpu_time_used;

  // For each element of arr, increment it T times.
  start = clock();
  for(int i = 0 ;  i<arrlength ;++i){

    int tmp=arr[i];
    for (int j=0 ; j<T ; ++j){
      tmp += 1;
    }
    arr[i] = tmp; 
  }

  end = clock();
  cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC; 
  printf("Time taken: %.3f seconds", cpu_time_used);

}

void print_arr(int* arr, int arrlength){
  for(int i = 0 ; i<arrlength ; ++i){
    printf("%d  ", arr[i]);
  }
  printf("\n\n");

}

/*Zero initialize array*/
void initialize_array(int* arr, int arrlength){
  for (int i=0 ; i<arrlength ; ++i){
    arr[i] = 0;
  }

}

int main(int argc, char** argv){

  int T         = atoi(argv[1]);
  int arrlength = atoi(argv[2]);
  int arr[arrlength];
  
  initialize_array(arr, arrlength);
  f1(T,arr,arrlength);  // --> why does this run slower than f2?
  //print_arr(arr, arrlength);
    
  printf("\n\n");

  initialize_array(arr, arrlength);
  f2(T,arr,arrlength); 
  //print_arr(arr, arrlength);
  
  return 0;
}

EDIT: I get the same time measurements when I call f2 first and f1 later or even if I run it multiple times.

With -O2 enabled I get the timings as 0.000 on both. I was curious about the default settings gcc was using for compiling and why there was such a big difference in performance. The answer as wohlstad suggests must of course be in the assembly, but unfortunately, I cannot read x86 assembly well at all :-( for an informed understanding

Solution

It turns out there's quite a bit going on at the assembly level in f1 for this instruction:

arr[i] += 1;

This compiles to:

        mov     eax, DWORD PTR [rbp-4]
        cdqe
        lea     rdx, [0+rax*4]
        mov     rax, QWORD PTR [rbp-48]
        add     rax, rdx
        mov     edx, DWORD PTR [rax]
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        lea     rcx, [0+rax*4]
        mov     rax, QWORD PTR [rbp-48]
        add     rax, rcx
        add     edx, 1
        mov     DWORD PTR [rax], edx

Which is run on each iteration of the inner loop.

While this line in f2:

tmp += 1;

Complies down to this:

        add     DWORD PTR [rbp-8], 1

And the assignment back:

arr[i] = tmp;

Compiles to this:

        mov     eax, DWORD PTR [rbp-4]
        cdqe
        lea     rdx, [0+rax*4]
        mov     rax, QWORD PTR [rbp-64]
        add     rdx, rax
        mov     eax, DWORD PTR [rbp-8]
        mov     DWORD PTR [rdx], eax

Which only runs in the outer loop. This explains the difference in runtime.