ArrayFire CUDA application is extremely slow in the first minute

I am writing a test program using ArrayFire on Windows 10 with an NVIDIA GTX 970. The program trains a neural network with an SGD solver, so the main computation is the iteration that updates the network parameters. This iteration lives in a function called step().

The program does what is expected, except that it runs extremely slowly in the first minute. The following is the output of the program. The first column is the elapsed time in seconds.

ArrayFire v3.5.1 (CUDA, 64-bit Windows, build 0a675e8)
Platform: CUDA Toolkit 8, Driver: CUDA Driver Version: 8000
[0] GeForce GTX 970, 4096 MB, CUDA Compute 5.2
  time epochs training error
     5  0.002 5.6124567
     6  0.007 5.5981609
     7  0.010 5.3560046
     8  0.015 5.2485286
     9  0.020 5.1370633
    10  0.022 5.1081303
     ....
    52  0.148 3.2528560
    53  0.150 3.2425120
    54  0.153 3.2180901
    55  0.155 3.2048657
    56  0.157 3.1949191
    57  0.158 3.1816899
    58  0.160 3.1717312
    59  0.162 3.1597322
    60  0.165 3.1370639
    60  0.498 2.1359600
    61  0.548 2.0685355
    61  0.882 1.7098215
    62  0.943 1.6575973
    62  1.277 1.4156345
    63  1.343 1.3845720
    63  1.677 1.1789854
    64  1.733 1.1549067
    64  2.067 1.0162785
     ....
    71  4.517 0.4732214
    71  4.850 0.4522045
    72  4.910 0.4501807
    72  5.243 0.4355422
    73  5.305 0.4307187

As you can see, in the first minute it did not even finish 1/5 of an epoch. After one minute, however, it suddenly sped up and completed one epoch in around 4 seconds.

The profiling data tells the same story: in the first minute, the average execution time of step() is around 500 ms, but after the first minute it drops to 6 ms.

NVIDIA Visual Profiler shows that the GPU is idle almost all the time during the first minute.

I have no clue what could cause the change in performance before and after the first minute. Any help is appreciated.

Solution

ArrayFire uses JIT compilation at runtime to fuse multiple function calls. So when you perform an addition or any other element-wise operation, ArrayFire creates a custom kernel and executes it. Generating a kernel for the first time has some overhead, but the kernels are cached, so later calls with the same kernel do not need to be compiled again. Usually it should only take a couple of iterations before no further compilations are necessary. It's odd that the kernels are still slow after 60 or so iterations.

JIT kernels are evaluated using internal heuristics based on memory and the size of the kernels. Perhaps your application is not triggering evaluation optimally, which causes additional kernel compilations. You can get around this by forcing evaluation with a call to the eval function on a variable. Here is a contrived example:

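#include <arrayfire.h>
using namespace af;

int main() {
    array a = randu(10, 10);
    array b = randu(10, 10);
    for (int i = 0; i < 100; i++) {
        a += b / 4;   // element-wise operations are queued lazily in the JIT tree
        b *= i;
        eval(a, b);   // force evaluation of both arrays in this iteration
    }
    return 0;
}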

Here you are evaluating the JIT tree for variables a and b at each iteration. This reuses the same kernel at each iteration instead of creating a new kernel for different multiples of iterations.
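To make the caching argument concrete, here is a toy model in plain C++ (no ArrayFire involved). It is only a sketch under simplifying assumptions: ToyJit, simulate and the signature strings are hypothetical names invented for illustration, and a "kernel" is identified purely by how many loop steps were fused into it. ArrayFire's real cache and heuristics are more involved.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Toy stand-in for a JIT kernel cache: one "compilation" per distinct
// expression signature. This is NOT ArrayFire's implementation.
struct ToyJit {
    std::unordered_set<std::string> cache;  // signatures compiled so far
    std::size_t compilations = 0;

    // Pretend to run the fused kernel for `signature`, compiling on a miss.
    void run(const std::string& signature) {
        if (cache.insert(signature).second) {
            ++compilations;  // cache miss: a real JIT would pay the compile cost here
        }
    }
};

// Simulate `iterations` loop steps. Each step appends the same operations to
// a pending lazy expression, which is flushed whenever shouldFlush(i) is true.
template <typename Pred>
std::size_t simulate(int iterations, Pred shouldFlush) {
    ToyJit jit;
    std::string pending;  // signature of the not-yet-evaluated expression
    for (int i = 0; i < iterations; ++i) {
        pending += "(a+=b/4;b*=s)";  // the toy model ignores the scalar's value
        if (shouldFlush(i)) {
            jit.run(pending);
            pending.clear();
        }
    }
    if (!pending.empty()) jit.run(pending);
    return jit.compilations;
}

// Flush every step: the pending expression always has the same shape.
std::size_t compilationsWithEvalEachStep(int iterations) {
    return simulate(iterations, [](int) { return true; });
}

// Flush at growing, irregular depths, mimicking a heuristic that fires at
// different points and therefore sees differently shaped expressions.
std::size_t compilationsWithIrregularFlush(int iterations) {
    int next = 0, gap = 1;
    return simulate(iterations, [&](int i) {
        if (i == next) { next += ++gap; return true; }
        return false;
    });
}
```

In this model, flushing on every step only ever produces one signature, so only the first step compiles. Flushing at irregular depths keeps producing new signatures, and each one costs a fresh compilation.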

One thing to note is that element-wise functions and some conditional functions such as select and shift are JITed. Other functions force evaluation of their parameters before they are used. Also, if you evaluate too often, you will decrease the performance of your application.
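If evaluating on every iteration turns out to be too frequent, one option is to evaluate at a fixed interval and measure which interval works best. The helper below is a minimal sketch in plain C++. runWithPeriodicFlush is a hypothetical name, not part of ArrayFire, and it assumes interval is at least 1 (with a smaller value it only flushes once at the end).

```cpp
#include <cstddef>
#include <functional>

// Calls `step` once per iteration and `flush` after every `interval`-th
// iteration, plus once at the end if steps are still pending.
// Returns how many times `flush` was called.
std::size_t runWithPeriodicFlush(int iterations, int interval,
                                 const std::function<void(int)>& step,
                                 const std::function<void()>& flush) {
    std::size_t flushes = 0;
    int pending = 0;  // steps queued since the last flush
    for (int i = 0; i < iterations; ++i) {
        step(i);
        if (++pending == interval) {
            flush();
            ++flushes;
            pending = 0;
        }
    }
    if (pending > 0) {  // flush whatever is left over
        flush();
        ++flushes;
    }
    return flushes;
}
```

With ArrayFire, step would contain the lazy updates from the loop above and flush would be something like a lambda that calls eval(a, b). Because the interval is fixed, the evaluated expression should have the same shape each time, but the best interval is something to determine by timing your own application.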