I am writing a test program using ArrayFire, running on Windows 10 with an NVIDIA GTX 970. The program trains a neural network with an SGD solver, so the main computation is the iteration that updates the network parameters. That iteration lives in a function called step().
The program does what is expected, except that it runs extremely slowly during the first minute. The output of the program follows. The first column is the elapsed time.
ArrayFire v3.5.1 (CUDA, 64-bit Windows, build 0a675e8)
Platform: CUDA Toolkit 8, Driver: CUDA Driver Version: 8000
[0] GeForce GTX 970, 4096 MB, CUDA Compute 5.2
time  epochs  training error
   5   0.002  5.6124567
   6   0.007  5.5981609
   7   0.010  5.3560046
   8   0.015  5.2485286
   9   0.020  5.1370633
  10   0.022  5.1081303
....
  52   0.148  3.2528560
  53   0.150  3.2425120
  54   0.153  3.2180901
  55   0.155  3.2048657
  56   0.157  3.1949191
  57   0.158  3.1816899
  58   0.160  3.1717312
  59   0.162  3.1597322
  60   0.165  3.1370639
  60   0.498  2.1359600
  61   0.548  2.0685355
  61   0.882  1.7098215
  62   0.943  1.6575973
  62   1.277  1.4156345
  63   1.343  1.3845720
  63   1.677  1.1789854
  64   1.733  1.1549067
  64   2.067  1.0162785
....
  71   4.517  0.4732214
  71   4.850  0.4522045
  72   4.910  0.4501807
  72   5.243  0.4355422
  73   5.305  0.4307187
As you can see, in the first minute it did not finish even 1/5 of an epoch. After one minute, however, it suddenly sped up and completed one epoch in around 4 seconds.
The profiling data tells the same story: during the first minute, the average execution time of step() is around 500 ms, but after the first minute it drops to 6 ms.
The NVIDIA Visual Profiler shows that the kernel is almost idle throughout the first minute.
I have no clue what could cause the change in performance before and after the first minute. Any help is appreciated.
ArrayFire uses JIT compilation at runtime to fuse multiple function calls. So when you perform an addition or any other element-wise operation, ArrayFire creates a custom kernel and executes it. There is some overhead the first time a kernel is generated, but the kernels are cached, so later calls do not need to be compiled again. Usually, it should take only a couple of iterations before no further compilation is necessary. It's odd that the kernels are slow even after 60 or so iterations.
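To check how many iterations the compilation overhead really lasts in your program, it can help to time each call on its own rather than look at an average. The following is a minimal sketch in plain C++, not anything provided by ArrayFire: time_each_call and first_fast_call are hypothetical helper names, and the callable you pass in is assumed to block until the GPU work has finished (for example by calling af::sync() at the end), otherwise only the launch time is measured.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls `step` `iterations` times and returns the
// wall-clock duration of each call in milliseconds.
template <typename Step>
std::vector<double> time_each_call(Step&& step, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<double> durations_ms;
    durations_ms.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        step();  // assumed to block until the device work is done
        const auto stop = clock::now();
        durations_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return durations_ms;
}

// Hypothetical helper: index of the first call that took at most
// `threshold_ms`, or durations_ms.size() if no call was that fast.
std::size_t first_fast_call(const std::vector<double>& durations_ms,
                            double threshold_ms) {
    for (std::size_t i = 0; i < durations_ms.size(); ++i) {
        if (durations_ms[i] <= threshold_ms) {
            return i;
        }
    }
    return durations_ms.size();
}
```

If the slow phase ends after a handful of calls, that is consistent with one-time compilation. If it lasts for hundreds of calls, as in the question, new kernels are probably still being generated.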
JIT kernels are evaluated using internal heuristics based on memory and the size of the kernels. Perhaps your application is not triggering evaluation optimally and is causing additional kernel compilations. You could get around this by forcing the evaluation by calling the eval function on a variable. Here is a contrived example:
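array a = randu(10, 10);
array b = randu(10, 10);
for (int i = 0; i < 100; i++) {
    // Both updates are element-wise, so they are only recorded in the JIT tree.
    a += b / 4;
    b *= i;
    // Force evaluation of both trees once per iteration.
    eval(a, b);
}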
Here you are evaluating the JIT trees for variables a and b at each iteration. Because the tree has the same shape every time it is evaluated, the same cached kernel is reused at each iteration, instead of a new kernel being created for each different number of accumulated iterations.
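The effect can be illustrated with a toy model of a kernel cache. This is only a sketch of the idea and not how ArrayFire is implemented: make_signature and count_compiles are made-up names, and a string stands in for the shape of the JIT tree. Each launch is described by how many a += b / 4 updates were fused into it, and a "compilation" is counted whenever a signature is seen for the first time.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Toy signature for a tree that fuses `fused_updates` chained
// "a += b / 4" updates (hypothetical stand-in for the JIT tree shape).
std::string make_signature(int fused_updates) {
    std::string signature = "a";
    for (int i = 0; i < fused_updates; ++i) {
        signature += "+b/4";
    }
    return signature;
}

// Counts how many distinct kernels would be compiled for a sequence of
// launches, where each entry is the number of updates fused into that launch.
std::size_t count_compiles(const std::vector<int>& fused_updates_per_launch) {
    std::unordered_set<std::string> cache;
    std::size_t compiles = 0;
    for (int fused_updates : fused_updates_per_launch) {
        // insert() reports true only for a signature not seen before,
        // which corresponds to a cache miss and a new compilation.
        if (cache.insert(make_signature(fused_updates)).second) {
            ++compiles;
        }
    }
    return compiles;
}
```

In this model, evaluating after every update (every entry equal to 1) costs a single compilation no matter how many iterations run, while launches that fuse 1, 2, 3, 4 and 5 updates cost five compilations.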
One thing to note is that element-wise functions, and some conditional functions like select and shift, are JITed. Other functions force evaluation of their parameters before they are used. Also, if you evaluate too often, you will decrease the performance of your application.