The overhead of an OpenCL or CUDA call?

I'm writing a function that does a lot of BLAS gemv operations.

I would like to be able to do this on the GPU, and I've tried with cuBLAS.

My problem is that my matrices and vectors are rather small: a 100x100 matrix and a 100-element vector. cuBLAS takes ages compared to the CPU, and I can see why: the CPU has fast cache, and the calls to the GPU carry a large overhead.

Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.

That is, the time it takes CUDA to set up the call and send it to the graphics processor, not counting the time the matrix-vector multiplication itself takes.

How would I go about doing this?


  • Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.

    The overhead is not the call per se but the compilation of the GPU program and the transfer of data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.

    N       FFTW (ms)   GPUFFT (ms)     GPUFFT MFLOPS   GPUFFT Speedup
    8         0           0.06             3.352705     0.006881
    16        0.001       0.065            7.882117     0.010217
    32        0.001       0.075           17.10887      0.014695
    64        0.002       0.085           36.080118     0.026744
    128       0.004       0.093           76.724324     0.040122
    256       0.007       0.107          153.739856     0.066754
    512       0.015       0.115          320.200892     0.134614
    1024      0.034       0.125          657.735381     0.270512
    2048      0.076       0.156         1155.151507     0.484331
    4096      0.173       0.215         1834.212989     0.804558
    8192      0.483       0.32          2664.042421     1.510011
    16384     1.363       0.605         3035.4551       2.255411
    32768     3.168       1.14          3450.455808     2.780041
    65536     8.694       2.464         3404.628083     3.528726
    131072   15.363       5.027         3545.850483     3.05604
    262144   33.223      12.513         3016.885246     2.655183
    524288   72.918      25.879         3079.443664     2.817667
    1048576 173.043      76.537         2192.056517     2.260904
    2097152 331.553     157.427         2238.01491      2.106081
    4194304 801.544     430.518         1715.573229     1.861814

    The table above shows timings of a GPU FFT implementation against a CPU implementation (FFTW) for increasing transform size N; the speedup column is the CPU time divided by the GPU time. For smaller sizes, the transfer of data to/from the GPU dominates, and the GPU only overtakes the CPU from N = 8192 upwards. Smaller kernels can be performed on the CPU, for some implementations and sizes entirely in cache. This makes the CPU the best choice for small operations.

    If, on the other hand, you need to perform large batches of work on data with minimal moves to/from the GPU, then the GPU will beat the CPU hands down.

    As for measuring the effect in your example, I would suggest performing an experiment like the one above. Work out the FLOPS computed for each size of matrix, and run the test on the CPU and on the GPU for varying sizes of matrix. Output the size, time and FLOPS for GPU vs CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code and time the whole thing, then divide the total time by the number of iterations to get the time per iteration. Also try differently shaped matrices if your algorithm allows (e.g. 10x100 rather than 100x10).
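
    The experiment described above could be sketched roughly as follows in C++. This is only an illustration, not the code used for the FFT table: `gemv`, `seconds_per_call`, `gemv_flops`, `csv_row` and `run_benchmark` are hypothetical names, and the naive CPU `gemv` merely stands in for whatever routine is being measured. To time a GPU path the same way, the timed callable would also need to wait for the device to finish (for instance with `cudaDeviceSynchronize()`), because kernel launches return before the work is done.

    ```cpp
    #include <chrono>
    #include <cstddef>
    #include <ostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Naive row-major y = A * x; a stand-in for the routine under test.
    void gemv(const std::vector<double>& a, const std::vector<double>& x,
              std::vector<double>& y, std::size_t rows, std::size_t cols) {
        for (std::size_t i = 0; i < rows; ++i) {
            double sum = 0.0;
            for (std::size_t j = 0; j < cols; ++j) sum += a[i * cols + j] * x[j];
            y[i] = sum;
        }
    }

    // Run f() `iters` times and return the mean wall-clock seconds per call.
    template <typename F>
    double seconds_per_call(F&& f, int iters) {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) f();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        return elapsed.count() / iters;
    }

    // gemv performs one multiply and one add per matrix element.
    double gemv_flops(std::size_t rows, std::size_t cols) {
        return 2.0 * static_cast<double>(rows) * static_cast<double>(cols);
    }

    // One CSV line: rows, cols, seconds per call, MFLOPS.
    std::string csv_row(std::size_t rows, std::size_t cols, double secs) {
        std::ostringstream out;
        out << rows << ',' << cols << ',' << secs << ','
            << gemv_flops(rows, cols) / secs / 1e6;
        return out.str();
    }

    // Time square n x n problems for each n in `sizes` and write CSV to `out`.
    // Returns a checksum of the results so that the computed values are used.
    double run_benchmark(std::ostream& out, const std::vector<std::size_t>& sizes,
                         int iters) {
        out << "rows,cols,seconds_per_call,mflops\n";
        double checksum = 0.0;
        for (std::size_t n : sizes) {
            std::vector<double> a(n * n, 1.0), x(n, 1.0), y(n, 0.0);
            double secs = seconds_per_call(
                [&] { gemv(a, x, y, n, n); checksum += y[0]; }, iters);
            out << csv_row(n, n, secs) << '\n';
        }
        return checksum;
    }
    ```

    Calling `run_benchmark(std::cout, {10, 100, 1000}, 500)` from `main` would print one CSV line per size; running the same loop around the GPU call gives the second set of columns to compare against.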

    Using this data you can get a feel for what the overheads are. To find out exactly, repeat the same experiment but replace the inner shader code executed on the GPU with a no-operation (one that simply copies from input to output).
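
    A minimal sketch of that comparison, again in C++ and with hypothetical names (`OverheadEstimate`, `seconds_per_call`, `estimate_overhead`): both callables are assumed to go through the same call path, with `full` doing the real work and `noop` doing only the copy from input to output. The time of the no-op variant is then an estimate of the overhead, and the difference an estimate of the computation itself.

    ```cpp
    #include <algorithm>
    #include <chrono>

    struct OverheadEstimate {
        double full_seconds;     // mean time per call of the real operation
        double noop_seconds;     // mean time per call of the copy-only variant
        double compute_seconds;  // full minus no-op, clamped at zero
    };

    // Run f() `iters` times and return the mean wall-clock seconds per call.
    template <typename F>
    double seconds_per_call(F&& f, int iters) {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) f();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        return elapsed.count() / iters;
    }

    // Time the real operation and its no-op twin over the same iteration count.
    template <typename Full, typename Noop>
    OverheadEstimate estimate_overhead(Full&& full, Noop&& noop, int iters) {
        double t_full = seconds_per_call(full, iters);
        double t_noop = seconds_per_call(noop, iters);
        // Timing noise can make the difference slightly negative for tiny
        // workloads, so it is clamped at zero.
        return {t_full, t_noop, std::max(0.0, t_full - t_noop)};
    }
    ```

    For a GPU measurement, `full` would wrap the upload, the kernel, the download and a synchronization, and `noop` the same sequence with the copy-only kernel; `noop_seconds` is then the part of each call that is not spent on the multiplication.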

