I'm writing a function that does a lot of BLAS gemv operations.
I would like to be able to do this on the GPU, and I've tried with cuBLAS.
My problem is that my matrices and vectors are rather small: a 100x100 matrix and a 100-element vector. cuBLAS takes ages compared to the CPU, and I can see why: a mixture of fast cache on the CPU and a large overhead on each call to the GPU.
Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.
That is, the time it takes CUDA to set up the call and send it to the graphics processor, not counting the time it actually takes to do the matrix-vector multiplication.
How would I go about doing this?
Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.
The overhead is not the call per se but the compilation of the GPU program and the transfer of data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.
    N          FFTw (ms)   GPUFFT (ms)   GPUFFT MFLOPS   GPUFFT Speedup
    8          0           0.06          3.352705        0.006881
    16         0.001       0.065         7.882117        0.010217
    32         0.001       0.075         17.10887        0.014695
    64         0.002       0.085         36.080118       0.026744
    128        0.004       0.093         76.724324       0.040122
    256        0.007       0.107         153.739856      0.066754
    512        0.015       0.115         320.200892      0.134614
    1024       0.034       0.125         657.735381      0.270512
    2048       0.076       0.156         1155.151507     0.484331
    4096       0.173       0.215         1834.212989     0.804558
    8192       0.483       0.32          2664.042421     1.510011
    16384      1.363       0.605         3035.4551       2.255411
    32768      3.168       1.14          3450.455808     2.780041
    65536      8.694       2.464         3404.628083     3.528726
    131072     15.363      5.027         3545.850483     3.05604
    262144     33.223      12.513        3016.885246     2.655183
    524288     72.918      25.879        3079.443664     2.817667
    1048576    173.043     76.537        2192.056517     2.260904
    2097152    331.553     157.427       2238.01491      2.106081
    4194304    801.544     430.518       1715.573229     1.861814
The table above shows timings of a GPU FFT implementation vs a CPU implementation as a function of kernel size. For smaller sizes, the transfer of data to/from the GPU dominates: the GPU only overtakes the CPU (speedup above 1) from N = 8192 upwards. Smaller kernels can be performed on the CPU, and for some implementations and sizes entirely in cache. This makes the CPU the best choice for small operations.
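The point where the GPU starts to win can also be read off the table programmatically. The snippet below is only a sketch: it copies a few rows from the table above and assumes the speedup column is simply CPU time divided by GPU time, which matches the published values up to rounding. The names `ROWS`, `speedup` and `crossover` are made up for illustration.

```python
# (N, FFTw ms, GPUFFT ms) rows copied from the table above.
ROWS = [
    (1024, 0.034, 0.125),
    (2048, 0.076, 0.156),
    (4096, 0.173, 0.215),
    (8192, 0.483, 0.32),
    (16384, 1.363, 0.605),
    (32768, 3.168, 1.14),
]


def speedup(cpu_ms, gpu_ms):
    """Assumed definition: how many times faster the GPU is than the CPU."""
    return cpu_ms / gpu_ms


def crossover(rows):
    """Return the first N at which the GPU beats the CPU, or None."""
    for n, cpu_ms, gpu_ms in rows:
        if speedup(cpu_ms, gpu_ms) > 1.0:
            return n
    return None


if __name__ == "__main__":
    print(crossover(ROWS))  # prints 8192
```

Running the same check on your own gemv timings should show whether a 100x100 problem sits below or above the crossover on your hardware.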
If, on the other hand, you need to perform large batches of work on data with minimal moves to/from the GPU, then the GPU will beat the CPU hands down.
As for measuring the effect in your example, I would suggest performing an experiment like the one above. Work out the FLOPS achieved for each matrix size, and run the test on the CPU and GPU for varying sizes of matrix. Output the size, time and FLOPS for GPU vs CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code and time the whole thing, then divide the total time by the number of iterations to get the time per iteration. Also try differently shaped matrices if your algorithm allows (e.g. 10x100 rather than 100x10).
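A minimal sketch of such a harness, CPU side only, might look like the following. It is not cuBLAS code: `gemv` here is a naive pure-Python stand-in, and `time_per_call`, `benchmark` and `write_csv` are hypothetical helper names. The FLOP count assumes the usual convention of one multiply and one add per matrix entry. For the GPU column you would swap the timed callable for your cuBLAS call, making sure the device has finished before the clock is stopped.

```python
import csv
import sys
import time


def gemv(a, x):
    """Naive matrix-vector product y = A x on plain Python lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def time_per_call(fn, iterations=300):
    """Time `iterations` back-to-back calls; return mean seconds per call."""
    fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def benchmark(shapes, iterations=300):
    """Return one (rows, cols, seconds_per_call, mflops) tuple per shape."""
    results = []
    for rows, cols in shapes:
        a = [[1.0] * cols for _ in range(rows)]
        x = [1.0] * cols
        seconds = time_per_call(lambda: gemv(a, x), iterations)
        flop = 2 * rows * cols  # assumed: one multiply + one add per entry
        results.append((rows, cols, seconds, flop / seconds / 1e6))
    return results


def write_csv(results, out=sys.stdout):
    """Write the benchmark rows as CSV with a header line."""
    writer = csv.writer(out)
    writer.writerow(["rows", "cols", "seconds_per_call", "mflops"])
    writer.writerows(results)


if __name__ == "__main__":
    write_csv(benchmark([(10, 100), (100, 10), (100, 100)]))
```

Timing the whole loop and dividing by the iteration count keeps the timer's own resolution and cost out of the per-call figure, which matters when a single call takes only microseconds.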
Using this data you can get a feel for what the overheads are. To find out exactly, repeat the same experiment but replace the inner shader code executed on the GPU with a no-operation (simply copy from input to output).
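The no-op substitution can be sketched as follows. Everything here is a CPU-only stand-in and purely hypothetical: `offload` mimics a GPU call with list copies in place of the host/device transfers, and `real_kernel` and `noop_kernel` play the role of the shader body. The idea carries over unchanged, though: time the full call, time the same call with a copy-only kernel, and treat the difference as the compute cost.

```python
import time


def time_per_call(fn, iterations=300):
    """Time `iterations` back-to-back calls; return mean seconds per call."""
    fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def offload(kernel, data):
    """Hypothetical stand-in for a GPU call: copy in, run kernel, copy out."""
    device_in = list(data)          # stands in for the host-to-device copy
    device_out = kernel(device_in)  # stands in for the shader / kernel body
    return list(device_out)         # stands in for the device-to-host copy


def real_kernel(values):
    """Some arithmetic per element, standing in for the real work."""
    return [v * v + 1.0 for v in values]


def noop_kernel(values):
    """Copy input to output without doing any arithmetic."""
    return list(values)


def split_overhead(data, iterations=300):
    """Return (total, overhead, compute) mean seconds per call."""
    total = time_per_call(lambda: offload(real_kernel, data), iterations)
    overhead = time_per_call(lambda: offload(noop_kernel, data), iterations)
    return total, overhead, total - overhead


if __name__ == "__main__":
    total, overhead, compute = split_overhead([1.0] * 100)
    print(f"total={total:.3e}s overhead={overhead:.3e}s compute={compute:.3e}s")
```

For very small inputs the two timings can be close enough that measurement noise dominates the difference, so it is worth repeating the run a few times before trusting the split.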
Hope this helps,