Suppose a matrix addition application is implemented as a hybrid CPU-GPU computation in CUDA, using pthreads so that each thread performs a partial matrix addition on the host CPU while the GPU handles the rest. For instance, if the matrix size is 1000, the first 500 elements are computed by the host CPU and the remaining 500 by the GPU; the computation is split between the two. Is this faster than a CPU-only or a GPU-only computation? Please help me understand this concept.
Is there a profiling tool that can help compare the performance of these three approaches (CPU-only, GPU-only, and hybrid)? I'm new to CUDA, so any help/guidance will be appreciated.
Thank you!
The problem with CPU-GPU hybrid computations where you need the result back on the CPU is the latency between the two. If you start a computation on the GPU and wait for the result on the CPU, there can easily be several milliseconds of delay between launching the kernel and getting the result back, so the amount of work done on the GPU should be significant, or the CPU should have a significant amount of its own work to do between launching the GPU computation and reading back the result. A 1000-element matrix addition is a tiny amount of work, so you would be better off performing the entire computation on the CPU. You also pay for transferring the data back and forth between the CPU and GPU across the PCI bus, which adds to the overhead, so the hybrid approach favors computations that transfer little data relative to the amount of arithmetic performed on it.
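Here is a minimal sketch of how you could time the three variants yourself, using plain `std::chrono` wall-clock timing. The size `N`, the block size of 256, and running the CPU half on the calling thread (rather than on separate pthreads) are all arbitrary choices for illustration, not a tuned benchmark. For a finer breakdown of kernel time versus transfer time, NVIDIA's profilers (`nvprof`, or Nsight Systems on newer toolkits) report each `cudaMemcpy` and kernel launch separately.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

#define N (1000 * 1000)   // arbitrary size; vary it to see where the GPU starts to win

__global__ void addKernel(const float *a, const float *b, float *c, int n, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    float *a = new float[N], *b = new float[N], *c = new float[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));

    // 1. CPU only: a plain loop, no transfers at all.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) c[i] = a[i] + b[i];
    printf("CPU only: %.3f ms\n", msSince(t0));

    // 2. GPU only: the timing includes both PCIe transfers, because that is
    //    what the application actually pays if it needs the result on the host.
    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dA, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, N * sizeof(float), cudaMemcpyHostToDevice);
    addKernel<<<(N + 255) / 256, 256>>>(dA, dB, dC, N, 0);
    cudaMemcpy(c, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU only: %.3f ms\n", msSince(t0));

    // 3. Hybrid: launch the GPU half first (kernel launches are asynchronous),
    //    then do the CPU half while the GPU works.
    int half = N / 2;
    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dA + half, a + half, half * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB + half, b + half, half * sizeof(float), cudaMemcpyHostToDevice);
    addKernel<<<(half + 255) / 256, 256>>>(dA, dB, dC, N, half);
    for (int i = 0; i < half; ++i) c[i] = a[i] + b[i];   // overlaps the kernel
    cudaMemcpy(c + half, dC + half, half * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Hybrid:   %.3f ms\n", msSince(t0));

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```

For a memory-bound operation like addition, expect the CPU-only version to win at small sizes: the two host-to-device copies alone move more bytes over the PCI bus than the CPU would touch doing the whole addition itself.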
If you never need to read the result back from the GPU to the CPU, the latency issue disappears. For example, you could run an N-body simulation on the GPU and perform the visualization on the GPU as well, so the result never needs to reach the CPU. But the moment you need the simulation result back on the CPU, you have to deal with the latency again.
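As a sketch of what "never reading back" looks like in code, assuming a hypothetical `stepKernel` that advances the simulation state by one timestep: the state lives in device memory for the whole run, and no `cudaMemcpyDeviceToHost` appears anywhere in the loop.

```cuda
// Hypothetical kernel: advances the simulation state one timestep in place.
__global__ void stepKernel(float *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 0.01f;   // placeholder for the real physics
}

// dState is device memory; it never crosses the PCI bus back to the CPU.
void simulate(float *dState, int n, int steps)
{
    for (int s = 0; s < steps; ++s)
        stepKernel<<<(n + 255) / 256, 256>>>(dState, n);
    // A renderer would consume dState directly on the GPU (e.g. via
    // CUDA-OpenGL interop), so no cudaMemcpyDeviceToHost is ever needed.
    cudaDeviceSynchronize();
}
```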