Just a general question about CUBLAS. For a single thread, if there is no memory transfer from GPU to CPU (e.g. cublasGetVector), will the CUBLAS kernel functions (e.g. cublasDgemm) automatically be synchronized with the host?
Furthermore, what about between two adjacent kernel calls?
And what about a synchronous transfer that does not involve the global memory used in the previous kernel?
No. With the exception of a few Level 1 routines which return a scalar value, the CUBLAS API is asynchronous.
Level 3 routines like cublasDgemm don't block the host. To ensure that the CUBLAS call has completed, you need to call a blocking API routine, such as a synchronous memory transfer or an explicit host-GPU synchronisation call.