How to copy a matrix into a bigger matrix in CUDA

I want to set up a big matrix on my GPU and solve the corresponding system of equations with CULA.

Some numbers to make the problem concrete:

big matrix:     400x400
small matrices: 200x200

Now I want to copy every quarter (100x100) of a small matrix to a specific part of the big matrix.

I found two possible but obviously slow approaches. cublasSetMatrix and cublasGetMatrix support a leading dimension, so I could put the parts where I want them, but I would have to copy the matrix back to the host first. The other option is cudaMemcpy, which doesn't support leading dimensions. With it I could copy every single row or column by hand (I'm not yet sure which of the two is contiguous; the data comes from Fortran), but that would add a lot of overhead.

Is there a better way to copy the matrix than writing my own kernel?

Solution

You may want to revise your question. I guess you are looking for a way to change the leading dimension and do a device-to-device copy at the same time.

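There is a routine, cudaMemcpy2D(), that can do that: it takes separate source and destination pitches (the leading dimensions in bytes) and accepts cudaMemcpyDeviceToDevice as the copy kind.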