I want to set up a big matrix on my GPU so that I can solve the corresponding system of equations with CULA.
Some numbers to illustrate the problem:
big matrix: 400x400
small matrices: 200x200
Now I want to copy every quarter (100x100) of a small matrix to a specific part of the big matrix.
I found two possible but obviously slow approaches. cublasSetMatrix and cublasGetMatrix support specifying a leading dimension, so I could put the parts where I want them, but I would have to copy the matrix back to the host.
The other option would be cudaMemcpy, which doesn't support leading dimensions. Here I could copy every single row/column by hand (at the moment I am unsure which layout this routine assumes; the data comes from Fortran). But I would expect a large overhead this way...
Is there a better way to copy the matrix than writing my own kernel?
You may want to revise your question. I guess you are looking for a way to do a device-to-device copy that also changes the leading dimension.
The routine cudaMemcpy2D() can do that.
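A minimal sketch of how that could look for the sizes in the question, assuming double precision and Fortran-style column-major storage. The helper name copyBlockD2D and the destination offsets are made up for illustration; only the cudaMemcpy2D() call itself is the library routine. In column-major storage each column of a block is one contiguous run, so the "width" passed to cudaMemcpy2D() is the block's row count in bytes, the "height" is its column count, and the pitches are the leading dimensions in bytes.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copy a rows x cols block between two column-major
// device matrices. Offsets are zero-based; ldDst/ldSrc are leading
// dimensions in elements.
cudaError_t copyBlockD2D(double* dst, int ldDst, int dstRow, int dstCol,
                         const double* src, int ldSrc, int srcRow, int srcCol,
                         int rows, int cols)
{
    double* d = dst + dstRow + (size_t)dstCol * ldDst;
    const double* s = src + srcRow + (size_t)srcCol * ldSrc;
    return cudaMemcpy2D(d, ldDst * sizeof(double),   // destination pitch in bytes
                        s, ldSrc * sizeof(double),   // source pitch in bytes
                        rows * sizeof(double),       // bytes per block column
                        cols,                        // number of block columns
                        cudaMemcpyDeviceToDevice);
}

int main()
{
    const int nS = 200, nB = 400, q = 100;           // sizes from the question
    std::vector<double> hS((size_t)nS * nS), hB((size_t)nB * nB);
    for (int j = 0; j < nS; ++j)
        for (int i = 0; i < nS; ++i)
            hS[i + (size_t)j * nS] = i + 1000.0 * j; // recognizable test values

    double *dS = nullptr, *dB = nullptr;
    assert(cudaMalloc(&dS, hS.size() * sizeof(double)) == cudaSuccess);
    assert(cudaMalloc(&dB, hB.size() * sizeof(double)) == cudaSuccess);
    cudaMemcpy(dS, hS.data(), hS.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(dB, 0, hB.size() * sizeof(double));

    // Assumed placement, for illustration only: quarter (bi, bj) of the
    // small matrix goes to offset (200*bi, 200*bj) in the big matrix.
    for (int bj = 0; bj < 2; ++bj)
        for (int bi = 0; bi < 2; ++bi)
            assert(copyBlockD2D(dB, nB, 2 * q * bi, 2 * q * bj,
                                dS, nS, q * bi, q * bj, q, q) == cudaSuccess);

    // Copy back only to check the result; not needed in the real workflow.
    cudaMemcpy(hB.data(), dB, hB.size() * sizeof(double), cudaMemcpyDeviceToHost);
    assert(hB[205 + (size_t)207 * nB] == hS[105 + (size_t)107 * nS]);
    assert(hB[150 + (size_t)150 * nB] == 0.0);       // outside every copied block
    std::printf("blocks copied\n");

    cudaFree(dS);
    cudaFree(dB);
    return 0;
}
```

Each quarter is moved with a single call and the data never leaves the device, so neither the host round trip nor a loop over single columns is needed.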