I am writing code that does calculations with thousands of sparse matrices on the GPU using cuSPARSE. Because GPU memory is limited, I need to process them one at a time, as the rest of the memory is taken up by other GPU variables and dense matrices.
My workflow (in pseudo-code) is the following:
for (i = 0; i < 1000; i++) {
    // allocate sparse matrix using cudaMalloc
    // copy sparse matrix from host using cudaMemcpy
    // do calculation by calling cuSPARSE
    // deallocate sparse matrix with cudaFree
}
In the above, I allocate and free the memory for each sparse matrix in each step because their sparsities vary and therefore the memory needed by each one varies.
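For concreteness, a minimal sketch of that per-iteration pattern for a CSR matrix might look like the following (the host-side arrays h_rowPtr[i], h_colInd[i], h_val[i], the nonzero counts h_nnz[i] and the row count nrows are placeholder names, and the actual cuSPARSE computation is left as a comment):

for (int i = 0; i < 1000; i++) {
    int    *d_rowPtr, *d_colInd;
    double *d_val;
    size_t nnz = h_nnz[i];  // nonzeros of matrix i

    // allocate the three CSR arrays for this matrix
    cudaMalloc(&d_rowPtr, (nrows + 1) * sizeof(int));
    cudaMalloc(&d_colInd, nnz * sizeof(int));
    cudaMalloc(&d_val,    nnz * sizeof(double));

    // copy the matrix from host to device
    cudaMemcpy(d_rowPtr, h_rowPtr[i], (nrows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colInd, h_colInd[i], nnz * sizeof(int),         cudaMemcpyHostToDevice);
    cudaMemcpy(d_val,    h_val[i],    nnz * sizeof(double),      cudaMemcpyHostToDevice);

    // do calculation by calling cuSPARSE on d_rowPtr / d_colInd / d_val

    // deallocate this matrix before the next iteration
    cudaFree(d_rowPtr);
    cudaFree(d_colInd);
    cudaFree(d_val);
}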
Can I instead do something like:
// allocate the buffer once at the beginning using cudaMalloc, with enough extra space
// that even the sparse matrix with the highest density would fit
for (i = 0; i < 1000; i++) {
    // copy sparse matrix from host using cudaMemcpy to the same buffer
    // do calculation by calling cuSPARSE
}
// free the buffer once at the end using cudaFree
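A matching sketch of this proposed version, where maxNnz is a placeholder for the largest nonzero count over all 1000 matrices:

// allocate once, sized so even the densest matrix fits
int    *d_rowPtr, *d_colInd;
double *d_val;
cudaMalloc(&d_rowPtr, (nrows + 1) * sizeof(int));
cudaMalloc(&d_colInd, maxNnz * sizeof(int));
cudaMalloc(&d_val,    maxNnz * sizeof(double));

for (int i = 0; i < 1000; i++) {
    size_t nnz = h_nnz[i];  // actual nonzeros of matrix i

    // copy matrix i into the same buffers each iteration
    cudaMemcpy(d_rowPtr, h_rowPtr[i], (nrows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colInd, h_colInd[i], nnz * sizeof(int),         cudaMemcpyHostToDevice);
    cudaMemcpy(d_val,    h_val[i],    nnz * sizeof(double),      cudaMemcpyHostToDevice);

    // do calculation by calling cuSPARSE, passing nnz so only the valid part of the buffers is used
}

// free the buffers once at the end
cudaFree(d_rowPtr);
cudaFree(d_colInd);
cudaFree(d_val);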
The above avoids having to malloc and free the buffer in each iteration. Would the above work? Would it improve performance? Is it good practice or is there a better way to do this?
The above avoids having to malloc and free the buffer in each iteration. Would the above work?
In principle, yes.
Would it improve performance?
Probably. Memory allocation and deallocation are not free operations: cudaMalloc and cudaFree each carry measurable latency, and cudaFree can implicitly synchronize the device, so eliminating a thousand allocate/free cycles from the loop removes real overhead.
Is it good practice or is there a better way to do this?
Generally speaking, yes. Many widely used GPU-accelerated frameworks (TensorFlow, for example) pre-allocate and reuse device memory in exactly this way to reduce the cost of memory management on the GPU. Whether there is a measurable benefit for your use case is something you will need to test for yourself.
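One straightforward way to test it is to time each variant of the loop end to end with CUDA events and compare; a minimal sketch (the loop itself goes where the comment is):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... run one variant of the loop: per-iteration cudaMalloc/cudaFree, or the reused buffer ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until everything before 'stop' has finished
cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in milliseconds
printf("loop took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);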