
Reset the values of a CUDA Array


When we have a linear array d_A in CUDA, one can reset all of its values to 0 by simply

cudaMemset(d_A, 0, K*K*sizeof(unsigned int));

and this works very fast (presumably at the rate of the global memory bandwidth). If we instead have a CUDA array cu_A that is needed for texture/surface memory, I could not find an equivalent function to reset its values, so I have tried two workarounds: 1) keep a linear array d_A full of zeros and copy it to the CUDA array:

cudaMemcpyToArray(cu_A, 0, 0, d_A, K*K*sizeof(unsigned int), cudaMemcpyHostToDevice);

I found that the speed of this copy is about 10% of my global memory bandwidth, which is a bit underwhelming. Then I tried option 2): keep a second CUDA array cu_B with zeros pre-copied into it, and copy that to the main CUDA array instead. Here is a minimal working example:

#include "mex.h"
#include "gpu/mxGPUArray.h"
#define K 4096 // data dimension

void mexFunction(int nlhs, mxArray *plhs[],
        int nrhs, mxArray const *prhs[])
{
    mxInitGPU();    
    // Declare the density field
    mwSize const Asize[] = { K, K };
    mxGPUArray *A = mxGPUCreateGPUArray(2, Asize, mxUINT32_CLASS, mxREAL, MX_GPU_INITIALIZE_VALUES); // initialized to zeros
    unsigned int *d_A = (unsigned int *)(mxGPUGetData(A));

    // Allocate CUDA arrays in device memory
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* cu_A;
    cudaArray* cu_B;
    cudaMallocArray(&cu_A, &channelDesc, K, K, cudaArraySurfaceLoadStore);
    cudaMallocArray(&cu_B, &channelDesc, K, K, cudaArraySurfaceLoadStore);

    /* Store the blank CUDA array here */
    cudaMemcpyToArray(cu_B, 0, 0, d_A, K*K*sizeof(unsigned int), cudaMemcpyHostToDevice);

    for (int timeStep = 0; timeStep<1000; timeStep++) {
        cudaMemcpyArrayToArray ( cu_A, 0, 0, cu_B, 0, 0, K*K*sizeof(unsigned int), cudaMemcpyDeviceToDevice ); // Reset the working memory
    }

    mxGPUDestroyGPUArray(A);
    cudaFreeArray(cu_A);
    cudaFreeArray(cu_B);
}

To my dismay, this array-to-array copy is running at a rate of merely

(4096*4096 elements)*(1000 iterations)*(4 bytes)/(measured 9.6 s) ≈ 7 GB/s

out of the 288 GB/s that my Quadro P5000 should be capable of.

Do these figures make sense and is there a faster way to reset a CUDA array?


Solution

  • If we remove MATLAB from the scenario, things appear to work about the way I would expect.

    First of all, this line is incorrect, and I would always recommend using proper CUDA error checking:

    cudaMemcpyToArray(cu_B, 0, 0, d_A, K*K*sizeof(unsigned int), cudaMemcpyHostToDevice);
    

    This is copying from a device pointer (d_A) to a cudaArray pointer (cu_B), which is also on the device. Therefore the correct transfer direction is cudaMemcpyDeviceToDevice.
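
    For clarity, the corrected call differs only in the direction flag:

    cudaMemcpyToArray(cu_B, 0, 0, d_A, K*K*sizeof(unsigned int), cudaMemcpyDeviceToDevice);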

    If I craft a version of your code without the MATLAB harness and run it under nvprof, I see timings that are approximately in line with what I would expect. Here is a complete sample, following yours, that does not depend on MATLAB:

    $ cat t444.cu
    #include <stdio.h>
    
    
    int main(){
        int K = 4096;
        cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindUnsigned);
        cudaArray* cu_A;
        cudaArray* cu_B;
        cudaMallocArray(&cu_A, &channelDesc, K, K, cudaArraySurfaceLoadStore);
        cudaMallocArray(&cu_B, &channelDesc, K, K, cudaArraySurfaceLoadStore);
        unsigned int *d_A;
        cudaMalloc(&d_A, K*K*sizeof(unsigned int));
        /* Store the blank CUDA array here */
        cudaMemcpyToArray(cu_B, 0, 0, d_A, K*K*sizeof(unsigned int), cudaMemcpyDeviceToDevice);
    
        for (int timeStep = 0; timeStep<10; timeStep++) {
            cudaMemcpyArrayToArray ( cu_A, 0, 0, cu_B, 0, 0, K*K*sizeof(unsigned int), cudaMemcpyDeviceToDevice ); // Reset the working memory
        }
        cudaDeviceSynchronize();
    }
    

    When I compile the code and run it on a Pascal Titan X device (similar to a Quadro P5000), I get the following profiler output:

    $ nvprof --print-gpu-trace ./t444
    ==16315== NVPROF is profiling process 16315, command: ./t444
    ==16315== Profiling application: ./t444
    ==16315== Profiling result:
       Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
    520.69ms  455.13us                    -               -         -         -         -  64.000MB  137.32GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    521.15ms  453.98us                    -               -         -         -         -  64.000MB  137.67GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    521.60ms  453.63us                    -               -         -         -         -  64.000MB  137.78GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    524.36ms  453.82us                    -               -         -         -         -  64.000MB  137.72GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    524.82ms  453.69us                    -               -         -         -         -  64.000MB  137.76GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    527.57ms  453.85us                    -               -         -         -         -  64.000MB  137.71GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    528.03ms  453.69us                    -               -         -         -         -  64.000MB  137.76GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    530.78ms  453.91us                    -               -         -         -         -  64.000MB  137.69GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    531.24ms  453.69us                    -               -         -         -         -  64.000MB  137.76GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    533.99ms  453.72us                    -               -         -         -         -  64.000MB  137.75GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    534.44ms  453.50us                    -               -         -         -         -  64.000MB  137.82GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    537.20ms  453.82us                    -               -         -         -         -  64.000MB  137.72GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    537.65ms  453.59us                    -               -         -         -         -  64.000MB  137.79GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    540.42ms  453.91us                    -               -         -         -         -  64.000MB  137.69GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    540.88ms  453.59us                    -               -         -         -         -  64.000MB  137.79GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    543.63ms  453.82us                    -               -         -         -         -  64.000MB  137.72GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    544.09ms  453.56us                    -               -         -         -         -  64.000MB  137.80GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    547.66ms  453.79us                    -               -         -         -         -  64.000MB  137.73GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    548.11ms  453.53us                    -               -         -         -         -  64.000MB  137.81GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    550.87ms  453.88us                    -               -         -         -         -  64.000MB  137.70GB/s       Array      Device  TITAN X (Pascal         1         7  [CUDA memcpy AtoD]
    551.32ms  453.56us                    -               -         -         -         -  64.000MB  137.80GB/s      Device       Array  TITAN X (Pascal         1         7  [CUDA memcpy DtoA]
    

    So there are a few observations:

    1. the properly specified cudaMemcpyToArray operation runs at approximately the device's global memory bandwidth. The measured throughput (bytes transferred/time) is ~140 GB/s, and since the operation involves one read and one write per byte, the delivered bandwidth is ~280 GB/s. This is probably your best choice for this array initialization operation (see the sketch after this list).

    2. the cudaMemcpyArrayToArray operation is decomposed by the CUDA runtime into 2 separate operations: a copy from the array to a linear buffer, followed by a copy from the linear buffer back to the array. Therefore I would expect this option to run at half the speed of option 1 above.
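
    As a rough sketch of option 1 applied to your reset-every-time-step use case: zero a linear staging buffer once with cudaMemset, then copy it into the cudaArray each iteration with a device-to-device cudaMemcpyToArray. The name d_zeros and the 1000-iteration loop are illustrative, following your code's conventions; treat this as an outline under those assumptions, not a drop-in replacement:

    #include <cuda_runtime.h>

    int main(){
        const int K = 4096;
        cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindUnsigned);
        cudaArray* cu_A;
        cudaMallocArray(&cu_A, &channelDesc, K, K, cudaArraySurfaceLoadStore);

        // One-time setup: a linear device buffer holding zeros (illustrative name).
        unsigned int *d_zeros;
        cudaMalloc(&d_zeros, K*K*sizeof(unsigned int));
        cudaMemset(d_zeros, 0, K*K*sizeof(unsigned int));

        for (int timeStep = 0; timeStep < 1000; timeStep++) {
            // Single device-to-array copy per reset; this runs near full device
            // bandwidth, unlike cudaMemcpyArrayToArray, which the runtime splits
            // into two copies.
            cudaMemcpyToArray(cu_A, 0, 0, d_zeros, K*K*sizeof(unsigned int), cudaMemcpyDeviceToDevice);
            // ... texture/surface work on cu_A would go here ...
        }
        cudaDeviceSynchronize();
        cudaFree(d_zeros);
        cudaFreeArray(cu_A);
        return 0;
    }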

    I would expect that if you profiled your code, you would find these operations running at about this rate. The remainder of your 9.6 s measurement is probably spent on operations other than these transfers, such as MATLAB overhead, CUDA initialization overhead, and the other CUDA runtime calls in your code.