I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, (sizeof(int) *32));
cudaMemcpyAsync(aGlobalPayloads, aPayloads, (sizeof(int)*32), cudaMemcpyDeviceToDevice));
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
// Access results in payload array here
Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads
after kernel execution? (I tried cudaMemcpy()
to copy aGlobalPayloads
back to aPayloads
but cudaMemcpy()
is not allowed in device code).
You can directly access the data in aGlobalPayloads
from your parent kernel code, without any copying:
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
int myval = aGlobalPayloads[0];
I'd encourage careful error checking (Read the whole accepted answer here). You do it in device code the same way as in host code. The programming guide states: "May not pass in local or shared memory pointers". Your usage of aPayloads
is a local memory pointer.
If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy
for that:
memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
int myval = aPayloads[0]; // retrieves the same value
(that is also how I would fix the issue I mention in item 2 - use in-kernel memcpy
)