memory-management cuda dynamic-parallelism

CUDA dynamic parallelism: Access child kernel results in global memory

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:

int aPayloads[32];
// Compute aPayloads start values here

int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, (sizeof(int) *32));
cudaMemcpyAsync(aGlobalPayloads, aPayloads, (sizeof(int)*32), cudaMemcpyDeviceToDevice));

mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();

// Access results in payload array here

Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads after kernel execution? (I tried cudaMemcpy() to copy aGlobalPayloads back to aPayloads but cudaMemcpy() is not allowed in device code).

Solution

You can directly access the data in aGlobalPayloads from your parent kernel code, without any copying:

mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
int myval = aGlobalPayloads[0];

I'd encourage careful error checking (Read the whole accepted answer here). You do it in device code the same way as in host code. The programming guide states: "May not pass in local or shared memory pointers". Your usage of aPayloads is a local memory pointer.
If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy for that:
```
memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
int myval = aPayloads[0]; // retrieves the same value
```
(that is also how I would fix the issue I mention in item 2 - use in-kernel memcpy)