c++ cuda dynamic-arrays

Defining dynamically allocated global device array in CUDA


So I want a device array that is dynamically allocated based on some host variable. This array should be visible to all device functions hence the global scope.

I decided to make use of cudaMemcpyToSymbol by defining a global device pointer and copying a value into it later from the host code.

const int k = 10;
__device__ int *divi;

void func()
{
   int data[k] = {1,2,3}; //so the first three values are 1,2,3 and the rest are zero
   cudaMemcpyToSymbol("divi", data, k*sizeof(int), 0, cudaMemcpyHostToDevice);
}

__global__ void kernel()
{
   //do something with the divi array
}

int main()
{
    func();
    kernel<<<1,128>>>();
}

Checking the CUDA error codes shows that the memory copy operation fails. Am I doing something wrong here?


Solution

  • The problem is that you haven't allocated memory for the device buffer before copying into it.
    cudaMalloc can be used for that.
    You also need to free the device buffer eventually, with a matching call to cudaFree.

    Another point: you can avoid the global device pointer (and size) entirely by passing them to the kernel as arguments.

    The code below demonstrates this:

    #include <stdio.h>
    #include "cuda_runtime.h"
    
    
    __global__ void kernel(int * pDataDevice, size_t dataSize)
    {
        // Print just for a debug test, assuming array has at least 5 elements:
    // (Note that in your final code it is better to avoid printing from within a kernel.)
        printf("thread %d, data: %d %d %d %d %d ...\n", threadIdx.x, pDataDevice[0], pDataDevice[1], pDataDevice[2], pDataDevice[3], pDataDevice[4]);
    
        // Do something with the pDataDevice array
    }
    
    
    int main()
    {
        const size_t k = 10;
        int dataHost[k] = { 1,2,3 }; // The first three values are 1,2,3 and the rest are zero
        int* pDataDevice = nullptr;
    
        // Allocate device buffer and copy into it:
        cudaMalloc(&pDataDevice, k * sizeof(dataHost[0]));  // TODO: Check return code and handle errors.
        cudaMemcpy(pDataDevice, dataHost, k * sizeof(dataHost[0]), cudaMemcpyHostToDevice); // TODO: Check return code and handle errors.
    
        // Run the kernel:
        kernel<<<1, 128>>>(pDataDevice, k);
    
        // Free device buffer:
        cudaFree(pDataDevice);      // TODO: Check return code and handle errors.
    }
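
    If you do want the array reachable through a global `__device__` pointer, as in the question, the usual pattern is to cudaMalloc a buffer from the host and then copy the pointer *value* (not the array contents) into the symbol. A sketch of that approach (not the code above, and untested here):

```cuda
#include "cuda_runtime.h"

__device__ int *divi;   // global device pointer, visible to all kernels

void func(const int *dataHost, size_t count)
{
    int *pDevice = nullptr;
    cudaMalloc(&pDevice, count * sizeof(int));          // TODO: Check return code.
    cudaMemcpy(pDevice, dataHost, count * sizeof(int),
               cudaMemcpyHostToDevice);                 // TODO: Check return code.

    // Copy the pointer VALUE into the __device__ symbol.
    // Pass the symbol itself, not the string "divi" - the string
    // overload was removed in CUDA 5.0, which is one reason your
    // original call fails.
    cudaMemcpyToSymbol(divi, &pDevice, sizeof(pDevice)); // TODO: Check return code.
}
```

    After this, kernels can use divi directly; the buffer still has to be freed with cudaFree(pDevice) from the host when you are done.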
    

    All the CUDA runtime functions return a cudaError_t status code, which I ignore in the code above to keep things simple.
    In your actual code you should check it and handle errors accordingly.
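
    One common way to do that checking is a small wrapper macro (the name CUDA_CHECK here is arbitrary, just a typical convention):

```cuda
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Wrap every CUDA runtime call; on failure, print the error string
// with the source location and abort.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage, e.g.:
//   CUDA_CHECK(cudaMalloc(&pDataDevice, k * sizeof(int)));
//   CUDA_CHECK(cudaFree(pDataDevice));
```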