
Getting pointers to specific elements of a 1D contiguous array on the device


I am trying to use CUBLAS in C++ to rewrite a Python/TensorFlow script that operates on batches of input samples (of shape BxD, where B is the batch size and D is the depth of the flattened 2D matrix).

As a first step, I decided to use the CUBLAS routine cublasSgemmBatched to compute matrix multiplication for batches of matrices.

I've found a couple of working code samples, such as the one in the linked question, but what I want is to allocate one big contiguous device array to store batches of flattened, identically shaped matrices. I do NOT want to store the batches separated from each other in device memory (as they are in the sample code from the linked StackOverflow question).

From what I can imagine, I somehow have to get a list of pointers to the starting element of each batch in device memory, something like this:

float **device_batch_ptr;
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
for(int i = 0 ; i < batch_size; i++ ) {
    // set device_batch_ptr[i] to starting point of i'th batch on device memory array.
}

Note that cublasSgemmBatched expects a float** in which each float* points to the starting element of one batch of the given input matrix.

Any advice and suggestions will be greatly appreciated.


Solution

  • If your arrays are in contiguous linear memory (device_array), then all you need to do is calculate the offsets using standard pointer arithmetic and store the device addresses in a host array, which you then copy to the device. Something like:

    float** device_batch_ptr;
    float** h_device_batch_ptr = new float*[batch_size];
    
    cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
    size_t nelementsperarray = N * N;
    for(int i = 0 ; i < batch_size; i++ ) {
        // set h_device_batch_ptr[i] to the starting point of the i'th batch
        // within the contiguous device memory array
        h_device_batch_ptr[i] = device_array + i * nelementsperarray;
    }
    cudaMemcpy(device_batch_ptr, h_device_batch_ptr, batch_size*sizeof(float *),
                cudaMemcpyHostToDevice);
    

    [Obviously never compiled or tested, use at own risk]
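    The pointer arithmetic itself is identical on host and device memory, so the offset scheme can be sanity-checked on the CPU. Below is a minimal, compilable sketch of that scheme using host memory in place of the `device_array` allocation; the names `batch_size`, `N`, and `nelementsperarray` mirror the snippet above, and the concrete sizes (4 batches of 3x3 matrices) are made up for illustration.

    ```cpp
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical sizes: 4 batches of flattened 3x3 matrices.
        const int batch_size = 4;
        const int N = 3;
        const size_t nelementsperarray = N * N;

        // Stand-in for the single contiguous device allocation,
        // filled with 0, 1, 2, ... so batch starts are easy to spot.
        std::vector<float> array(batch_size * nelementsperarray);
        for (size_t i = 0; i < array.size(); i++) array[i] = (float)i;

        // Build the pointer list exactly as in the answer: base + i * stride.
        std::vector<float*> batch_ptr(batch_size);
        for (int i = 0; i < batch_size; i++)
            batch_ptr[i] = array.data() + i * nelementsperarray;

        // Each pointer lands on the first element of its batch.
        for (int i = 0; i < batch_size; i++)
            printf("batch %d starts at value %.0f\n", i, *batch_ptr[i]);
        return 0;
    }
    ```

    In the real CUDA version, `array.data()` is replaced by the device pointer returned by cudaMalloc, and the `batch_ptr` host array is copied to the device with cudaMemcpy before being handed to cublasSgemmBatched.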