Getting pointers to specific elements of a 1D contiguous array on the device

I am trying to use cuBLAS from C++ to rewrite a Python/TensorFlow script that operates on batches of input samples of shape BxD (B: batch size, D: depth of the flattened 2D matrix).

As a first step, I decided to use cuBLAS's cublasSgemmBatched to compute the MatMul for a batch of matrices.

I've found a couple of working code samples, such as the one in the linked question, but what I want is to allocate one big contiguous device array that stores the whole batch of flattened, identically shaped matrices. I DO NOT want to store the batch entries as separate allocations in device memory (as the sample code in the linked StackOverflow question does).

As far as I can tell, I somehow need a list of pointers to the first element of each batch entry in device memory, something like this:

float **device_batch_ptr;
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
for(int i = 0 ; i < batch_size; i++ ) {
    // set device_batch_ptr[i] to starting point of i'th batch on device memory array.
}

Note that cublasSgemmBatched takes a float** for each operand, in which each float* points to the first element of one matrix of the batch.
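To make the float** requirement concrete, here is a host-only sketch of what a batched GEMM does with those pointer arrays: each entry is used independently as the start of one matrix. It is not cuBLAS's implementation. The name sgemm_batched_reference is hypothetical, it covers only the no-transpose case, and it mirrors the argument order of cublasSgemmBatched without the handle and transpose flags. Matrices are column-major, as in cuBLAS.

```cpp
#include <cassert>

// Hypothetical host-side reference (NOT a cuBLAS function): for each batch
// entry b, compute C[b] = alpha * A[b] * B[b] + beta * C[b] with no transposes.
// A[b] is m x k, B[b] is k x n, C[b] is m x n, all stored column-major.
// Each A[b], B[b], C[b] points to the first element of one matrix, like the
// pointer arrays passed to cublasSgemmBatched.
void sgemm_batched_reference(int m, int n, int k, float alpha,
                             const float* const A[], int lda,
                             const float* const B[], int ldb,
                             float beta,
                             float* const C[], int ldc,
                             int batch_count) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int b = 0; b < batch_count; ++b) {
        const float* a = A[b];   // start of the b-th A matrix
        const float* bb = B[b];  // start of the b-th B matrix
        float* c = C[b];         // start of the b-th C matrix
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p) {
                    // Column-major: element (r, c) lives at r + c * ld.
                    acc += a[row + p * lda] * bb[p + col * ldb];
                }
                float& out = c[row + col * ldc];
                // Skip reading C when beta == 0, so C may start uninitialized.
                out = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * out;
            }
        }
    }
}
```

Because the function only ever sees the pointer arrays, it makes no difference whether the matrices come from separate allocations or from fixed offsets into one contiguous buffer.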

Any advice and suggestions will be greatly appreciated.

Solution

If your matrices are stored in contiguous linear memory (device_array), all you need to do is compute each matrix's offset with standard pointer arithmetic, store the resulting device addresses in a host array, and then copy that array to the device (cublasSgemmBatched expects the array of pointers itself to reside in device memory). Something like:

#include <cuda_runtime.h>

// device_array: one cudaMalloc'd buffer holding batch_size N x N matrices back to back
float** device_batch_ptr;
float** h_device_batch_ptr = new float*[batch_size];

cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
size_t nelementsperarray = N * N;
for(int i = 0 ; i < batch_size; i++ ) {
    // set h_device_batch_ptr[i] to starting point of i'th batch on device memory array.
    h_device_batch_ptr[i] = device_array + i * nelementsperarray;
}
cudaMemcpy(device_batch_ptr, h_device_batch_ptr, batch_size*sizeof(float *),
           cudaMemcpyHostToDevice);
delete[] h_device_batch_ptr; // the host copy is no longer needed once the table is on the device

[Obviously never compiled or tested, use at own risk]
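The offsets are plain host-side pointer arithmetic (the device addresses are computed but never dereferenced on the host), so the layout can be sanity-checked without a GPU. A minimal sketch, assuming a hypothetical helper make_batch_pointers and a std::vector<float> standing in for the cudaMalloc'd device_array:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CUDA/cuBLAS API): build the table of pointers
// to the first element of each matrix stored back to back in ONE buffer.
// In the real code `base` would be device_array; here it can be any float*.
std::vector<float*> make_batch_pointers(float* base, int batch_size,
                                        std::size_t elems_per_matrix) {
    assert(base != nullptr && batch_size >= 0);
    std::vector<float*> ptrs(static_cast<std::size_t>(batch_size));
    for (int i = 0; i < batch_size; ++i) {
        // Only computes addresses; nothing is read or written through them.
        ptrs[static_cast<std::size_t>(i)] =
            base + static_cast<std::size_t>(i) * elems_per_matrix;
    }
    return ptrs;
}
```

In the GPU version, the returned table would be copied with cudaMemcpy(device_batch_ptr, ptrs.data(), batch_size * sizeof(float*), cudaMemcpyHostToDevice), and one such table is needed per operand (A, B and C) of cublasSgemmBatched.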