
Iterate on 1D array using CUDA in multi-GPU system


I've been studying parallel programming for the last couple of months, and now I am trying to adapt my application to a multi-GPU platform. The problem is that I still do not understand very well how to iterate through an array using multiple GPUs.

Do I need to divide my main array into smaller sub-arrays and send one to each GPU, or is there a way to make each GPU iterate over a fragment of the array? I have the serial and single-GPU versions of this application working, and I have tried different methods to adapt it to multiple GPUs, but none of them returns the same results as the two previous versions. I do not know what else to try, so my conclusion is that I do not understand how to iterate through the array in a multi-GPU system. Can someone help me, please?

My code runs N iterations, and in each iteration it goes through each value in my array (which represents a grid) and calculates a new value for it.

This is a sketch of what my code looks like right now:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // ceil
#include <omp.h>    // omp_set_num_threads, omp_get_thread_num
#include <cuda.h>

#define DIM     24
#define BLOCK_SIZE 16
#define SRAND_VALUE 585

__global__ void random(int* t, int* newT){

    int iy = blockDim.y * blockIdx.y + threadIdx.y + 1;
    int ix = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int id = iy * (DIM+2) + ix;

    if (iy <= DIM && ix <= DIM) {
        if (t[id] % 2 == 0)
            newT[id] = t[id]*3;
        else
            newT[id] = t[id]*5;
    }
}

int main(int argc, char* argv[]){
    int i, j, iter, devCount;
    int *h_test, *d_test, *d_tempTest, *d_newTest;
    size_t gridBytes;

    cudaGetDeviceCount(&devCount);

    gridBytes = sizeof(int)*(DIM)*(DIM);
    h_test = (int*)malloc(gridBytes);

    srand(SRAND_VALUE);
    #pragma omp parallel for private(i,j)
        for(i = 1; i<=DIM;i++) {
            for(j = 1; j<=DIM; j++) {
                h_test[i*(DIM)+j] = rand() % 2;
            }
        }

    if (devCount == 0){
        printf("There are no devices in this machine!");
        return 1; // if there is no GPU, then break the code
    }

    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
    int  linGrid = (int)ceil(DIM/(float)BLOCK_SIZE);
    dim3 gridSize(linGrid,linGrid,1);

    dim3 cpyBlockSize(BLOCK_SIZE,1,1);
    dim3 cpyGridRowsGridSize((int)ceil(DIM/(float)cpyBlockSize.x),1,1);
    dim3 cpyGridColsGridSize((int)ceil((DIM+2)/(float)cpyBlockSize.x),1,1);

    if (devCount == 1){

        cudaMalloc(&d_test, gridBytes);
        cudaMalloc(&d_tempTest, gridBytes);
        cudaMalloc(&d_newTest, gridBytes);

        cudaMemcpy(d_test, h_test, gridBytes, cudaMemcpyHostToDevice);

        for (iter = 0; iter < DIM; iter ++){
            random<<<gridSize, blockSize>>>(d_test, d_newTest);

            d_tempTest = d_test;
            d_test = d_newTest;
            d_newTest = d_tempTest;
        }

        cudaMemcpy(h_test, d_test, gridBytes, cudaMemcpyDeviceToHost);

        return 0;
    }

    else{
        int nThreads, tId, current;
        omp_set_num_threads(devCount);

        for (iter = 0; iter < DIM; iter ++){

            #pragma omp parallel private(tId, d_test, d_tempTest, d_newTest) shared(h_test, gridBytes)
            {
                tId = omp_get_thread_num();
                cudaSetDevice(tId);

                cudaMalloc(&d_test, gridBytes);
                cudaMalloc(&d_tempTest, gridBytes);
                cudaMalloc(&d_newTest, gridBytes);

                cudaMemcpy(d_test, h_test, gridBytes, cudaMemcpyHostToDevice);

                // ****** What do I do here? ******

            } 
        }
        return 0;
    }
}

Thanks in advance.


Solution

  • The short answer: Yes, you should divide your array into subarrays for each GPU.

    Details: Each GPU has its own memory. In your code, you allocate memory for the whole array on each GPU and copy the whole array to each GPU. You could then operate on a subset of the array on each device, but when you copy back you must make sure to copy only the part that each GPU updated. A better approach is to copy to each GPU only the part of the array that it will update.

    Solution: Modify the multi-GPU part to something like the following. (You need to make sure that no elements are missed if the number of elements is not evenly divisible by devCount; my code snippet does not check this.)

    size_t elemsPerGPU = (gridBytes/sizeof(int))/devCount;  // elements handled by each GPU
    size_t gridBytesPerGPU = elemsPerGPU*sizeof(int);       // bytes handled by each GPU
    cudaMalloc(&d_test, gridBytesPerGPU);
    cudaMalloc(&d_newTest, gridBytesPerGPU);

    // copy only the part of the array that you want to use on that GPU;
    // h_test is an int*, so it is indexed in elements, not bytes
    cudaMemcpy(d_test, &h_test[tId*elemsPerGPU], gridBytesPerGPU, cudaMemcpyHostToDevice);
    // do the calculation
    cudaMemcpy(&h_test[tId*elemsPerGPU], d_newTest, gridBytesPerGPU, cudaMemcpyDeviceToHost);
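
The divisibility caveat above could be handled with a small host-side helper. The following is only a sketch: the `Slice` struct and `slice_for_device` function are hypothetical names, not part of CUDA or of the original code. It assumes the array is split into contiguous chunks and that the first few devices simply take one extra element each when the element count does not divide evenly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the slice of a 1D array owned by one GPU. */
typedef struct {
    size_t offset; /* index of the first element in the slice */
    size_t count;  /* number of elements in the slice */
} Slice;

/* Split n elements across devCount devices. The first (n % devCount)
   devices each take one extra element, so no element is dropped. */
Slice slice_for_device(size_t n, int devCount, int dev)
{
    assert(devCount > 0 && dev >= 0 && dev < devCount);

    size_t base  = n / (size_t)devCount;  /* elements every device gets */
    size_t extra = n % (size_t)devCount;  /* leftover elements to spread */
    size_t d     = (size_t)dev;

    Slice s;
    s.count  = base + (d < extra ? 1 : 0);
    /* Devices before this one hold d*base elements plus one extra each,
       for as many of them as received an extra element. */
    s.offset = d * base + (d < extra ? d : extra);
    return s;
}
```

Inside the OpenMP region, each thread could call `slice_for_device(DIM*DIM, devCount, tId)` and then allocate `s.count * sizeof(int)` bytes and copy from `h_test + s.offset`. This works without any overlap between slices only because the kernel as posted reads nothing but its own cell.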
    

    Now you only need to calculate the appropriate block and grid size; see (c) below. If you have problems with that part, please ask in the comments and I will extend this answer.

    Apart from that, there are some parts of your code that I do not understand:

    a) Why do you need to swap the pointers?

    b) You run the kernel multiple times, but the code in the for loop does not depend on the loop counter. Why? What am I missing?

    for (iter = 0; iter < DIM; iter ++){
        random<<<gridSize, blockSize>>>(d_test, d_newTest);
    
        d_tempTest = d_test;
        d_test = d_newTest;
        d_newTest = d_tempTest;
    }
    

    c) The calculation of the grid and block size for this simple kernel looks a bit complicated (I skipped it when reading your question). I would treat the problem as one-dimensional; then everything, including your kernel, becomes much simpler.
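
To illustrate the one-dimensional view in (c), here is a minimal host-side sketch in plain C. The names `blocks_for`, `update_cell` and `update_flat` are hypothetical. It assumes the grid is stored contiguously with no border cells, which differs from the `+1` offsets used in the question's kernel. The per-element rule is the one from the posted kernel.

```c
#include <stddef.h>

/* Number of blocks needed to cover n elements with threadsPerBlock threads
   each. Integer ceiling division, so no float cast or ceil() is needed. */
size_t blocks_for(size_t n, size_t threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

/* Per-element update taken from the kernel in the question. */
int update_cell(int v)
{
    return (v % 2 == 0) ? v * 3 : v * 5;
}

/* CPU reference of the flat traversal: one index per element. */
void update_flat(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = update_cell(in[i]);
}
```

In a 1D CUDA kernel, the loop index `i` would become `blockIdx.x * blockDim.x + threadIdx.x`, guarded by `if (i < n)`, and the launch would use `blocks_for(n, threadsPerBlock)` blocks. The CPU loop is also a convenient reference for checking the multi-GPU results against the serial version.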