cuda gpgpu multi-gpu

Code running on two GPUs does not execute concurrently and shows negligible speedup


I have code like this:

for(int i =0; i<2; i++)
{
    //initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image); //function that calls kernel
}

Each iteration in the above loop is independent. I want to run them concurrently. So, I tried this:

for(int i =0; i<num_devices; i++)
{
    cudaSetDevice(i);
    //initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image); //function that calls the kernels; its body is outlined in the block below
    {
        RunBasicFBP_CUDA(parameters); //function that calls kernel 1

        xSegmentMetal(parameters); //CPU function

        RunBasicFP_CUDA(parameters);  //function that uses output of kernel 1 as input for kernel 2

        for (int idx_view = 0; idx_view < param.fbp.num_view; idx_view++)
        {
            for (int idx_bin = 1; idx_bin < param.fbp.num_bin-1; idx_bin++)
            {
                sino_diff[idx_view][idx_bin] = sino_org[idx_view][idx_bin] - sino_mask[idx_view][idx_bin];
            }
        }

        RunBasicFP_CUDA(parameters);
        if(some condition)
        {
            xInterpolateSinoLinear(parameters);  //CPU function
        }
        else
        {
            xInterpolateSinoPoly(parameters);  //CPU function
        }

        RunBasicFBP_CUDA( parameters );
    }
}

I am using two GTX 680 cards and I want to use both devices concurrently. With the above code, I am not getting any speedup: the processing time is almost the same as when running on a single GPU.

How can I achieve concurrent execution on the two available devices?


Solution

  • In your comment you say:

    RunDll has two kernels and they are being launched one by one. Kernels do have cudaThreadSynchronize()

    Note that cudaThreadSynchronize() is equivalent to cudaDeviceSynchronize() (the former is deprecated). Each call blocks the host until the current GPU has finished, so your loop runs on one GPU, synchronises, and only then starts work on the other GPU. Also note that cudaMemcpy() is a blocking routine; you would need cudaMemcpyAsync(), with page-locked host memory, to avoid blocking the host (as pointed out by @JackOLantern in the comments).

    In general, you will need to post more details of what is inside RunDll(), since without them your question does not contain enough information for a definitive answer. Ideally follow these guidelines.
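Since RunDll() mixes kernels with blocking CPU stages, one possible way to overlap the two devices is to give each device its own host thread, so that a blocking call for one GPU does not hold up the other. The following is only a minimal host-side sketch of that structure, not your actual code: process_on_device() and run_on_all_devices() are hypothetical names, the CUDA calls are shown as comments so the sketch compiles without the toolkit, and the returned value is a placeholder. In real code each thread would call cudaSetDevice() first, because the current device is a per-host-thread setting.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-image pipeline (RunDll in the question).
// Real code would bind the calling thread to one GPU and then run the
// kernels and CPU stages for that image.
int process_on_device(int device)
{
    // cudaSetDevice(device);      // real code: select this thread's GPU
    // RunDll(input, output);      // real code: kernels plus CPU steps
    return device * 10;            // placeholder result for this sketch
}

// Start one host thread per device. A blocking call made for one device
// (cudaMemcpy, cudaDeviceSynchronize, or a CPU stage) then stalls only
// that device's thread, not the work queued for the other device.
std::vector<int> run_on_all_devices(int num_devices)
{
    assert(num_devices > 0);
    std::vector<int> results(num_devices);
    std::vector<std::thread> workers;
    workers.reserve(num_devices);

    for (int dev = 0; dev < num_devices; ++dev) {
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([dev, &results] {
            results[dev] = process_on_device(dev);
        });
    }
    for (std::thread& t : workers) {
        t.join();  // wait until every device's pipeline has finished
    }
    return results;
}
```

Calling run_on_all_devices(2) would correspond to your loop over num_devices, with each iteration running in its own thread. This assumes the two iterations share no host buffers; if they do, those accesses would need to be protected or the buffers duplicated per device.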