
OpenCL MultiGPU slower than single GPU


I am developing an application that performs some processing on video frame data. To accelerate it, I use 2 graphics cards and process the data with OpenCL. My idea is to send one frame to the first card and the next frame to the second card. The devices use the same context, but different command queues, kernels and memory objects.

However, it seems to me that the computations are not executed in parallel, because the time required with the 2 cards is almost the same as the time required by a single graphics card.

Does anyone have a good example of using multiple devices on independent pieces of data simultaneously?

Thanks in advance.

EDIT:

Here is the resulting code after switching to 2 separate contexts. However, the execution time with 2 graphics cards still remains the same as with 1 graphics card.

    cl::NDRange globalws(imageSize);
    cl::NDRange localws;

    for (int i = 0; i < numDevices; i++){
        // Copy the input data to the device
        commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);

        // Set kernel arguments
        kernel[i].setArg(0, inputDataBuffer[i]);
        kernel[i].setArg(1, modulusBuffer[i]);
        kernel[i].setArg(2, imagewidth);
    }

    for (int i = 0; i < numDevices; i++){
        // Run kernel
        commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
    }

    for (int i = 0; i < numDevices; i++){
        // Read the modulus back to the host
        float* modulus = new float[imageSize/4];
        commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);

        // Do something with the modulus;
    }

Solution

  • Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: you enqueue an operation and wait for it to finish, so there is no parallelization at all (or very little). This is what you are doing at the moment:

    Wr:-Copy1--Copy2--------------------
    G1:---------------RUN1--------------
    G2:---------------RUN2--------------
    Re:-------------------Read1--Read2--
    

    You should change your code to at least do it like this:

    Wr:-Copy1-Copy2-----------
    G1:------RUN1-------------
    G2:------------RUN2-------
    Re:----------Read1-Read2--
    

    With this code:

    cl::NDRange globalws(imageSize);
    cl::NDRange localws;

    for (int i = 0; i < numDevices; i++){
        // Set kernel arguments //YOU SHOULD DO THIS AT INIT STAGE, IT IS SLOW TO DO IT IN A LOOP
        kernel[i].setArg(0, inputDataBuffer[i]);
        kernel[i].setArg(1, modulusBuffer[i]);
        kernel[i].setArg(2, imagewidth);

        // Copy the input data to the device (non-blocking)
        commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
    }

    for (int i = 0; i < numDevices; i++){
        // Run kernel
        commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
    }

    std::vector<float*> modulus(numDevices);
    for (int i = 0; i < numDevices; i++){
        // Read the modulus back to the host (non-blocking)
        modulus[i] = new float[imageSize/4];
        commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
    }

    // Wait for every queue to finish before touching the results
    // (clFinish() takes a queue argument; in the C++ API it is finish() per queue)
    for (int i = 0; i < numDevices; i++){
        commandQueues[i].finish();
    }

    // Do something with the modulus
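    If you later need finer-grained synchronization than finishing whole queues, the read loop above can instead collect one cl::Event per read and wait only on those. A minimal sketch under the same assumptions (same queues, buffers and modulus array as above):

    std::vector<cl::Event> readEvents(numDevices);
    for (int i = 0; i < numDevices; i++){
        // Non-blocking read; the event signals when this read has landed
        commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i], NULL, &readEvents[i]);
    }
    cl::Event::waitForEvents(readEvents); // host blocks until all reads are done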
    

    Regarding the comments about using multiple contexts: it depends on whether the two GPUs will ever need to communicate. As long as each GPU only uses its own memory, there will be no copy overhead. But if you constantly set/unset kernel arguments, that can trigger copies to the other GPU. So, be careful with that.

    The safer approach when the GPUs do not need to communicate is to use separate contexts.
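
    For illustration, a per-device setup could look roughly like this (a minimal sketch; the platform index, kernelSource and imageSize are placeholders for your own setup):

    // Sketch (assumed names): one context, queue, program and buffer per
    // GPU, so nothing is shared between devices and no implicit
    // cross-device copies can occur.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

    std::vector<cl::Context>      contexts;
    std::vector<cl::CommandQueue> queues;
    std::vector<cl::Buffer>       inputs;

    for (size_t i = 0; i < devices.size(); i++){
        contexts.push_back(cl::Context(devices[i]));
        queues.push_back(cl::CommandQueue(contexts[i], devices[i]));

        // Each buffer belongs to exactly one context/device
        inputs.push_back(cl::Buffer(contexts[i], CL_MEM_READ_ONLY, imageSize*sizeof(float)));

        // The program must also be built once per context
        cl::Program::Sources source(1, std::make_pair(kernelSource.c_str(), kernelSource.size()));
        cl::Program program(contexts[i], source);
        program.build(std::vector<cl::Device>(1, devices[i]));
    }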


    I suspect your main problem is the memory copy, not the kernel execution. It is highly likely that 1 GPU will fulfil your needs if you hide the memory latency:

    Wr:-Copy1-Copy2-Copy3----------
    G1:------RUN1--RUN2--RUN3------
    Re:----------Read1-Read2-Read3-
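
    A sketch of that single-GPU pipeline, assuming double buffering (two in-flight frames) and non-blocking calls chained by events; ioQueue/computeQueue, input/output, frames, results, numFrames and imageSize are all placeholder names:

    // Transfers go on ioQueue and kernels on computeQueue (same context
    // and device), so the upload of frame N+1 and the readback of frame
    // N-1 can overlap with the execution of frame N.
    const int NBUF = 2;
    cl::Event writeDone[NBUF], kernelDone[NBUF], readDone[NBUF];

    for (int f = 0; f < numFrames; f++){
        int b = f % NBUF;

        // Before reusing slot b, wait until its previous readback finished
        // (which also guarantees the previous kernel on this slot is done)
        std::vector<cl::Event> writeDeps;
        if (f >= NBUF) writeDeps.push_back(readDone[b]);

        ioQueue.enqueueWriteBuffer(input[b], CL_FALSE, 0, imageSize*sizeof(float),
                                   frames[f], f >= NBUF ? &writeDeps : NULL, &writeDone[b]);

        // The kernel waits only on its own upload
        std::vector<cl::Event> runDeps(1, writeDone[b]);
        kernel.setArg(0, input[b]);
        kernel.setArg(1, output[b]);
        computeQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(imageSize),
                                          cl::NullRange, &runDeps, &kernelDone[b]);

        // Non-blocking readback, dependent on the kernel
        std::vector<cl::Event> readDeps(1, kernelDone[b]);
        ioQueue.enqueueReadBuffer(output[b], CL_FALSE, 0, imageSize/4*sizeof(float),
                                  results[f], &readDeps, &readDone[b]);
    }
    ioQueue.finish();
    computeQueue.finish();

    The two queues let the runtime overlap DMA transfers with compute, while the events keep the per-frame ordering correct across them.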