Tags: cuda, synchronization, cuda-streams

How can I make sure two kernels in two streams are sent to the GPU at the same time to run?


I am a beginner in CUDA. I am using an NVIDIA GeForce GTX 1070 with CUDA toolkit 11.3 on Ubuntu 18.04. As shown in the code below, I use two CPU threads to launch two kernels, each in its own stream, on a single GPU. I want these two kernels to be sent to the GPU at exactly the same time. Is there a way to do this?

If not, is there at least a better approach than what I did?

Thank you in advance.

My code:

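// Headers
#include <pthread.h>
#include <cuda_runtime.h>

// State shared by the two host threads
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t cond_mutex = PTHREAD_MUTEX_INITIALIZER;
unsigned int waiting = 0;
cudaStream_t streamZero, streamOne;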

//Kernel zero defined here
__global__ void kernelZero(){...}

//Kernel one defined here
__global__ void kernelOne(){...}

//This function is defined to synchronize two threads when sending kernels to the GPU.
void threadsSynchronize(void) {
    pthread_mutex_lock(&cond_mutex);
    if (++waiting == 2) {
        pthread_cond_broadcast(&cond);
    } else {
        while (waiting != 2)
            pthread_cond_wait(&cond, &cond_mutex);
    }
    pthread_mutex_unlock(&cond_mutex);
}


void *threadZero(void *_) {
    // ...
    threadsSynchronize();
    kernelZero<<<blocksPerGridZero, threadsPerBlockZero, 0, streamZero>>>();
    cudaStreamSynchronize(streamZero);
    // ...
    return NULL;
}


void *threadOne(void *_) {
    // ...
    threadsSynchronize();
    kernelOne<<<blocksPerGridOne, threadsPerBlockOne, 0, streamOne>>>();
    cudaStreamSynchronize(streamOne);
    // ...
    return NULL;
}


int main(void) {
    pthread_t zero, one;
    cudaStreamCreate(&streamZero);
    cudaStreamCreate(&streamOne); 
    // ...
    pthread_create(&zero, NULL, threadZero, NULL);
    pthread_create(&one, NULL, threadOne, NULL);
    // ...
    pthread_join(zero, NULL);
    pthread_join(one, NULL);
    cudaStreamDestroy(streamZero);  
    cudaStreamDestroy(streamOne);  
    return 0;
}

Solution

  • Actually witnessing concurrent kernel behavior on a GPU has a number of requirements which are covered in other questions here on the SO cuda tag, so I'm not going to cover that ground.

    Let's assume your kernels have the possibility to run concurrently.

    In that case, you're not going to do any better than this, whether you use threading or not:

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernel1<<<...,s1>>>(...);
    kernel2<<<...,s2>>>(...);
    

    If your kernels have a "long" duration (much longer than the kernel launch overhead, which is approximately 5-50 µs), then they will appear to start at "nearly" the same time. You won't do better than this by switching to threading. The reason is not published as far as I know, so I will simply say that my own observations suggest that kernel launches to the same GPU are serialized by the CUDA runtime, somehow. You can find anecdotal evidence of this on various forums, and it's fine if you don't believe me. There's also no reason to assume, with the CPU threading mechanisms I am familiar with, that CPU threads execute in lockstep. Therefore there is no reason to assume that a threading system will cause the kernel launches in two different threads to be reached by the host threads at the same instant.
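    One way to check whether two such "long" kernels actually overlap on your GPU is to time the pair once in a single stream and once in two streams. The following is only a minimal sketch: spinKernel and timeTwoKernels are hypothetical names introduced for illustration, the busy-wait kernel stands in for real work, and error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: busy-waits for roughly `cycles` GPU clock
// ticks so that its duration can be made much longer than the launch overhead.
__global__ void spinKernel(long long cycles) {
    const long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

// Launches two copies of spinKernel from a single host thread and returns the
// host-side wall time in milliseconds until both have finished.
// useTwoStreams == false: both launches go to the same stream (serialized).
// useTwoStreams == true:  one launch per stream (overlap is possible).
double timeTwoKernels(bool useTwoStreams, long long cycles) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Warm-up launch so one-time initialization cost is excluded from timing.
    spinKernel<<<1, 1, 0, s1>>>(0);
    cudaDeviceSynchronize();

    const auto t0 = std::chrono::steady_clock::now();
    spinKernel<<<1, 1, 0, s1>>>(cycles);
    spinKernel<<<1, 1, 0, useTwoStreams ? s2 : s1>>>(cycles);
    cudaDeviceSynchronize();  // wait for the work in both streams
    const auto t1 = std::chrono::steady_clock::now();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

    Calling timeTwoKernels(false, n) and then timeTwoKernels(true, n) with a large n (for example 100000000) lets you compare the two cases. If the kernels run concurrently, the two-stream time should come out noticeably lower than the single-stream time. A profiler timeline gives a more direct view of the individual start times.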

    You might do slightly better by using cudaLaunchKernel for the kernel launch rather than the triple-chevron launch syntax (<<<...>>>), but there really is no documentation to support this claim. YMMV.
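    For reference, a back-to-back launch through cudaLaunchKernel could look roughly like the sketch below. The kernels kernelA and kernelB and the helper launchPair are hypothetical placeholders, not the kernels from the question, and nothing here guarantees that the two launches reach the GPU any closer together than with the triple-chevron syntax.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernels: each writes a marker value to its output.
__global__ void kernelA(int *out) { *out = 1; }
__global__ void kernelB(int *out) { *out = 2; }

// Launches kernelA into s1 and kernelB into s2, back to back, from one host
// thread. Returns the first launch error, or cudaSuccess if both succeeded.
cudaError_t launchPair(int *dOutA, int *dOutB, cudaStream_t s1, cudaStream_t s2) {
    // cudaLaunchKernel takes an array of pointers, one per kernel argument.
    void *argsA[] = { &dOutA };
    void *argsB[] = { &dOutB };

    cudaError_t err = cudaLaunchKernel((const void *)kernelA, dim3(1), dim3(1),
                                       argsA, 0, s1);
    if (err != cudaSuccess) return err;

    return cudaLaunchKernel((const void *)kernelB, dim3(1), dim3(1),
                            argsB, 0, s2);
}
```

    The caller is assumed to have created both streams with cudaStreamCreate and allocated dOutA and dOutB with cudaMalloc. Both launches are asynchronous, so a cudaStreamSynchronize on each stream (or a cudaDeviceSynchronize) is still needed before reading the results.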

    Keep in mind that the GPU does its best work as a throughput processor. There are no explicit mechanisms to ensure simultaneous kernel launch, and it's unclear why you would need that.