Tags: concurrency, cuda, thrust, cuda-streams

Thrust execution policy issues kernel to default stream


I am currently designing a short tutorial demonstrating various aspects and capabilities of the Thrust template library.

Unfortunately, there seems to be a problem in the code I have written to show how to use copy/compute concurrency with CUDA streams.

My code can be found here, in the AsynchronousLaunch directory: https://github.com/gnthibault/Cuda_Thrust_Introduction/tree/master/AsynchronousLaunch

Here is an excerpt of the code that reproduces the problem:

//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>
#include <functional>

//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

//Cuda
#include <cuda_runtime.h>

//Local
#include "AsynchronousLaunch.cu.h"

int main( int argc, char* argv[] )
{
    const size_t fullSize = 1024*1024*64;
    const size_t halfSize = fullSize/2;

    //Declare one host std::vector and initialize it with random values
    std::vector<float> hostVector( fullSize );
    std::generate(hostVector.begin(), hostVector.end(), normalRandomFunctor<float>(0.f,1.f) );

    //And two device vectors of half size
    thrust::device_vector<float> deviceVector0( halfSize );
    thrust::device_vector<float> deviceVector1( halfSize );

    //Also declare and initialize two cuda streams
    cudaStream_t stream0, stream1;
    cudaStreamCreate( &stream0 );
    cudaStreamCreate( &stream1 );

    //Now, we would like to perform an alternating copy/compute scheme
    for( int i = 0; i < 10; i++ )
    {
        //Wait for the previous copy to host on this stream to finish before starting the next copy to the device
        cudaStreamSynchronize(stream0);
        //Warning: thrust::copy does not provide asynchronous behaviour for host/device copies; you must use cudaMemcpyAsync to do so
        cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector0.data()), thrust::raw_pointer_cast(hostVector.data()), halfSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
        cudaStreamSynchronize(stream1);
        //second copy is most likely to occur sequentially after the first one
        cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector1.data()), thrust::raw_pointer_cast(hostVector.data())+halfSize, halfSize*sizeof(float), cudaMemcpyHostToDevice, stream1);

        //Compute on device; here an element-wise transform (this could for instance be an inclusive scan for histogram equalization)
        thrust::transform( thrust::cuda::par.on(stream0), deviceVector0.begin(), deviceVector0.end(), deviceVector0.begin(), computeFunctor<float>() );
        thrust::transform( thrust::cuda::par.on(stream1), deviceVector1.begin(), deviceVector1.end(), deviceVector1.begin(), computeFunctor<float>() );

        //Copy back to host
        cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data()), thrust::raw_pointer_cast(deviceVector0.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
        cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data())+halfSize, thrust::raw_pointer_cast(deviceVector1.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
    }

    //Full Synchronize before exit
    cudaDeviceSynchronize();

    cudaStreamDestroy( stream0 );
    cudaStreamDestroy( stream1 );

    return EXIT_SUCCESS;
}
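For reference, the header AsynchronousLaunch.cu.h declares the two functors used above. computeFunctor is listed in the edit further below; normalRandomFunctor could look roughly like the following sketch (illustrative only, the exact definition is in the linked repository):

//Simplified sketch of the host-side random generator declared in AsynchronousLaunch.cu.h
//(illustrative only; the exact definition lives in the linked repository)
#include <random>

template<typename T>
struct normalRandomFunctor
{
    normalRandomFunctor( T mean, T stddev )
        : m_engine(), m_distribution( mean, stddev ) {}

    //Called by std::generate to produce normally distributed values
    T operator()() { return m_distribution( m_engine ); }

    std::default_random_engine m_engine;
    std::normal_distribution<T> m_distribution;
};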

Here are the results of one instance of the program, observed through the NVIDIA Visual Profiler:

[Profiler screenshot: kernels are issued to the default stream]

As you can see, the cudaMemcpyAsync operations (in brown) are issued to streams 13 and 14, but the kernels generated by thrust::transform are issued to the default stream (in blue in the capture).

By the way, I am using CUDA toolkit version 7.0.28, with a GTX 680 and gcc 4.8.2.

I would be grateful if someone could tell me what is wrong with my code.

Thank you in advance

Edit: here is the code that I consider to be a solution:

//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <functional>
#include <vector>


//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>


//Cuda
#include <cuda_runtime.h>

//Local definitions

template<typename T>
struct computeFunctor
{
    __host__ __device__
    computeFunctor() {}

    __host__ __device__
    T operator()( T in )
    {
        //Naive functor that generates expensive but useless instructions
        T a =  cos(in);
        for(int i = 0; i < 350; i++ )
        {
            a+=cos(in);
        }
        return a;
    }
};

int main( int argc, char* argv[] )
{
    const size_t fullSize =  1024*1024*2;
    const size_t nbOfStrip = 4;
    const size_t stripSize =  fullSize/nbOfStrip;

    //Allocate pinned host memory in order to use the asynchronous api, and initialize it
    float* hostVector;
    cudaMallocHost(&hostVector,fullSize*sizeof(float));
    std::fill(hostVector, hostVector+fullSize, 1.0f );

    //And one device vector of the same size
    thrust::device_vector<float> deviceVector( fullSize );

    //Also declare and initialize one cuda stream per strip
    std::vector<cudaStream_t> vStream(nbOfStrip);
    for( auto it = vStream.begin(); it != vStream.end(); it++ )
    {
        cudaStreamCreate( &(*it) );
    }

    //Now, we would like to perform an alternating copy/compute scheme in a loop, issuing copyToDevice/Compute/CopyToHost per stream:
    for( int i = 0; i < 5; i++ )
    {
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            size_t nextOffset = stripSize*(j+1);
            cudaStreamSynchronize(vStream.at(j));
            cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
            thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );
            cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
        }
    }
    //On devices that do not possess multiple copy-engine queues, this solution serializes all commands even though they have been issued to different streams
    //Why? Because from the point of view of the copy engine, which is a single resource in this case, there is a time dependency between HtoD(n) and DtoH(n), which is fine, but there is also
    //a false dependency between DtoH(n) and HtoD(n+1) that precludes any copy/compute overlap

    //Full Synchronize before testing second solution
    cudaDeviceSynchronize();

    //Now, we would like to perform an alternating copy/compute scheme in a loop, issuing copyToDevice for each stream, then Compute for each stream, then CopyToHost for each stream:
    for( int i = 0; i < 5; i++ )
    {
        for( int j=0; j!=nbOfStrip; j++)
        {
            cudaStreamSynchronize(vStream.at(j));
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            size_t nextOffset = stripSize*(j+1);
            thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );

        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
        }
    }
    //On devices that do not possess multiple copy-engine queues, this solution yields better results; on others, it should show nearly identical results

    //Full Synchronize before exit
    cudaDeviceSynchronize();

    for( auto it = vStream.begin(); it != vStream.end(); it++ )
    {
        cudaStreamDestroy( *it );
    }
    cudaFreeHost( hostVector );

    return EXIT_SUCCESS;
}

Compiled using nvcc ./test.cu -o ./test.exe -std=c++11


Solution

  • There are 2 things I would point out. Both of these are (now) referenced in this related question/answer which you may wish to refer to.

    1. The failure of thrust to issue the underlying kernels to non-default streams in this case seems to be related to this issue. It can be rectified (as covered in the comments to the question) by updating to the latest thrust version. Future CUDA versions (beyond 7) will probably include a fixed thrust as well. This is probably the central issue being discussed in this question. (A short snippet for checking which Thrust version nvcc actually compiles against is shown at the end of this answer.)

    2. The question seems to also suggest that one of the goals is overlap of copy and compute:

      in order to show how to use copy/compute concurrency using cuda streams
      

      but this won't be achievable, I don't think, with the code as currently crafted, even if item 1 above is fixed. Overlap of copy with compute operations requires the proper use of cuda streams on the copy operation (cudaMemcpyAsync) as well as a pinned host allocation. The code proposed in the question is lacking any use of a pinned host allocation (std::vector does not use a pinned allocator by default, AFAIK), and so I would not expect the cudaMemcpyAsync operation to overlap with any kernel activity, even if it should be otherwise possible. To rectify this, a pinned allocator should be used, and one such example is given here. A minimal sketch of such an allocator is also shown at the end of this answer.

    For completeness, the question is otherwise lacking an MCVE, which is expected for questions of this type. This makes it more difficult for others to attempt to test your issue, and is explicitly a close reason on SO. Yes, you provided a link to an external github repo, but this practice is frowned upon. The MCVE requirement explicitly states that the necessary pieces should be included in the question itself (not an external reference). Since the only lacking piece, AFAICT, is "AsynchronousLaunch.cu.h", it seems like it would have been relatively straightforward to include this one additional piece in your question. The problem with external links is that when they break in the future, the question becomes less useful for future readers. (And forcing others to navigate an external github repo looking for specific files is not conducive to getting help, in my opinion.)
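    Regarding item 1, a quick way to verify which Thrust version nvcc actually compiles against (after updating the headers) is to print the version macros from thrust/version.h. This is a minimal sketch, not part of the original question:

//Print the Thrust version that the compiler actually picks up
#include <iostream>
#include <thrust/version.h>

int main()
{
    std::cout << "Thrust v"
              << THRUST_MAJOR_VERSION << "."
              << THRUST_MINOR_VERSION << "."
              << THRUST_SUBMINOR_VERSION << std::endl;
    return 0;
}

    Regarding item 2, a minimal sketch of a pinned host allocator is shown below. The name cudaHostAllocator is invented here for illustration (it is not a CUDA or Thrust type); it simply wraps cudaMallocHost/cudaFreeHost so that a standard container's storage is page-locked, which is what cudaMemcpyAsync needs in order to overlap with kernel execution:

//Sketch of a pinned-memory allocator (cudaHostAllocator is a hypothetical name, not a CUDA/Thrust type)
#include <cstddef>
#include <new>
#include <vector>
#include <cuda_runtime.h>

template<typename T>
class cudaHostAllocator
{
public:
    typedef T               value_type;
    typedef T*              pointer;
    typedef const T*        const_pointer;
    typedef T&              reference;
    typedef const T&        const_reference;
    typedef std::size_t     size_type;
    typedef std::ptrdiff_t  difference_type;

    template<typename U> struct rebind { typedef cudaHostAllocator<U> other; };

    cudaHostAllocator() throw() {}
    cudaHostAllocator( const cudaHostAllocator& ) throw() {}
    template<typename U> cudaHostAllocator( const cudaHostAllocator<U>& ) throw() {}

    //Allocate page-locked host memory instead of pageable memory
    pointer allocate( size_type n, const void* = 0 )
    {
        pointer p = 0;
        if( cudaMallocHost( (void**)&p, n*sizeof(T) ) != cudaSuccess )
            throw std::bad_alloc();
        return p;
    }

    void deallocate( pointer p, size_type ) { cudaFreeHost( p ); }

    size_type max_size() const throw() { return size_type(-1) / sizeof(T); }

    void construct( pointer p, const T& val ) { ::new( (void*)p ) T( val ); }
    void destroy( pointer p ) { p->~T(); }
};

template<typename T, typename U>
bool operator==( const cudaHostAllocator<T>&, const cudaHostAllocator<U>& ) { return true; }
template<typename T, typename U>
bool operator!=( const cudaHostAllocator<T>&, const cudaHostAllocator<U>& ) { return false; }

//Usage: the host buffer is now pinned, so cudaMemcpyAsync on it can overlap with compute
//std::vector<float, cudaHostAllocator<float> > hostVector( fullSize );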