c++ · cuda · multi-gpu

Launching asynchronous memory copy operations on multiple GPUs


I want to divide an array of data on the host and copy the pieces to the device memories of multiple GPUs, and I want all of these copy operations to run concurrently.

For this I am using cudaMemcpyAsync, which I launch in a private stream for each GPU.

Here is what I am doing (doubts in the code are marked with comments starting with ??):

#define SIZE 1000
#define GPUCOUNT 2

int* hostData = nullptr;
int *devData[GPUCOUNT];
cudaStream_t stream[GPUCOUNT];

// Create one stream per GPU
for ( int i=0; i != GPUCOUNT ; ++i )
{    
    // ?? Do I need to call cudaSetDevice before creating the stream for each GPU ??
    cudaStreamCreate(&stream[i]);
}

// Allocate pinned data on host
cudaMallocHost( (void**) &hostData, sizeof(int) * SIZE );

// Allocate data on each device and copy part of host data to it
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT ); // ?? Does the blocking behavior of cudaMalloc prevent the async memcpy launched in the stream of another GPU from running concurrently ??
   cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}

// Some CPU code while copy is happening
// ....

// Wait for the copies on all devices to finish
// (cudaDeviceSynchronize only waits on the current device)
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaDeviceSynchronize();
}

// Do something else

As I read the CUDA C Programming Guide, I see that the above memory copy operations will not happen concurrently, because between two consecutive async memory copy launches I am invoking a host operation that allocates device memory (a blocking call):

3.2.5.5.4. Implicit Synchronization

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

‣ a page-locked host memory allocation,

‣ a device memory allocation,

‣ a device memory set,

‣ a memory copy between two addresses to the same device memory,

‣ any CUDA command to the default stream,

If the above reasoning is correct, then I need to split my memory allocation and copy operations:

// Allocate data on each device 
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT  );
}

// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
   // ?? Do I need to call cudaSetDevice before the memory copy ??
   // The CUDA guide says: "A memory copy will succeed even if it is issued to a stream that is not associated to the current device."

   cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}

Is my above analysis valid?

Also, is it not possible to do this without creating an explicit per-GPU stream, by launching the cudaMemcpyAsync operations in the default stream (stream 0) of each GPU? I am basing this on the following claim made in the CUDA C Programming Guide:

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

The code would then look like this:

#define SIZE 1000
#define GPUCOUNT 2

int* hostData = nullptr;
int *devData[GPUCOUNT];

// Allocate pinned data on host
cudaMallocHost( (void**) &hostData, sizeof(int) * SIZE );

// Allocate data on each device
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT  );
}

// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
   // ?? Do I need to call cudaSetDevice before the memory copy ??
   // The CUDA guide says: "A memory copy will succeed even if it is issued to a stream that is not associated to the current device."

   cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, 0 );
}

// Some CPU code while copy is happening
// ....

// Wait for the copies on all devices to finish
// (cudaDeviceSynchronize only waits on the current device)
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaDeviceSynchronize();
}

// Do something else

Solution

  • http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g418c299b069c4803bfb7cab4943da383.html

    cudaError_t cudaSetDevice(int device)
    

    Sets device as the current device for the calling host thread.

    Any device memory subsequently allocated from this host thread using cudaMalloc(), cudaMallocPitch() or cudaMallocArray() will be physically resident on device. Any host memory allocated from this host thread using cudaMallocHost() or cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with device. Any streams or events created from this host thread will be associated with device. Any kernels launched from this host thread using the <<<>>> operator or cudaLaunch() will be executed on device.

    This call may be made from any host thread, to any device, and at any time. This function will do no synchronization with the previous or new device, and should be considered a very low overhead call.

    Looks like cudaSetDevice will do everything you need without explicit streams. You should be able to flip through each device, use its default stream, and make your calls to malloc and memcpy. Using async memcpy and stream-based kernel calls will assist with concurrent memory transfers and kernel calls on a device.

    You do need to call cudaSetDevice before all calls for that device; streams will not help with this. A minimal sketch of that approach is shown below.
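
    By way of illustration, here is a minimal sketch of the default-stream approach, assuming the same SIZE/GPUCOUNT setup as in the question. The CHECK error-checking macro is a hypothetical helper added here for readability, not part of the CUDA API.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical helper: abort with a message if a CUDA call fails
    #define CHECK(call)                                           \
        do {                                                      \
            cudaError_t err = (call);                             \
            if (err != cudaSuccess) {                             \
                fprintf(stderr, "CUDA error: %s\n",               \
                        cudaGetErrorString(err));                 \
                exit(EXIT_FAILURE);                               \
            }                                                     \
        } while (0)

    #define SIZE 1000
    #define GPUCOUNT 2

    int main()
    {
        int* hostData = nullptr;
        int* devData[GPUCOUNT];
        const size_t chunkBytes = sizeof(int) * SIZE / GPUCOUNT;

        // Pinned host memory is required for truly asynchronous copies
        CHECK( cudaMallocHost((void**)&hostData, sizeof(int) * SIZE) );

        // Allocate on every device first, so no blocking cudaMalloc is
        // issued between the async copy launches
        for (int i = 0; i != GPUCOUNT; ++i)
        {
            CHECK( cudaSetDevice(i) );  // select the device before every call meant for it
            CHECK( cudaMalloc((void**)&devData[i], chunkBytes) );
        }

        // Launch the copies; each device's default stream is independent
        // of the default streams of the other devices
        for (int i = 0; i != GPUCOUNT; ++i)
        {
            CHECK( cudaSetDevice(i) );
            CHECK( cudaMemcpyAsync(devData[i], hostData + i * SIZE / GPUCOUNT,
                                   chunkBytes, cudaMemcpyHostToDevice, 0) );
        }

        // ... CPU work can overlap with the transfers here ...

        // cudaDeviceSynchronize waits only on the current device,
        // so synchronize each device in turn
        for (int i = 0; i != GPUCOUNT; ++i)
        {
            CHECK( cudaSetDevice(i) );
            CHECK( cudaDeviceSynchronize() );
        }

        // Clean up
        for (int i = 0; i != GPUCOUNT; ++i)
        {
            CHECK( cudaSetDevice(i) );
            CHECK( cudaFree(devData[i]) );
        }
        CHECK( cudaFreeHost(hostData) );
        return 0;
    }

    If you do want explicit streams, as in the question's first listing, call cudaSetDevice(i) before cudaStreamCreate(&stream[i]) so that each stream is associated with the right device, per the documentation quoted above.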