I have two Tesla C2070 GPUs (with P2P and UVA support), and I want to send and receive data between them using CUDA.
On GPU A, I have a matrix:
a11 a12 a13 a14
a21 a22 a23 a24
On GPU B, I have another matrix:
b11 b12 b13 b14
b21 b22 b23 b24
I can only send contiguous elements, as in the code below:
#include <cuda_runtime.h>

int main(void)
{
    float *d_a, *d_b;
    int N = 4;
    int M = 2;
    size_t pitch;
    cudaSetDevice(0);
    cudaMallocPitch((void **)&d_a, &pitch, sizeof(float)*N, M);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaMallocPitch((void **)&d_b, &pitch, sizeof(float)*N, M);
    cudaDeviceEnablePeerAccess(0, 0);
    // Initialization for d_a
    // Initialization for d_b
    // Copy M*N/2 elements from d_a to d_b, starting from d_a[1]
    cudaMemcpy(&d_b[1], &d_a[1], M*N/2*sizeof(float), cudaMemcpyDefault);
    // Print result d_b
    return 0;
}
How can I send the last two columns of the matrix from GPU A to GPU B directly, so that on GPU B I get:
b11 b12 a13 a14
b21 b22 a23 a24
Similarly, how can I send the first row of the matrix from GPU A to GPU B, so that on GPU B I get:
a11 a12 a13 a14
b21 b22 b23 b24
And if I have a 1-D array such as a1 a2 a3 a4 a5 a6 a7 a8 ..., how can I send elements 1, 4, 7, ... (every third element) from GPU A to replace the corresponding ones on GPU B?
The API call you need to look at is cudaMemcpy2D. This allows fairly straightforward copying of all or portions of pitched data, and is the natural counterpart of cudaMallocPitch.
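For orientation, cudaMemcpy2D takes a destination pointer and pitch, a source pointer and pitch, then the width in bytes and the height in rows of the region to transfer. A minimal sketch of wrapping it for a device-to-device sub-block copy might look like this (the function and parameter names here are purely illustrative):

#include <cuda_runtime.h>

// Minimal sketch (illustrative names): copy a sub-block of "height" rows,
// each "widthBytes" wide, between two pitched device allocations.
void copy_subblock(void *dst, size_t dpitch, const void *src, size_t spitch,
                   size_t widthBytes, size_t height)
{
    // dpitch and spitch are the pitches in bytes returned by cudaMallocPitch;
    // widthBytes must not exceed either pitch. Only widthBytes bytes per row
    // are transferred; the remainder of each pitched row is skipped.
    cudaMemcpy2D(dst, dpitch, src, spitch, widthBytes, height,
                 cudaMemcpyDeviceToDevice);
}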
If we leave aside the multi-GPU aspect of your question for a moment and just focus on the copying of pitched data (on UVA platforms, how GPU-to-GPU transfers are handled is basically an implementation detail you don't need to know about), there are only three things required to do what you want:

1. Allocate both the source and destination memory as pitched linear memory (i.e. using cudaMallocPitch). Note you should keep a pitch for each pointer you allocate: there is no guarantee that the API will return the same pitch for two different allocations of the same size, and this is particularly true if the allocations are not on the same device.
2. Compute the address of the first element of the sub-array in both the source and destination allocations, using the pitch of each allocation.
3. Pass those addresses and pitches to cudaMemcpy2D, together with the width in bytes and the number of rows of the sub-array you want to transfer.

Here is a concrete example based on the code you posted which copies a subset of data between two pitched allocations, assuming column major order. Note that for brevity I have encapsulated most of the addressing mechanics in a simple class which can be used on both the host and device. Two 5x10 pitched arrays are allocated, and a 3x3 sub-array is copied from one to the other. I have used kernel printf to show the copying action:
#include <cstdio>

// Simple pitched-matrix wrapper usable on both host and device.
// Storage is column major: element (i,j) lives at byte offset j*pitch + i*sizeof(float).
struct mat
{
    int m, n;
    size_t pitch;
    char *ptr;

    __device__ __host__
    mat(int _m, int _n, size_t _pitch, char *_ptr) : m(_m), n(_n), pitch(_pitch), ptr(_ptr) {};

    // Address of element (i,j) inside the pitched allocation
    __device__ __host__ float * getptr(int i=0, int j=0) {
        float * col = (float*)(ptr + j*pitch);
        return col + i;
    };

    __device__ __host__ float& operator() (int i, int j) {
        return *getptr(i,j);
    };

    __device__ __host__
    void print() {
        for(int i=0; i<m; i++) {
            for(int j=0; j<n; j++) {
                printf("%4.f ", (*this)(i,j));
            }
            printf("\n");
        }
    };
};

__global__ void printmat(struct mat x) { x.print(); }
int main(void)
{
    const int M = 5, N = 10;
    const size_t hostpitch = M * sizeof(float);

    float *a = new float[M*N], *b = new float[M*N];
    mat A(M, N, hostpitch, (char *)(a));
    mat B(M, N, hostpitch, (char *)(b));
    for(int v=0, j=0; j<N; j++) {
        for(int i=0; i<M; i++) {
            A(i,j) = (float)v; B(i,j) = (float)(100+v++);
        }
    }

    // Pitched device allocations; each allocation keeps its own pitch
    char *d_a, *d_b;
    size_t pitch_a, pitch_b;
    cudaMallocPitch((void **)&d_a, &pitch_a, sizeof(float)*M, N);
    cudaMallocPitch((void **)&d_b, &pitch_b, sizeof(float)*M, N);
    mat Ad(M, N, pitch_a, d_a); mat Bd(M, N, pitch_b, d_b);

    // Upload both host matrices to their pitched device copies
    cudaMemcpy2D(Ad.getptr(), Ad.pitch, A.getptr(), A.pitch,
                 A.pitch, A.n, cudaMemcpyHostToDevice);
    printmat<<<1,1>>>(Ad);
    cudaMemcpy2D(Bd.getptr(), Bd.pitch, B.getptr(), B.pitch,
                 B.pitch, B.n, cudaMemcpyHostToDevice);
    printmat<<<1,1>>>(Bd);

    // Copy a 3x3 sub-array starting at element (1,1) from Bd into Ad
    int ci = 3, cj = 3;
    cudaMemcpy2D(Ad.getptr(1,1), Ad.pitch, Bd.getptr(1,1), Bd.pitch,
                 ci*sizeof(float), cj, cudaMemcpyDeviceToDevice);
    printmat<<<1,1>>>(Ad); cudaDeviceSynchronize();

    return 0;
}
which does this:
>nvcc -m32 -Xptxas="-v" -arch=sm_21 pitched.cu
pitched.cu
tmpxft_00001348_00000000-5_pitched.cudafe1.gpu
tmpxft_00001348_00000000-10_pitched.cudafe2.gpu
pitched.cu
ptxas : info : 0 bytes gmem, 8 bytes cmem[2]
ptxas : info : Compiling entry function '_Z8printmat3mat' for 'sm_21'
ptxas : info : Function properties for _Z8printmat3mat
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 23 registers, 48 bytes cmem[0]
tmpxft_00001348_00000000-5_pitched.cudafe1.cpp
tmpxft_00001348_00000000-15_pitched.ii
>cuda-memcheck a.exe
========= CUDA-MEMCHECK
0 5 10 15 20 25 30 35 40 45
1 6 11 16 21 26 31 36 41 46
2 7 12 17 22 27 32 37 42 47
3 8 13 18 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
100 105 110 115 120 125 130 135 140 145
101 106 111 116 121 126 131 136 141 146
102 107 112 117 122 127 132 137 142 147
103 108 113 118 123 128 133 138 143 148
104 109 114 119 124 129 134 139 144 149
0 5 10 15 20 25 30 35 40 45
1 106 111 116 21 26 31 36 41 46
2 107 112 117 22 27 32 37 42 47
3 108 113 118 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
========= ERROR SUMMARY: 0 errors
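Coming back to your concrete cases, all three transfers can be expressed with the same call. The following is an untested sketch assuming the 2x4, row-pitched layout from your question, with d_a on GPU A, d_b on GPU B, a separate pitch kept for each allocation (pitch_a and pitch_b), peer access enabled as in your code, and n_elements standing in for the length of the 1-D array:

// Last two columns: start at element (0,2) of each matrix and copy
// 2 floats from each of the 2 rows
cudaMemcpy2D(d_b + 2, pitch_b, d_a + 2, pitch_a,
             2*sizeof(float), 2, cudaMemcpyDefault);

// First row: 4 contiguous floats, so a single-row 2D copy (or a plain
// cudaMemcpy of 4*sizeof(float) bytes) is enough
cudaMemcpy2D(d_b, pitch_b, d_a, pitch_a,
             4*sizeof(float), 1, cudaMemcpyDefault);

// Every third element of a 1-D array (reusing d_a/d_b for the 1-D buffers):
// treat it as a 2D copy whose rows are one float wide with a stride of 3 floats
cudaMemcpy2D(d_b, 3*sizeof(float), d_a, 3*sizeof(float),
             sizeof(float), (n_elements + 2)/3, cudaMemcpyDefault);

Note that cudaMemcpy2D does not require the source or destination to have been allocated with cudaMallocPitch; any row stride at least as large as the copy width is acceptable, which is what makes the strided 1-D copy in the last call possible.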