I have two Tesla C2070 GPUs (with P2P and UVA support), and I want to send and receive data between them using CUDA.
On GPU A, I have a matrix:
a11 a12 a13 a14
a21 a22 a23 a24
On GPU B, I have another matrix:
b11 b12 b13 b14
b21 b22 b23 b24
I can only send contiguous elements, as in the code below:
#include <cuda_runtime.h>

int main(void)
{
    float *d_a, *d_b;
    int N = 4;
    int M = 2;
    size_t pitch;
    cudaSetDevice(0);
    cudaMallocPitch((void **)&d_a, &pitch, sizeof(float)*N, M);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaMallocPitch((void **)&d_b, &pitch, sizeof(float)*N, M);
    cudaDeviceEnablePeerAccess(0, 0);
    // Initialization for d_a
    // Initialization for d_b
    // Copy M*N/2 elements from d_a to d_b, starting from d_a[1]
    cudaMemcpy(&d_b[1], &d_a[1], M*N/2*sizeof(float), cudaMemcpyDefault);
    // Print result d_b
    return 0;
}
How can I send the last two columns of the matrix from GPU A to GPU B directly, so that on GPU B I get:
b11 b12 a13 a14
b21 b22 a23 a24
Similarly, how can I send the first row of the matrix from GPU A to GPU B, so that on GPU B I get:
a11 a12 a13 a14
b21 b22 b23 b24
And if I have a 1-D array such as a1 a2 a3 a4 a5 a6 a7 a8 ..., how can I send elements 1, 4, 7, ... (every third element) from GPU A to replace the corresponding ones on GPU B?
The API call you need to look at is cudaMemcpy2D. This allows fairly straightforward copying of all or portions of pitched data, and is the natural counterpart of cudaMallocPitch.
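For orientation, cudaMemcpy2D takes a destination pointer and pitch, a source pointer and pitch, then the width in bytes and the height in rows of the region to transfer. A minimal sketch of wrapping it for a device-to-device sub-block copy might look like this (the function and parameter names here are purely illustrative):

#include <cuda_runtime.h>

// Minimal sketch (illustrative names): copy a sub-block of "height" rows,
// each "widthBytes" wide, between two pitched device allocations.
void copy_subblock(void *dst, size_t dpitch, const void *src, size_t spitch,
                   size_t widthBytes, size_t height)
{
    // dpitch and spitch are the pitches in bytes returned by cudaMallocPitch;
    // widthBytes must not exceed either pitch. Only widthBytes bytes per row
    // are transferred; the remainder of each pitched row is skipped.
    cudaMemcpy2D(dst, dpitch, src, spitch, widthBytes, height,
                 cudaMemcpyDeviceToDevice);
}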
If we leave aside the multi-GPU aspect of your question for a moment and just focus on the copying of pitched data (on UVA platforms, how GPU-to-GPU transfers are handled is basically an implementation detail you don't need to know about), there are only three things required to do what you want:

1. Allocate both the source and destination memory as pitched linear memory (i.e. using cudaMallocPitch). Note you should keep a pitch for each pointer you allocate: there is no guarantee that the API will return the same pitch for two different allocations of the same size, and this is particularly true if the allocations are not on the same device.
2. Compute the address of the first element of the sub-array in both the source and destination allocations, using the pitch of each allocation.
3. Pass those addresses and pitches to cudaMemcpy2D, together with the width in bytes and the number of rows of the sub-array you want to transfer.

Here is a concrete example based on the code you posted which copies a subset of data between two pitched allocations, assuming column major order. Note that for brevity I have encapsulated most of the addressing mechanics in a simple class which can be used on both the host and device. Two 5x10 pitched arrays are allocated, and a 3x3 sub-array is copied from one to the other. I have used kernel printf to show the copying action:
#include <cstdio>

// Simple pitched-matrix wrapper usable on both host and device.
// Storage is column major: element (i,j) lives at byte offset j*pitch + i*sizeof(float).
struct mat
{
    int m, n;
    size_t pitch;
    char *ptr;

    __device__ __host__
    mat(int _m, int _n, size_t _pitch, char *_ptr) : m(_m), n(_n), pitch(_pitch), ptr(_ptr) {};

    // Address of element (i,j) inside the pitched allocation
    __device__ __host__ float * getptr(int i=0, int j=0) {
        float * col = (float*)(ptr + j*pitch);
        return col + i;
    };

    __device__ __host__ float& operator() (int i, int j) {
        return *getptr(i,j);
    };

    __device__ __host__
    void print() {
        for(int i=0; i<m; i++) {
            for(int j=0; j<n; j++) {
                printf("%4.f ", (*this)(i,j));
            }
            printf("\n");
        }
    };
};

__global__ void printmat(struct mat x) { x.print(); }
int main(void)
{
    const int M = 5, N = 10;
    const size_t hostpitch = M * sizeof(float);

    float *a = new float[M*N], *b = new float[M*N];
    mat A(M, N, hostpitch, (char *)(a));
    mat B(M, N, hostpitch, (char *)(b));
    for(int v=0, j=0; j<N; j++) {
        for(int i=0; i<M; i++) {
            A(i,j) = (float)v; B(i,j) = (float)(100+v++);
        }
    }

    // Pitched device allocations; each allocation keeps its own pitch
    char *d_a, *d_b;
    size_t pitch_a, pitch_b;
    cudaMallocPitch((void **)&d_a, &pitch_a, sizeof(float)*M, N);
    cudaMallocPitch((void **)&d_b, &pitch_b, sizeof(float)*M, N);
    mat Ad(M, N, pitch_a, d_a); mat Bd(M, N, pitch_b, d_b);

    // Upload both host matrices to their pitched device copies
    cudaMemcpy2D(Ad.getptr(), Ad.pitch, A.getptr(), A.pitch,
                 A.pitch, A.n, cudaMemcpyHostToDevice);
    printmat<<<1,1>>>(Ad);
    cudaMemcpy2D(Bd.getptr(), Bd.pitch, B.getptr(), B.pitch,
                 B.pitch, B.n, cudaMemcpyHostToDevice);
    printmat<<<1,1>>>(Bd);

    // Copy a 3x3 sub-array starting at element (1,1) from Bd into Ad
    int ci = 3, cj = 3;
    cudaMemcpy2D(Ad.getptr(1,1), Ad.pitch, Bd.getptr(1,1), Bd.pitch,
                 ci*sizeof(float), cj, cudaMemcpyDeviceToDevice);
    printmat<<<1,1>>>(Ad); cudaDeviceSynchronize();

    return 0;
}
which does this:
>nvcc -m32 -Xptxas="-v" -arch=sm_21 pitched.cu
pitched.cu
tmpxft_00001348_00000000-5_pitched.cudafe1.gpu
tmpxft_00001348_00000000-10_pitched.cudafe2.gpu
pitched.cu
ptxas : info : 0 bytes gmem, 8 bytes cmem[2]
ptxas : info : Compiling entry function '_Z8printmat3mat' for 'sm_21'
ptxas : info : Function properties for _Z8printmat3mat
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 23 registers, 48 bytes cmem[0]
tmpxft_00001348_00000000-5_pitched.cudafe1.cpp
tmpxft_00001348_00000000-15_pitched.ii
>cuda-memcheck a.exe
========= CUDA-MEMCHECK
0 5 10 15 20 25 30 35 40 45
1 6 11 16 21 26 31 36 41 46
2 7 12 17 22 27 32 37 42 47
3 8 13 18 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
100 105 110 115 120 125 130 135 140 145
101 106 111 116 121 126 131 136 141 146
102 107 112 117 122 127 132 137 142 147
103 108 113 118 123 128 133 138 143 148
104 109 114 119 124 129 134 139 144 149
0 5 10 15 20 25 30 35 40 45
1 106 111 116 21 26 31 36 41 46
2 107 112 117 22 27 32 37 42 47
3 108 113 118 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
========= ERROR SUMMARY: 0 errors
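Coming back to your concrete cases, all three transfers can be expressed with the same call. The following is an untested sketch assuming the 2x4, row-pitched layout from your question, with d_a on GPU A, d_b on GPU B, a separate pitch kept for each allocation (pitch_a and pitch_b), peer access enabled as in your code, and n_elements standing in for the length of the 1-D array:

// Last two columns: start at element (0,2) of each matrix and copy
// 2 floats from each of the 2 rows
cudaMemcpy2D(d_b + 2, pitch_b, d_a + 2, pitch_a,
             2*sizeof(float), 2, cudaMemcpyDefault);

// First row: 4 contiguous floats, so a single-row 2D copy (or a plain
// cudaMemcpy of 4*sizeof(float) bytes) is enough
cudaMemcpy2D(d_b, pitch_b, d_a, pitch_a,
             4*sizeof(float), 1, cudaMemcpyDefault);

// Every third element of a 1-D array (reusing d_a/d_b for the 1-D buffers):
// treat it as a 2D copy whose rows are one float wide with a stride of 3 floats
cudaMemcpy2D(d_b, 3*sizeof(float), d_a, 3*sizeof(float),
             sizeof(float), (n_elements + 2)/3, cudaMemcpyDefault);

Note that cudaMemcpy2D does not require the source or destination to have been allocated with cudaMallocPitch; any row stride at least as large as the copy width is acceptable, which is what makes the strided 1-D copy in the last call possible.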