Two consecutive "cudaMallocPitch" calls make the code fail

I wrote some simple CUDA code as follows:

//Allocate the first 2d array "deviceArray2DInput"
if(cudaMallocPitch((Float32**) &deviceArray2DInput, &devicePitch, sizeof(Float32)*deviceColNumber,deviceRowNumber) == cudaErrorMemoryAllocation){
    return -1;
}

//Allocate the second 2d array "deviceArray2DOutput". It is supposed to hold the output of some processing.
if(cudaMallocPitch((Float32**) &deviceArray2DOutput, &devicePitch,sizeof(Float32)*deviceRowNumber,deviceColNumber) == cudaErrorMemoryAllocation){
    return -1;
}

//Copy data from "hostArrayR" to "deviceArray2DInput" (#1)
cudaMemcpy2D(deviceArray2DInput,devicePitch,hostArrayR,sizeof(Float32)*colNumber,sizeof(Float32)*deviceColNumber,deviceRowNumber,cudaMemcpyHostToDevice);

//Clear the first 10000 elements in "hostArrayR" for verification.
for(int i = 0; i < 10000; ++i){
    hostArrayR[i] = 0;
}

//Copy data back from "deviceArray2DInput" to "hostArrayR"(#2)
cudaMemcpy2D(hostArrayR,sizeof(Float32)*colNumber,deviceArray2DInput,devicePitch,sizeof(Float32)*deviceColNumber,deviceRowNumber,cudaMemcpyDeviceToHost);

When I commented out the second allocation block, the code worked well: it copied the data from the host array "hostArrayR" to the device array "deviceArray2DInput" and copied it back. However, when both allocation blocks were present, the copied-back "hostArrayR" was empty (no data was copied back from the device).

I am sure that the data was in "hostArrayR" at line (#1), but there was no data at line (#2). I cleared the first 10000 elements (far fewer than the size of the array) to verify that the data did not come back.

I am using Nvidia Nsight 2.2 with Visual Studio 2010. The array size is 1024x768, and the data are 32-bit floats. My GPU card is a GTX570. There seems to be no memory allocation error (otherwise the code would have returned before doing the copies).

I did not try "cudaMalloc()" because I prefer to use "cudaMallocPitch()" for memory alignment.

Solution

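You should check the return value of each API call against cudaSuccess, rather than against one specific error; as written, any failure other than cudaErrorMemoryAllocation passes unnoticed.
You should also check the error values returned by the cudaMemcpy2D() calls.
You're overwriting devicePitch in the second cudaMallocPitch() call. The two arrays have different shapes and hence could have different pitches, yet both cudaMemcpy2D() calls on deviceArray2DInput then use the pitch that was returned for deviceArray2DOutput. Keep a separate pitch variable for each allocation.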
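
The three points above could be combined roughly as in the sketch below. It is only an illustration under stated assumptions, not the asker's actual program: the CUDA_CHECK macro and the names inputPitch and outputPitch are hypothetical, float stands in for Float32, and the device dimensions are assumed equal to the host dimensions (1024x768, as in the question).

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the original post): compare every
// runtime API result against cudaSuccess and stop on any failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,    \
                         __LINE__, #call, cudaGetErrorString(err_));    \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

int main() {
    // Assumption: host and device arrays share the same 1024x768 shape.
    const size_t colNumber = 1024;
    const size_t rowNumber = 768;

    // Fill the host array with non-zero values so a failed copy is visible.
    std::vector<float> hostArrayR(colNumber * rowNumber);
    for (size_t i = 0; i < hostArrayR.size(); ++i) {
        hostArrayR[i] = static_cast<float>(i % 251) + 1.0f;
    }
    const std::vector<float> reference = hostArrayR;

    float* deviceArray2DInput = nullptr;
    float* deviceArray2DOutput = nullptr;
    size_t inputPitch = 0;   // pitch of the input allocation only
    size_t outputPitch = 0;  // pitch of the output allocation only

    // Input array: rowNumber rows, each colNumber floats wide.
    CUDA_CHECK(cudaMallocPitch(reinterpret_cast<void**>(&deviceArray2DInput),
                               &inputPitch, sizeof(float) * colNumber, rowNumber));
    // Output array: transposed shape, so its pitch may differ.
    CUDA_CHECK(cudaMallocPitch(reinterpret_cast<void**>(&deviceArray2DOutput),
                               &outputPitch, sizeof(float) * rowNumber, colNumber));

    // Host to device, using the pitch that belongs to the input array.
    CUDA_CHECK(cudaMemcpy2D(deviceArray2DInput, inputPitch,
                            hostArrayR.data(), sizeof(float) * colNumber,
                            sizeof(float) * colNumber, rowNumber,
                            cudaMemcpyHostToDevice));

    // Clear the first 10000 host elements so the copy back can be verified.
    std::fill(hostArrayR.begin(), hostArrayR.begin() + 10000, 0.0f);

    // Device to host, again with the input array's own pitch.
    CUDA_CHECK(cudaMemcpy2D(hostArrayR.data(), sizeof(float) * colNumber,
                            deviceArray2DInput, inputPitch,
                            sizeof(float) * colNumber, rowNumber,
                            cudaMemcpyDeviceToHost));

    const bool roundTripOk = (hostArrayR == reference);
    std::printf("input pitch: %zu bytes, output pitch: %zu bytes, round trip %s\n",
                inputPitch, outputPitch, roundTripOk ? "ok" : "FAILED");

    CUDA_CHECK(cudaFree(deviceArray2DInput));
    CUDA_CHECK(cudaFree(deviceArray2DOutput));
    return roundTripOk ? 0 : 1;
}
```

Printing both pitches makes it easy to see whether they actually differ on a given card. With checks like these in place, a copy that is handed the wrong pitch should report an error instead of silently leaving "hostArrayR" unchanged.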