I'm new to the GPU world and just installed CUDA to write some programs. I played with the thrust library but found that uploading data to the GPU is very slow: only about 35 MB/s for the host-to-device transfer on my reasonably good desktop. Why is that?
Environment: Visual Studio 2012, CUDA 5.0, GTX760, Intel-i7, Windows 7 x64
GPU bandwidth test:
The transfer speed is supposed to be at least 11 GB/s from host to device (and vice versa), but it isn't!
Here's the test program:
#include <iostream>
#include <ctime>
#include <cstdlib>   // for system()
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N (32<<22)

int main(void)
{
    using namespace std;
    cout<<"GPU bandwidth test via thrust, data size: "<< (sizeof(double)*N) / 1000000000.0 <<" Gbytes"<<endl;
    cout<<"============program start=========="<<endl;

    time_t now = time(0);
    cout<<"Initializing h_vec...";
    thrust::host_vector<double> h_vec(N, 0.0);
    cout<<"time spent: "<<time(0)-now<<" secs"<<endl;

    now = time(0);
    cout<<"Uploading data to GPU...";
    thrust::device_vector<double> d_vec = h_vec;
    cout<<"time spent: "<<time(0)-now<<" secs"<<endl;

    now = time(0);
    cout<<"Downloading data to h_vec...";
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    cout<<"time spent: "<<time(0)-now<<" secs"<<endl<<endl;

    system("PAUSE");
    return 0;
}
Program output:
Download speed: less than 1 sec, which makes sense compared to the nominal 11 GB/s.
Upload speed: 1.07374 GB / 32 secs is about 33.5 MB/s, which doesn't make sense at all.
Does anyone know the reason? Or is it just the way thrust is?
Thanks!!
Your comparison has several flaws, some of which are covered in the comments.

bandwidthTest uses a PINNED memory allocation, which thrust does not. Therefore the thrust transfer rate will be slower. This typically accounts for about a 2x factor (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch.

Your timing method is also a problem: time(0) only has 1-second resolution, and the first transfer you time also absorbs one-time costs such as CUDA context creation and the device allocation, which inflates the first measured "upload". Here is code which does proper timing, using cudaEvent timing around the copies and warm-up transfers beforehand (a pinned-memory sketch is also included at the end of this answer):
$ cat t213.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#define DSIZE ((1UL<<20)*32)
int main(){

  thrust::device_vector<int> d_data(DSIZE);
  thrust::host_vector<int>   h_data(DSIZE);
  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // warm-up transfers in both directions: absorb context creation,
  // allocation, and first-touch costs so they are not timed below
  thrust::fill(h_data.begin(), h_data.end(), 1);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  std::cout<< "warm up iteration " << d_data[0] << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 2);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  std::cout<< "warm up iteration " << h_data[0] << std::endl;

  // timed host-to-device copy
  thrust::fill(h_data.begin(), h_data.end(), 3);
  cudaEventRecord(start);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  // timed device-to-host copy
  thrust::fill(d_data.begin(), d_data.end(), 4);
  cudaEventRecord(start);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  std::cout << "finished" << std::endl;
  return 0;
}
I compile with (I have a PCIE Gen2 system with a cc2.0 device)
$ nvcc -O3 -arch=sm_20 -o t213 t213.cu
When I run it I get the following results:
$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s
finished
$
This looks correct to me because a bandwidthTest on my system would report about 6GB/s in either direction, as I have a PCIE Gen2 system. Since thrust uses pageable, not pinned memory, I get about half that bandwidth, i.e. 3GB/s, and thrust is reporting about 2.5GB/s.
For comparison, here is the bandwidth test on my system, using pageable memory:
$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Quadro 5000
Quick Mode
Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1
$
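For completeness, here is a minimal sketch of how you could see the pinned-memory number from your own code rather than from bandwidthTest. It is only a sketch along the lines of t213.cu above: the name PINNED_DSIZE is my own, the host buffer is page-locked via cudaMallocHost instead of living in a thrust::host_vector, and the copy is a plain cudaMemcpy into the device_vector's storage. (Depending on your thrust version, there is also an experimental pinned_allocator you can plug into thrust::host_vector for a similar effect.)

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

#define PINNED_DSIZE ((1UL<<20)*32)   // same element count as DSIZE above

int main(){

  thrust::device_vector<int> d_data(PINNED_DSIZE);
  int *h_data = 0;
  // page-locked (pinned) host allocation -- this is what bandwidthTest uses by default
  cudaMallocHost((void **)&h_data, PINNED_DSIZE*sizeof(int));
  for (size_t i = 0; i < PINNED_DSIZE; i++) h_data[i] = 1;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // warm-up transfer so one-time startup costs are not timed
  cudaMemcpy(thrust::raw_pointer_cast(&d_data[0]), h_data, PINNED_DSIZE*sizeof(int), cudaMemcpyHostToDevice);

  // timed host-to-device copy from the pinned buffer
  cudaEventRecord(start);
  cudaMemcpy(thrust::raw_pointer_cast(&d_data[0]), h_data, PINNED_DSIZE*sizeof(int), cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);   // et is in milliseconds
  std::cout << "pinned host to device apparent bandwidth: " << (((PINNED_DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  cudaFreeHost(h_data);
  return 0;
}

With a pinned source buffer, the host-to-device figure should land near the ~6GB/s that bandwidthTest reports on this system in its default (pinned) mode, i.e. roughly twice the pageable numbers above.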