I'm new to the GPU world and just installed CUDA to write some programs. I played with the thrust library but found that uploading data to the GPU is very slow: only about 35 MB/s for the host-to-device transfer on my reasonably good desktop. Why is that?
Environment: Visual Studio 2012, CUDA 5.0, GTX760, Intel-i7, Windows 7 x64
GPU bandwidth test:
The transfer speed is supposed to be at least 11 GB/s from host to device (and vice versa), but it isn't!
Here's the test program:
#include <iostream>
#include <ctime>
#include <cstdlib>   // for system()
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N (32<<22)

int main(void)
{
    using namespace std;
    cout<<"GPU bandwidth test via thrust, data size: "<< (sizeof(double)*N) / 1000000000.0 <<" Gbytes"<<endl;
    cout<<"============program start=========="<<endl;

    time_t now = time(0);
    cout<<"Initializing h_vec...";
    thrust::host_vector<double> h_vec(N, 0.0);
    cout<<"time spent: "<<time(0)-now<<" secs"<<endl;

    now = time(0);
    cout<<"Uploading data to GPU...";
    thrust::device_vector<double> d_vec = h_vec;
    cout<<"time spent: "<<time(0)-now<<" secs"<<endl;

    now = time(0);
    cout<<"Downloading data to h_vec...";
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    cout<<"time spent: "<<time(0)-now<<" secs"<<endl<<endl;

    system("PAUSE");
    return 0;
}
Program output:
Download speed: less than 1 sec, which makes sense compared to the nominal 11 GB/s.
Upload speed: 1.07374 GB / 32 secs is about 33.5 MB/s, which doesn't make sense at all.
Does anyone know the reason? Or is it just the way thrust is?
Thanks!!
Your comparison has several flaws, some of which are covered in the comments.

bandwidthTest uses a PINNED memory allocation, which thrust does not. Therefore the thrust transfer rate will be slower. This typically accounts for about a 2x factor (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch.

Your timing method is also a problem: time(0) only has 1-second resolution, and the first transfer you time also absorbs one-time costs such as CUDA context creation and the device allocation, which inflates the first measured "upload". Here is code which does proper timing, using cudaEvent timing around the copies and warm-up transfers beforehand (a pinned-memory sketch is also included at the end of this answer):
$ cat t213.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#define DSIZE ((1UL<<20)*32)
int main(){

  thrust::device_vector<int> d_data(DSIZE);
  thrust::host_vector<int>   h_data(DSIZE);
  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // warm-up transfers in both directions: absorb context creation,
  // allocation, and first-touch costs so they are not timed below
  thrust::fill(h_data.begin(), h_data.end(), 1);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  std::cout<< "warm up iteration " << d_data[0] << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 2);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  std::cout<< "warm up iteration " << h_data[0] << std::endl;

  // timed host-to-device copy
  thrust::fill(h_data.begin(), h_data.end(), 3);
  cudaEventRecord(start);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  // timed device-to-host copy
  thrust::fill(d_data.begin(), d_data.end(), 4);
  cudaEventRecord(start);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  std::cout << "finished" << std::endl;
  return 0;
}
I compile with (I have a PCIE Gen2 system with a cc2.0 device)
$ nvcc -O3 -arch=sm_20 -o t213 t213.cu
When I run it I get the following results:
$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s
finished
$
This looks correct to me because a bandwidthTest on my system would report about 6GB/s in either direction, as I have a PCIE Gen2 system. Since thrust uses pageable, not pinned memory, I get about half that bandwidth, i.e. 3GB/s, and thrust is reporting about 2.5GB/s.
For comparison, here is the bandwidth test on my system, using pageable memory:
$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Quadro 5000
Quick Mode
Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1
$
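For completeness, here is a minimal sketch of how you could see the pinned-memory number from your own code rather than from bandwidthTest. It is only a sketch along the lines of t213.cu above: the name PINNED_DSIZE is my own, the host buffer is page-locked via cudaMallocHost instead of living in a thrust::host_vector, and the copy is a plain cudaMemcpy into the device_vector's storage. (Depending on your thrust version, there is also an experimental pinned_allocator you can plug into thrust::host_vector for a similar effect.)

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

#define PINNED_DSIZE ((1UL<<20)*32)   // same element count as DSIZE above

int main(){

  thrust::device_vector<int> d_data(PINNED_DSIZE);
  int *h_data = 0;
  // page-locked (pinned) host allocation -- this is what bandwidthTest uses by default
  cudaMallocHost((void **)&h_data, PINNED_DSIZE*sizeof(int));
  for (size_t i = 0; i < PINNED_DSIZE; i++) h_data[i] = 1;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // warm-up transfer so one-time startup costs are not timed
  cudaMemcpy(thrust::raw_pointer_cast(&d_data[0]), h_data, PINNED_DSIZE*sizeof(int), cudaMemcpyHostToDevice);

  // timed host-to-device copy from the pinned buffer
  cudaEventRecord(start);
  cudaMemcpy(thrust::raw_pointer_cast(&d_data[0]), h_data, PINNED_DSIZE*sizeof(int), cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);   // et is in milliseconds
  std::cout << "pinned host to device apparent bandwidth: " << (((PINNED_DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  cudaFreeHost(h_data);
  return 0;
}

With a pinned source buffer, the host-to-device figure should land near the ~6GB/s that bandwidthTest reports on this system in its default (pinned) mode, i.e. roughly twice the pageable numbers above.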