I just started learning CUDA programming. I was trundling through some simple CUDA C examples and everything was going swimmingly. Then! Suddenly! Thrust! I consider myself versed in C++ functors, and I was taken aback at the difference between CUDA C and Thrust. I find it hard to believe that
#include <stdio.h>
#include <stdlib.h>

// Kernel: square each element in place
__global__ void square(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[idx] = a[idx] * a[idx];
    }
}

int main(int argc, char** argv) {
    float *aHost, *aDevice;
    const int N = 10;
    size_t size = N * sizeof(float);

    aHost = (float*)malloc(size);
    cudaMalloc((void**)&aDevice, size);

    for (int i = 0; i < N; i++) {
        aHost[i] = (float)i;
    }
    cudaMemcpy(aDevice, aHost, size, cudaMemcpyHostToDevice);

    // Round the block count up so every element gets a thread
    int block = 4;
    int nBlock = N / block + (N % block == 0 ? 0 : 1);
    square<<<nBlock, block>>>(aDevice, N);

    cudaMemcpy(aHost, aDevice, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        printf("%d, %f\n", i, aHost[i]);
    }

    free(aHost);
    cudaFree(aDevice);
    return 0;
}
is equivalent to
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

// Functor: callable on both host and device
template <typename T>
struct square {
    __host__ __device__ T operator()(const T& x) const {
        return x * x;
    }
};

int main(int argc, char** argv) {
    const int N = 10;
    thrust::device_vector<float> dVec(N);
    thrust::sequence(dVec.begin(), dVec.end());
    thrust::transform(dVec.begin(), dVec.end(), dVec.begin(), square<float>());
    thrust::copy(dVec.begin(), dVec.end(), std::ostream_iterator<float>(std::cout, "\n"));
    return 0;
}
Am I missing something? Is the above code being run on the GPU? Thrust is a great tool, but I'm skeptical that it takes care of all the heavy C-style memory management.
- Is the Thrust code being executed on the GPU? How can I tell?
- Does Thrust eliminate the bizarre syntax of invoking a kernel?
- Is Thrust actually invoking a kernel?
- Does Thrust automatically handle the thread index computation?

Thanks for your time. Sorry if these are silly questions, but I find it astonishing that the examples I've seen transition instantly from what can be described as a Model T to an M3.
Roughly: yes, of course. Thrust is a library, and like any library it exists to make your life easier. Its great strength is that it hides all the explicit CUDA code, which looks strange to most programmers, behind a friendly, STL-like C++ interface.
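For example, memory transfers shrink to ordinary vector construction and assignment. A minimal sketch (the variable names are mine, for illustration):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main() {
    thrust::host_vector<float> h(10, 1.0f);  // host allocation, like malloc
    thrust::device_vector<float> d = h;      // cudaMalloc + host-to-device cudaMemcpy, done for you
    h = d;                                   // device-to-host cudaMemcpy
    return 0;                                // destructors free both allocations
}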
Thrust uses the GPU, but not only the GPU. It performs the same operations you perform when you write your own code, i.e., the C/C++ bookkeeping for allocating memory, copying data, and setting grid and block sizes, and then it launches kernels on the GPU to do the actual work. If you want to see this for yourself, run your program under a profiler such as nvprof: the kernels Thrust launches on your behalf show up in the kernel list.
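To make the correspondence concrete, here is your Thrust main again (reusing the includes and square functor from above), annotated with the step from your hand-written version that each call replaces. The annotations are my reading of what Thrust does for you, not a literal trace of its internals:

int main(int argc, char** argv) {
    const int N = 10;
    thrust::device_vector<float> dVec(N);        // ~ cudaMalloc((void**)&aDevice, size)
    thrust::sequence(dVec.begin(), dVec.end());  // ~ your init loop + cudaMemcpy(HostToDevice),
                                                 //   except run directly on the device
    thrust::transform(dVec.begin(), dVec.end(),  // ~ square<<<nBlock, block>>>(aDevice, N):
                      dVec.begin(),              //   Thrust picks the grid/block sizes and
                      square<float>());          //   computes idx = blockIdx.x * blockDim.x
                                                 //   + threadIdx.x inside its own kernel
    thrust::copy(dVec.begin(), dVec.end(),       // ~ cudaMemcpy(DeviceToHost) + printf loop
                 std::ostream_iterator<float>(std::cout, "\n"));
    return 0;                                    // dVec's destructor ~ cudaFree(aDevice)
}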
It is a good choice for those who don't want to dig into low-level CUDA internals but do want to take advantage of GPU parallelism for simple (but frequent) problems, like vector operations.
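For instance, SAXPY (y = a*x + y) needs only a functor and a single transform call. A sketch along the same lines (the saxpy name and the constants are mine, for illustration):

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// y[i] = a * x[i] + y[i], the classic SAXPY vector operation
struct saxpy {
    float a;
    saxpy(float a) : a(a) {}
    __host__ __device__ float operator()(const float& x, const float& y) const {
        return a * x + y;
    }
};

int main() {
    thrust::device_vector<float> x(1000, 1.0f);
    thrust::device_vector<float> y(1000, 2.0f);
    // Runs as a kernel on the GPU; no cudaMalloc, cudaMemcpy, or <<<...>>> in sight
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(2.0f));
    return 0;
}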