I'm trying to parallelize a function with CUDA that is called many times. Each call works with the same matrix. I want to store this matrix in GPU memory, and each time the function is called, upload a vector to the GPU, multiply it by the matrix, and return the result. I prefer the C++ template style, so Thrust has higher priority.
Please recommend some functions for this and, if possible, a few small illustrating samples. I'm not withholding my code because it's a secret, but because of its complexity and sheer size.
For Thrust, `device_vector`, `device_ptr`, etc., are what you are looking for. See also this related question:
From thrust::device_vector to raw pointer and back?
But to program the GPU efficiently, I also suggest becoming familiar with the CUDA memory types:
http://www.cvg.ethz.ch/teaching/2011spring/gpgpu/cuda_memory.pdf (pdf warning)
The type of memory you are looking for is "global memory". Remember that all of this memory is stored on the GPU, not on the host (CPU), so it is only accessible from kernels and device function calls.
Any functor applied to device data just needs to be compiled with the device tag (example unary op):
template <typename T>
struct square
{
    __host__ __device__
    T operator()(const T& x) const {
        return x * x;
    }
};