I got a really weird result. I ran a sum over all elements of a matrix using thrust::reduce. It worked fine for most data, but went wrong on one set.
code:
lbfgsfloatval_t sum(const DeviceVector& A) {
    const int n = A.rows() * A.cols();
    thrust::device_ptr<lbfgsfloatval_t> ptr(A.getPtr());
    thrust::device_vector<lbfgsfloatval_t> A_p(ptr, ptr + n);

    // reduce on host (each A_p[i] access copies one element back from the device)
    lbfgsfloatval_t sums = 0.0;
    for (int i = 0; i < n; i++)
        sums += A_p[i];

    // reduce on device
    lbfgsfloatval_t res = thrust::reduce(A_p.begin(), A_p.end());

    std::cout << "cpu: " << sums << std::endl;
    std::cout << "gpu: " << res << std::endl;
    return res;
}
Notice that the second pair of results disagrees.
output:
cpu: -568.691
gpu: -568.691
cpu: 3.4972e-14
gpu: 1.40998e-14
cpu: 0.234375
gpu: 0.234375
I also tried skipping the thrust::device_vector and passing a raw device pointer to thrust::reduce instead: same output. I also tried a cuBLAS dot product: same output.
I used MATLAB to confirm that the CPU results above are correct.
What happened? Was it an underflow on the GPU? Thanks!
I can only speculate about what could go wrong, but I would guess it is underflow (specifically, a difference in how CPUs and GPUs handle IEEE-754 denormalized numbers):
http://en.wikipedia.org/wiki/Denormal_number
Basically, CPUs handle them according to the IEEE-754 standard, albeit very inefficiently. GPUs, on the other hand, generally flush them to 0. I do not know whether there is a CUDA way to force the CPU to also flush denormalized numbers for development purposes (I mostly do OpenCL), but the usual C/C++ way is
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
or, with gcc, to compile with -ffast-math.
Check this SO question: Why does changing 0.1f to 0 slow down performance by 10x?