Tags: cuda, thrust

CUDA: different answer between CPU and GPU reduce


I ran into a really strange result. I sum all elements of a matrix using thrust::reduce. It works for most of my data, but it goes wrong on one set.

code:

  lbfgsfloatval_t sum(const DeviceVector& A){
      thrust::device_ptr<lbfgsfloatval_t> ptr(A.getPtr());
      thrust::device_vector<double> A_p(ptr, ptr + A.rows() * A.cols());
      lbfgsfloatval_t sums = 0.0;

      // reduce on host (each A_p[i] access copies one element back from the device)
      for (int i = 0; i < A.rows() * A.cols(); i++)
          sums += A_p[i];

      // reduce on device
      lbfgsfloatval_t res = thrust::reduce(A_p.begin(), A_p.end());

      cout << "cpu: " << sums << endl;
      cout << "gpu: " << res  << endl;
      return res;
  }

Notice that the second group disagrees.

output:

cpu: -568.691
gpu: -568.691

cpu: 3.4972e-14
gpu: 1.40998e-14

cpu: 0.234375
gpu: 0.234375

I also tried skipping the thrust::device_vector and reducing over the raw device pointer instead; same output. I also tried a cuBLAS dot product; same output.
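
For reference, here is a minimal, self-contained sketch of that raw-pointer variant: wrap the raw device pointer in a thrust::device_ptr and reduce over it directly, with no thrust::device_vector copy. The array size and fill value below are made up for illustration and are not my actual data.

  #include <cstdio>
  #include <cuda_runtime.h>
  #include <thrust/device_ptr.h>
  #include <thrust/functional.h>
  #include <thrust/reduce.h>

  int main() {
      const int n = 1024;
      double host[1024];
      for (int i = 0; i < n; ++i) host[i] = 1e-16;   // tiny but still normalized doubles

      double* d_raw = nullptr;
      cudaMalloc(&d_raw, n * sizeof(double));
      cudaMemcpy(d_raw, host, n * sizeof(double), cudaMemcpyHostToDevice);

      // Wrap the raw device pointer so thrust::reduce can consume it directly,
      // without first copying into a thrust::device_vector.
      thrust::device_ptr<double> first(d_raw);
      thrust::device_ptr<double> last = first + n;
      double res = thrust::reduce(first, last, 0.0, thrust::plus<double>());

      std::printf("gpu: %g\n", res);                 // expect ~1.024e-13
      cudaFree(d_raw);
      return 0;
  }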

I used MATLAB to confirm that the CPU results above are correct.

What happened? Was it underflow on the GPU? Thanks!


Solution

  • I can only speculate about what could go wrong, but I would assume it is underflow (or, more specifically, a difference in how CPUs and GPUs handle IEEE-754 denormalized numbers):

    http://en.wikipedia.org/wiki/Denormal_number

    Basically, CPUs handle them according to the IEEE-754 standard, albeit very inefficiently.

    GPUs, on the other hand, generally equate them to 0. I do not know whether there is a CUDA way to force the CPU to also flush denormalized numbers for development purposes (I mostly do OpenCL), but the usual C/C++ way is the following (a small CPU-only demonstration is sketched at the end of this answer):

    #include <xmmintrin.h>   // for _MM_SET_FLUSH_ZERO_MODE
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    

    Or, in gcc, compile with -ffast-math.

    Check this SO question: Why does changing 0.1f to 0 slow down performance by 10x?
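
    To make the flush-to-zero effect concrete, here is a minimal, self-contained CPU-only sketch. It assumes an x86-64 host, where scalar double arithmetic goes through SSE and is therefore governed by the MXCSR flush-to-zero / denormals-are-zero flags; the value 1e-310 is just an arbitrary denormal picked for illustration.

    #include <cstdio>
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

    int main() {
        // 1e-310 is below DBL_MIN (~2.2e-308), i.e. a denormalized double.
        volatile double tiny = 1e-310;

        double default_sum = 0.0;
        for (int i = 0; i < 1000; ++i) default_sum += tiny;
        std::printf("default: %g\n", default_sum);   // ~1e-307: denormals are honored

        // Flush denormal results to zero and treat denormal inputs as zero,
        // roughly mimicking hardware without denormal support.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

        double ftz_sum = 0.0;
        for (int i = 0; i < 1000; ++i) ftz_sum += tiny;
        std::printf("ftz/daz: %g\n", ftz_sum);       // 0: the tiny contributions vanish
        return 0;
    }

    If the matrix that produces the mismatched sum contains values in or near the denormal range, this kind of collapse on the device side would explain a GPU total that is smaller than the CPU one.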