I've got two reduction algorithms, both from docs.nvidia, so they should be correct, but the first one (the supposedly very efficient one) gives me a wrong result. The second result is better, but I expected more accuracy. Is there an error in the algorithms, or am I doing something wrong?
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <math.h>
#include "cuda_error.h"
//Lock definition
#ifndef __LOCK_H__
#define __LOCK_H__
struct Lock {
    int *mutex;
    Lock( void ) {
        CudaSafeCall( cudaMalloc( (void**)&mutex, sizeof(int) ) );
        CudaSafeCall( cudaMemset( mutex, 0, sizeof(int) ) );
    }
    ~Lock( void ) {
        cudaFree( mutex );
    }
    __device__ void lock( void ) {
        while( atomicCAS( mutex, 0, 1 ) != 0 );
    }
    __device__ void unlock( void ) {
        atomicExch( mutex, 0 );
    }
};
#endif
//-------------------------
const int N = 507904;
const int blockSize = 256;
int blocks = N/blockSize;
template <unsigned int blockSize>
__global__ void reduce(Lock lock, float *g_idata, float *g_odata, int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n)
    {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }
    if (tid < 32)
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0)
    {
        lock.lock();
        *g_odata += sdata[0];
        lock.unlock();
    }
}
__global__ void reduction_sum(Lock lock, float *in, float *out, int N)
{
    extern __shared__ float sf_data[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sf_data[tid] = (i<N) ? in[i] : 0;
    __syncthreads();
    for (int s = blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s)
        {
            sf_data[tid] += sf_data[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0)
    {
        lock.lock();
        *out += sf_data[0];
        lock.unlock();
    }
}
//initializing function
double los()
{
    return (double)rand()/(double)RAND_MAX;
}
//-------------------------------------------
int main (void)
{
    //memory allocation
    float *a;
    float *dev_a, *dev_blocksum1, *dev_blocksum2;
    float s1=0, s2=0, spr=0;
    a = (float*)malloc( N*sizeof(float) );
    CudaSafeCall( cudaMalloc( (void**)&dev_a, N*sizeof(float) ) );
    CudaSafeCall( cudaMemset( dev_a, 0, N*sizeof(float) ) );
    CudaSafeCall( cudaMalloc( (void**)&dev_blocksum1, sizeof(float) ) );
    CudaSafeCall( cudaMalloc( (void**)&dev_blocksum2, sizeof(float) ) );
    CudaSafeCall( cudaMemset( dev_blocksum1, 0, sizeof(float) ) );
    CudaSafeCall( cudaMemset( dev_blocksum2, 0, sizeof(float) ) );
    //--------------------
    //drawing, array fill
    srand(2403);
    int i;
    for (i=0; i<N; i++)
    {
        a[i]=los();
        spr+=a[i];
    }
    printf("CPU sum: %f \n", spr);
    //copy HtoD
    CudaSafeCall( cudaMemcpy( dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice ) );
    //---------------------
    Lock lock;
    //reduce
    reduce<blockSize><<<blocks, blockSize, blockSize*sizeof(float)>>>(lock, dev_a, dev_blocksum1, N);
    CudaSafeCall( cudaMemcpy ( &s1, dev_blocksum1, sizeof(float), cudaMemcpyDeviceToHost ) );
    printf("GPU sum1: %f \n", s1);
    //-----------------------
    //reduction_sum
    reduction_sum<<<blocks, blockSize, blockSize*sizeof(float)>>>(lock, dev_a, dev_blocksum2, N);
    CudaSafeCall( cudaMemcpy ( &s2, dev_blocksum2, sizeof(float), cudaMemcpyDeviceToHost ) );
    printf("GPU sum2: %f \n", s2);
    //---------------------
    return 0;
}
Output:

CPU sum: 253833.515625
GPU sum1: 14021.000000
GPU sum2: 253835.906250
There are a number of things to mention.
I'm not sure your error checking is valid. You haven't shown the file (cuda_error.h) that implements it, and when I run your code with cuda-memcheck, I get various errors reported. They all seem to be related to the lock function.
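For reference, a typical implementation of that sort of macro looks something like the sketch below; this is my guess at what a cuda_error.h along these lines should contain, not necessarily what yours does:

#ifndef CUDA_ERROR_H
#define CUDA_ERROR_H
#include <stdio.h>
#include <stdlib.h>
// abort with file/line context if a CUDA runtime call fails
#define CudaSafeCall(call) do {                                  \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                __FILE__, __LINE__, cudaGetErrorString(err));    \
        exit(EXIT_FAILURE);                                      \
    }                                                            \
} while (0)
#endif

Note that a macro like this only catches the return codes of runtime API calls. To catch errors from the kernels themselves, you also need cudaGetLastError() right after each launch, plus a check on cudaDeviceSynchronize(), and your code does neither.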
I'm not sure why you would use the lock function, and I don't recommend it. I don't think the Lock object is going out of scope when you think it is, based on the way you are using it: passing it by value to a kernel creates copies, and each copy's destructor calls cudaFree() on the mutex. I would suggest using atomicAdd instead, which should be faster and simpler. At a minimum, comment out the cudaFree() statement in your destructor.
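For example, the final accumulation at the end of each kernel reduces to one atomic operation per block. Sketched here for the first kernel; note that atomicAdd on float requires a GPU of compute capability 2.0 or newer:

if (tid == 0)
{
    // one atomicAdd per block replaces the lock()/add/unlock() sequence
    atomicAdd(g_odata, sdata[0]);
}

With that change, the Lock object, its problematic destructor, and the kernel's lock parameter all disappear.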
You are linking to an old presentation. If you review a newer version of it, I think you will see that it now recommends the use of volatile. I'm not going to rewrite your whole code for you, nor summarize that whitepaper again, but if you simply add volatile to the shared memory declaration, for demonstration purposes, it will fix the resultant issue.
Your shared memory variable is declared as int, but you are summing float quantities. That won't work the way you want. You could declare it like this instead:
extern __shared__ volatile float sdata[];
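To make the demonstration concrete, here is your first kernel with those changes applied as a replacement (the volatile float shared memory declaration, plus the lock replaced by the atomicAdd suggested above); everything else is left as you wrote it:

template <unsigned int blockSize>
__global__ void reduce(float *g_idata, float *g_odata, int n)
{
    // volatile: every access goes to shared memory, so the unrolled
    // warp-synchronous stage below sees its neighbors' latest writes
    extern __shared__ volatile float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n)
    {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }
    if (tid < 32)
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) atomicAdd(g_odata, sdata[0]);
}

Since the Lock parameter is gone, the launch becomes reduce<blockSize><<<blocks, blockSize, blockSize*sizeof(float)>>>(dev_a, dev_blocksum1, N);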
The above changes get the code "working" for me. The remaining item is the discrepancy between the CPU and GPU results. I believe this is due to the floating-point order of operations not being the same on the CPU (serial reduction) as on the GPU (parallel reduction). Since the discrepancy arises in the 6th significant digit of a float quantity, I would suggest this is well within the range of reasonableness for floating-point results comparison. If you desire more accuracy, you might try switching from float to double. There are also various floating-point papers you may wish to read that will help with understanding here, such as this one and this one.
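As a quick experiment, you can also see how much of the difference comes from accumulation order alone by summing your same float array into a double accumulator on the host. A small sketch, reusing the a and N already in your code:

// each float addend converts to double exactly; only the running
// total is rounded, and in double precision rather than float
double dsum = 0.0;
for (int i = 0; i < N; i++)
    dsum += (double)a[i];
printf("CPU double sum: %f \n", dsum);

With roughly half a million values averaging 0.5, a float running total grows large relative to each addend and loses low-order bits on every addition; the GPU's tree-shaped reduction mostly adds numbers of similar magnitude, so it would not surprise me if your GPU sum2 is actually closer to the true sum than your serial CPU float sum.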