Simpson's method to integrate real valued functions with CUDA

I'm trying to code integration by Simpson's method in CUDA.

This is the formula for Simpson's rule:

[image: formula for Simpson's rule]

where x_k = a + k*h.
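For reference, the loop in the code below corresponds to the following form of the rule. This is a reconstruction from the code, assuming n is even and h = (b - a)/n:

```latex
\int_a^b f(x)\,dx \;\approx\; \frac{h}{3} \sum_{k=1,3,5,\dots}^{n-1}
  \left[ f(x_{k-1}) + 4 f(x_k) + f(x_{k+1}) \right],
\qquad h = \frac{b-a}{n}, \quad x_k = a + k h
```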

Here's my code:

    __device__ void initThreadBounds(int *n_start, int *n_end, int n, 
                                        int totalBlocks, int blockWidth)
    {
        int threadId = blockWidth * blockIdx.x + threadIdx.x;
        int nextThreadId = threadId + 1;

        int threads = blockWidth * totalBlocks;

        *n_start = (threadId * n)/ threads;
        *n_end =  (nextThreadId * n)/ threads;
    }

    __device__ float reg_func (float x)
    {
        return x;
    }

    typedef float (*p_func) (float);

    __device__ p_func integrale_f = reg_func;

    __device__ void integralSimpsonMethod(int totalBlocks, int totalThreads, 
                    double a, double b, int n, float p_function(float), float* result)
    {
        *result = 0;

        float h = (b - a)/n; 
        //*result = p_function(a)+p_function(a + h * n);
        //parallel
        int idx_start;
        int idx_end;
        initThreadBounds(&idx_start, &idx_end, n-1, totalBlocks, totalThreads);
        //parallel_ends
        for (int i = idx_start; i < idx_end; i+=2) {
            *result +=  ( p_function(a + h*(i-1)) + 
                          4 * p_function(a + h*(i)) + 
                          p_function(a + h*(i+1)) ) * h/3;

        }   
    } 


    __global__ void integralSimpson(int totalBlocks, int totalThreads,  float* result)
    {
        float res = 0;

        integralSimpsonMethod(totalBlocks, totalThreads, 0, 10, 1000, integrale_f, &res);
        result[(blockIdx.x*totalThreads + threadIdx.x)] = res;

        //printf ("Simpson method\n");
    }


    __host__ void inttest()
    {

        const int blocksNum = 32;
        const int threadNum = 32;

        float   *device_resultf; 
        float   host_resultf[threadNum*blocksNum]={0};


        cudaMalloc((void**) &device_resultf, sizeof(float)*threadNum*blocksNum);
        integralSimpson<<<blocksNum, threadNum>>>(blocksNum, threadNum, device_resultf);
        cudaThreadSynchronize();

        cudaMemcpy(host_resultf, device_resultf, sizeof(float) *threadNum*blocksNum, 
                      cudaMemcpyDeviceToHost);

        float sum = 0;
        for (int i = 0; i != blocksNum*threadNum; ++i) {
            sum += host_resultf[i];
            //  printf ("result in %i cell = %f \n", i, host_resultf[i]);
        }
        printf ("sum = %f \n", sum);
        cudaFree(device_resultf);
    }

    int main(int argc, char* argv[])
    {
        inttest();

        int i;
        scanf ("%d",&i);
    }

The problem is that it gives the wrong result when n is less than 100000. For the integral from 0 to 10, the result is ~99, but when n = 100000 or larger it works fine and the result is ~50.

What's wrong, guys?

Solution

The basic problem here is that you don't understand your own algorithm.

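Your integralSimpsonMethod() function is designed so that each thread samples at least three quadrature points per sub-interval of the integration domain, and each pass of its loop integrates a panel that is 2h wide. Therefore, if you choose n so that it is less than four times the number of threads in the kernel call, it is inevitable that the sub-intervals will overlap and the resulting integral will be incorrect. With n = 1000 and 32 x 32 = 1024 threads, as in your code, almost every thread receives a single index i yet still integrates from x_(i-1) to x_(i+1), so nearly the whole domain is counted twice; that is why you get ~99 instead of 50. You need to make sure that the code checks and scales the thread count or n so that the sub-intervals don't overlap when the integral is computed.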
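The overlap can be reproduced on the host without a GPU. The following is a minimal sketch, not your kernel itself: `replayQuestionKernel` is a hypothetical helper that loops over every thread id, applies the same bounds arithmetic as initThreadBounds() and the same stride-2 loop, and accumulates in double rather than float to keep rounding noise out of the picture.

```cpp
#include <cassert>
#include <cmath>

// Same integrand as reg_func in the question: f(x) = x.
static double f(double x) { return x; }

// Hypothetical host-side replay of the question's kernel. Every thread id
// gets the index range [start, end) that initThreadBounds() would give it,
// then runs the same loop with step 2.
double replayQuestionKernel(int n, int totalThreads, double a, double b)
{
    assert(n > 1 && totalThreads > 0);
    double h = (b - a) / n;
    double sum = 0.0;
    for (int tid = 0; tid < totalThreads; ++tid) {
        // Same integer arithmetic as the question (called with n - 1),
        // widened to long long so the product cannot overflow.
        int start = (int)((long long)tid * (n - 1) / totalThreads);
        int end   = (int)((long long)(tid + 1) * (n - 1) / totalThreads);
        for (int i = start; i < end; i += 2) {
            // One panel of width 2h centred on x_i, even if the range
            // assigned to this thread holds only a single index.
            sum += (f(a + h * (i - 1)) + 4 * f(a + h * i) + f(a + h * (i + 1))) * h / 3;
        }
    }
    return sum;
}
```

Calling `replayQuestionKernel(1000, 1024, 0.0, 10.0)` gives roughly 99.7, while `replayQuestionKernel(100000, 1024, 0.0, 10.0)` lands close to 50, which matches the behaviour described in the question.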

If you are doing this for anything other than self-edification, then I recommend you look up the composite version of Simpson's rule. This is much better suited to parallel implementation and will be considerably more performant if implemented correctly.
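As a rough illustration of the composite rule, here is a sketch under a few assumptions: n is even, the integrand is the same f(x) = x, and the names `integrand`, `simpsonPartial` and `simpsonAllThreads` are made up for this example. Each sample point x_k gets a weight of 1 (end points), 4 (odd k) or 2 (even k) and is visited by exactly one thread, so nothing can overlap and the integrand is evaluated only once per point. The code is written so that it builds as plain C++ for testing; the empty macro definitions are only there to make the CUDA qualifiers disappear outside nvcc.

```cpp
#include <cassert>
#include <cmath>

#ifndef __CUDACC__   // plain C++ build: make the CUDA qualifiers vanish
#define __host__
#define __device__
#endif

// Integrand from the question: f(x) = x.
__host__ __device__ inline double integrand(double x) { return x; }

// Partial composite-Simpson sum for one thread (hypothetical helper).
// The thread visits nodes k = tid, tid + totalThreads, ... so every node
// in 0..n is handled exactly once across all threads, whatever n is.
__host__ __device__ inline double simpsonPartial(int tid, int totalThreads,
                                                 double a, double b, int n)
{
    double h = (b - a) / n;
    double sum = 0.0;
    for (int k = tid; k <= n; k += totalThreads) {
        // Weights 1, 4, 2, 4, ..., 2, 4, 1
        double w = (k == 0 || k == n) ? 1.0 : ((k % 2) ? 4.0 : 2.0);
        sum += w * integrand(a + k * h);
    }
    return sum * h / 3.0;
}

// Host-side stand-in for the kernel launch plus the final summation of
// the per-thread results.
double simpsonAllThreads(int totalThreads, double a, double b, int n)
{
    assert(n > 0 && n % 2 == 0);  // composite Simpson needs an even n
    double total = 0.0;
    for (int tid = 0; tid < totalThreads; ++tid)
        total += simpsonPartial(tid, totalThreads, a, b, n);
    return total;
}
```

In a real kernel, each thread would presumably compute `tid = blockIdx.x * blockDim.x + threadIdx.x`, call `simpsonPartial(tid, gridDim.x * blockDim.x, a, b, n)` and store the value in its slot of the result array, leaving the final summation to the host as in the question. With this partitioning, n = 1000 and 1024 threads give 50 for the integral from 0 to 10.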