Tags: c#, loops, cuda, managed-cuda

Looping over data in CUDA kernel causes app to abort


issue:

As I increase the amount of data processed by the loop inside my CUDA kernel, the app aborts!

exception:

ManagedCuda.CudaException: 'ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory.

question:

I would appreciate it if somebody could shed some light on the limitations I am hitting with my current implementation and on what exactly causes the app to crash.

Alternatively, I am attaching the full kernel code in case somebody can suggest how it could be remodelled so that no exceptions are thrown. The idea is that the kernel accepts combinations and then performs calculations on the same set of data (in a loop). Therefore, the calculations inside the loop must be sequential, while the order in which the kernel threads themselves execute is irrelevant. It's a combinatorics problem.

Any bit of advice is welcome.

code (short version, which is enough to abort the app):

extern "C"
{
    __device__ __constant__ int arraySize;

    __global__ void myKernel(
        unsigned char* __restrict__  output,
        const int* __restrict__  in1,
        const int* __restrict__  in2,
        const double* __restrict__  in3,
        const unsigned char* __restrict__  in4)
    {
        for (int row = 0; row < arraySize; row++)
        {
            // looping over sequential data.
        }
    }
}

In the example above, if arraySize is somewhere close to 50_000, the app starts to abort. With the same kind of input parameters, if we override or hardcode arraySize to 10_000, the code finishes successfully.

code - kernel (full version)

#include <cuda.h> 
#include "cuda_runtime.h"
#include <device_launch_parameters.h> 
#include <texture_fetch_functions.h> 
#include <builtin_types.h> 

#define _SIZE_T_DEFINED

#ifndef __CUDACC__
#define __CUDACC__
#endif

#ifndef __cplusplus
#define __cplusplus
#endif

texture<float2, 2> texref;

extern "C"
{
    __device__ __constant__ int width;
    __device__ __constant__ int limit;
    __device__ __constant__ int arraySize;

    __global__ void myKernel(
        unsigned char* __restrict__  output,
        const int* __restrict__  in1,
        const int* __restrict__  in2,
        const double* __restrict__  in3,
        const unsigned char* __restrict__  in4)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        if (index >= limit)
            return;

        bool isTrue = false;
        int varA = in1[index];
        int varB = in2[index];

        double calculatable = 0;
        for (int row = 0; row < arraySize; row++)
        {
            if (isTrue)
            {
                int idx = width * row + varA;
                if (!in4[idx])
                    continue;

                calculatable = calculatable + in3[row];
                isTrue = false;
            }
            else
            {
                int idx = width * row + varB;
                if (!in4[idx])
                    continue;

                calculatable = calculatable - in3[row];
                isTrue = true;
            }
        }

        if (calculatable >= 0) {
            output[index] = 1;
        }
    }
}

code - host (full version)

    public static void test()
    {
        int N = 10_245_456; // size of an output

        CudaContext cntxt = new CudaContext();
        CUmodule cumodule = cntxt.LoadModule(@"kernel.ptx");
        CudaKernel myKernel = new CudaKernel("myKernel", cumodule, cntxt);

        myKernel.GridDimensions = (N + 255) / 256;
        myKernel.BlockDimensions = Math.Min(N, 256);

        // output
        byte[] out_host = new byte[N]; // i.e. bool
        var out_dev = new CudaDeviceVariable<byte>(out_host.Length);

        // input
        int[] in1_host = new int[N];
        int[] in2_host = new int[N];
        double[] in3_host = new double[50_000]; // change it to 10k and it's OK
        byte[] in4_host = new byte[10_000_000]; // i.e. bool
        var in1_dev = new CudaDeviceVariable<int>(in1_host.Length);
        var in2_dev = new CudaDeviceVariable<int>(in2_host.Length);
        var in3_dev = new CudaDeviceVariable<double>(in3_host.Length);
        var in4_dev = new CudaDeviceVariable<byte>(in4_host.Length);

        // copy input parameters
        in1_dev.CopyToDevice(in1_host);
        in2_dev.CopyToDevice(in2_host);
        in3_dev.CopyToDevice(in3_host);
        in4_dev.CopyToDevice(in4_host);

        myKernel.SetConstantVariable("width", 2);
        myKernel.SetConstantVariable("limit", N);
        myKernel.SetConstantVariable("arraySize", in3_host.Length);

        // exception is thrown here
        myKernel.Run(out_dev.DevicePointer, in1_dev.DevicePointer, in2_dev.DevicePointer, in3_dev.DevicePointer, in4_dev.DevicePointer);

        out_dev.CopyToHost(out_host);
    }

analysis

My initial assumption was that I was having memory issues; however, according to the VS debugger I am hitting a little above 500 MB of data on the host. So I imagine that, no matter how much data I copy to the GPU, it shouldn't exceed 1 GB or even the maximum of 11 GB. Later on I noticed that the crash only happens when the loop inside the kernel has many records of data to process. That makes me believe I am hitting some kind of kernel time-out limitation or something of that sort, though I have no solid proof.
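
One way to turn that hunch into something measurable (a minimal standalone CUDA C sketch, separate from the managedCuda host code above; placeholderKernel just stands in for the real launch): query whether the device runs under a display watchdog and time the launch with CUDA events. If kernelExecTimeoutEnabled reports 1 and the measured time climbs towards roughly two seconds as arraySize grows, the WDDM watchdog is the likely source of ErrorLaunchFailed.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for myKernel; the real launch goes here instead.
__global__ void placeholderKernel() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // 1 means the GPU also drives a display, so long launches are killed by the watchdog.
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    placeholderKernel<<<1, 1>>>();      // substitute the real kernel launch here
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.1f ms, last error: %s\n",
        ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}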

system

My system specs are 16 GB of RAM and a GeForce 1080 Ti with 11 GB. Using CUDA 9.1 and managedCuda version 8.0.22 (also tried the 9.x version from the master branch).

edit 1: 26.04.2018. Just tested the same logic, but in OpenCL. The code not only finishes successfully, but also performs 1.5-5x better than the CUDA version, depending on the input parameter sizes:

kernel void Test (global bool* output, global const int* in1, global const int* in2, global const double* in3, global const bool* in4, const int width, const int arraySize)
{
    int index = get_global_id(0);

    bool isTrue = false;
    int varA = in1[index];
    int varB = in2[index];

    double calculatable = 0;

    for (int row = 0; row < arraySize; row++)
    {
        if (isTrue)
        {
            int idx = width * row + varA;

            if (!in4[idx]) {
                continue;
            }

            calculatable = calculatable + in3[row];
            isTrue = false;
        }
        else
        {
            int idx = width * row + varB;

            if (!in4[idx]) {
                continue;   
            }

            calculatable = calculatable - in3[row];
            isTrue = true;
        }
    }

    if (calculatable >= 0)
    {
        output[index] = true;
    }
}

I don't really want to start an OpenCL/CUDA war here. If there is anything I should be concerned about in my original CUDA implementation, please let me know.

edit 2: 26.04.2018. After following suggestions from the comment section, I was able to increase the amount of data processed before an exception is thrown by 3x. I achieved that by switching to a .ptx generated in Release mode rather than Debug mode. This improvement could be related to the fact that the Debug settings also have Generate GPU Debug Information set to Yes, along with other unnecessary settings that could affect performance. I will now try to find out how the kernel time limit can be increased. I am still not reaching the OpenCL results, but I am getting close.

For CUDA file generation I am using VS2017 Community, a CUDA 9.1 project, the v140 toolset, building for the x64 platform, post-build events disabled, configuration type: utility. Code generation is set to compute_30,sm_30. I am not sure why it's not sm_70, for example; I don't have other options.
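
For reference, the correct target architecture can be read from the device itself (a small standalone sketch using the CUDA runtime API). A GeForce 1080 Ti is Pascal and reports compute capability 6.1, which is why compute_61,sm_61 is the matching setting; sm_70 is Volta and does not apply here.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // A GeForce 1080 Ti (Pascal) should report compute capability 6.1,
    // matching the compute_61,sm_61 code-generation setting.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}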


Solution

  • I have managed to improve the CUDA performance over OpenCL. And, more importantly, the code now finishes executing without exceptions. The credit goes to Robert Crovella. Thank you!

    Before showing the results, here are some specs:

    • CPU Intel i7 8700k 12 cores (6+6)
    • GPU GeForce 1080 Ti 11Gb

    Here are my results (library/technology):

    • CPU parallel for loop: 607907 ms (default)
    • GPU (Alea, CUDA): 9905 ms (x61)
    • GPU (managedCuda, CUDA): 6272 ms (x97)
    • GPU (Coo, OpenCL): 8277 ms (x73)

    The solution 1:

    The solution was to increase the WDDM TDR Delay from the default 2 seconds to 10 seconds. As easy as that. (A rough sketch of an alternative that keeps each launch under the watchdog instead is included at the end of this post.)

    The solution 2:

    I was able to squeeze out a bit more performance by:

    1. updating the compute_30,sm_30 setting to compute_61,sm_61 in the CUDA project properties

    2. using the Release settings instead of Debug

    3. using a .cubin file instead of .ptx

    If anyone still wants to suggest ideas on how to improve the performance any further, please share them! I am open to ideas. This question has been resolved, though!

    p.s. if your display blinks in the same fashion as described here, then try increasing the delay as well.
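
Addendum (a rough, untested sketch related to solution 1): instead of raising the TDR delay, the row loop could be split across several shorter launches so that each launch finishes well under the watchdog. The myKernelChunk name, the rowStart/rowEnd parameters and the extra partial/state buffers below are invented for illustration; the host would have to zero those two buffers once, call the kernel in a loop over row ranges (say 10_000 rows per launch), and pass isLastChunk = 1 on the final call.

extern "C"
{
    __device__ __constant__ int width;
    __device__ __constant__ int limit;

    // Hypothetical chunked variant of myKernel: each launch only walks the rows
    // in [rowStart, rowEnd) and resumes from per-thread state kept in global memory.
    __global__ void myKernelChunk(
        unsigned char* __restrict__  output,
        const int* __restrict__  in1,
        const int* __restrict__  in2,
        const double* __restrict__  in3,
        const unsigned char* __restrict__  in4,
        double* __restrict__  partial,        // running sums, zero-initialised by the host
        unsigned char* __restrict__  state,   // isTrue flags, zero-initialised by the host
        int rowStart,
        int rowEnd,
        int isLastChunk)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        if (index >= limit)
            return;

        bool isTrue = state[index] != 0;
        int varA = in1[index];
        int varB = in2[index];
        double calculatable = partial[index];

        for (int row = rowStart; row < rowEnd; row++)
        {
            int idx = width * row + (isTrue ? varA : varB);
            if (!in4[idx])
                continue;

            calculatable = calculatable + (isTrue ? in3[row] : -in3[row]);
            isTrue = !isTrue;
        }

        // Persist the per-thread state so the next launch can continue where this one stopped.
        partial[index] = calculatable;
        state[index] = isTrue ? 1 : 0;

        if (isLastChunk && calculatable >= 0) {
            output[index] = 1;
        }
    }
}

With arraySize split like that, no single launch does more than a fraction of the original per-thread work, so it should stay clear of the two-second limit without changing the TDR delay.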