I am trying to write a program that compares the performance of the GPU versus the CPU when executing merge sort, using CUDA.
The CPU function works fine, but the GPU function fails when I input more than 65536 elements.
I am fairly new to CUDA and understand that I need a workaround in the part where I determine how many blocks and threads to use, but I need help with that. Can anyone give me some guidance? Thanks in advance.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <algorithm>
__global__ void mergeSortGPU(int* arr, int* temp, int size, int mergeSize) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int start = tid * mergeSize * 2;
    int mid = start + mergeSize;
    int end = start + 2 * mergeSize;
    if (start >= size || mid >= size)
        return;
    if (end > size)
        end = size;
    int i = start;
    int j = mid;
    int k = start;
    while (i < mid && j < end) {
        if (arr[i] <= arr[j])
            temp[k++] = arr[i++];
        else
            temp[k++] = arr[j++];
    }
    while (i < mid)
        temp[k++] = arr[i++];
    while (j < end)
        temp[k++] = arr[j++];
    for (int idx = start; idx < end; ++idx)
        arr[idx] = temp[idx];
}
int main() {
    int size;
    std::cout << "Enter the size of the array: ";
    std::cin >> size;

    // Allocate memory for the array
    int* arr = new int[size];
    int* carr = new int[size];
    int* temp = new int[size];

    srand(static_cast<unsigned int>(time(nullptr)));
    for (int i = 0; i < size; ++i) {
        arr[i] = rand() % 100;
        carr[i] = arr[i];
    }

    // GPU variables
    int* gpuArr;
    int* gpuTemp;
    int maxThreadsPerBlock = 1024;
    int threadsPerBlock = std::min(1024, size / 2);
    int blocksPerGrid = (size + (2 * threadsPerBlock) - 1) / (2 * threadsPerBlock);
    blocksPerGrid = std::max(blocksPerGrid, 1);

    // Allocate memory on GPU
    cudaMalloc((void**)&gpuArr, size * sizeof(int));
    cudaMalloc((void**)&gpuTemp, size * sizeof(int));

    // Copy the input array to GPU memory
    cudaMemcpy(gpuArr, arr, size * sizeof(int), cudaMemcpyHostToDevice);

    for (int mergeSize = 1; mergeSize < size; mergeSize *= 2) {
        mergeSortGPU<<<blocksPerGrid, threadsPerBlock>>>(gpuArr, gpuTemp, size, mergeSize);
        cudaDeviceSynchronize();
    }

    // Free allocated memory
    delete[] arr;
    delete[] carr;
    delete[] temp;
    cudaFree(gpuArr);
    cudaFree(gpuTemp);
    return 0;
}
I tried changing blocksPerGrid and threadsPerBlock to their maximums, but I have a feeling that is not the right way.
PaulMcKenzie basically nailed it; that is the shortest way to discover a problem. But I'll cover what I did, which was to start a typical debug session from my point of view.
When I run your code with compute-sanitizer and an input size = 131072, I get all sorts of invalid global read errors. (I don't get any errors with size = 65536.) Let's start the debugging there. (Your code is doing illegal things.) If I follow this, the first error report I get is on this line:
if (arr[i] <= arr[j])
So compute-sanitizer tells me that on that line, illegal global reads are being made. That could happen if either the i or j index is out of range. The next thing I did was put a conditional if test before that line to confirm my suspicion:
...
if ((i < 0) || (i >= size) || (j < 0) || (j >= size)) printf("oops\n");
else
if (arr[i] <= arr[j])
...
Compiling that way, sure enough there was plenty of "oops" output when I ran the code.
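As an aside, in case it helps anyone reproduce this: the compute-sanitizer run above is just the stock invocation (memcheck is the default tool). The executable name merge_sort below is only an example of how you might have built the program; compiling with -lineinfo is what lets the tool attribute errors to source lines.

# build with line info so sanitizer reports can point at source lines
nvcc -lineinfo merge_sort.cu -o merge_sort
# memcheck is the default tool; it flags out-of-bounds global reads/writes
compute-sanitizer ./merge_sort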
So something is causing those indexes to go out of bounds in the size = 131072 case.
If start, mid, and end are calculated properly (and i, j, and k are derived from them), it does not seem possible for them to become out of range. So we must inspect the starting point for these calculations.
After a bit more head-scratching, this calculation caught my eye:
int start = tid * mergeSize * 2;
With a size of 131072, you will launch 64 blocks of 1024 threads each, so that is 65536 threads. So tid can range from 0 to 65535. mergeSize will range up to 65536 (for the size = 131072 case). The overall maximum value for that calculation is 65535 * 65536 * 2 = 8589803520, which will not fit in an int quantity (the corresponding calculation for size = 65536 just barely fits). So you have integer overflow. If I change every usage of int in your kernel that is being used as an indexing type to size_t, the compute-sanitizer errors at size = 131072 go away for me.
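To make that concrete, here is roughly what I mean by that change, sketched on your kernel. It widens the index arithmetic to size_t; it is only one way to do it (a 64-bit signed type or explicit casts on the offending products would also work), and as noted below it says nothing about whether the sort itself is correct:

__global__ void mergeSortGPU(int* arr, int* temp, int size, int mergeSize) {
    // use 64-bit indexing so tid * mergeSize * 2 cannot overflow a 32-bit int
    size_t n = size;   // size is assumed positive here
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t start = tid * mergeSize * 2;
    size_t mid = start + mergeSize;
    size_t end = start + 2 * (size_t)mergeSize;
    if (start >= n || mid >= n)
        return;
    if (end > n)
        end = n;
    // the merge indices are widened as well
    size_t i = start;
    size_t j = mid;
    size_t k = start;
    while (i < mid && j < end) {
        if (arr[i] <= arr[j])
            temp[k++] = arr[i++];
        else
            temp[k++] = arr[j++];
    }
    while (i < mid)
        temp[k++] = arr[i++];
    while (j < end)
        temp[k++] = arr[j++];
    for (size_t idx = start; idx < end; ++idx)
        arr[idx] = temp[idx];
}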
For more debug ideas, unit 12 of this online training course may be helpful.
Note that I am not saying that makes your code defect-free, or that it is correctly sorting anything. I didn't check any of that. I just covered how I located a problem and how to resolve that problem.