I am trying to write a program that compares the performance of the GPU versus the CPU when executing merge sort, using CUDA.
The CPU function works fine, but the GPU function fails when I input more than 65536 elements.
I am fairly new to CUDA and understand that I need a workaround in the part where I determine how many blocks and threads to use, but I need help with that. Can anyone give me some guidance? Thanks in advance.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <algorithm>
__global__ void mergeSortGPU(int* arr, int* temp, int size, int mergeSize) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int start = tid * mergeSize * 2;
    int mid = start + mergeSize;
    int end = start + 2 * mergeSize;
    if (start >= size || mid >= size)
        return;
    if (end > size)
        end = size;
    int i = start;
    int j = mid;
    int k = start;
    while (i < mid && j < end) {
        if (arr[i] <= arr[j])
            temp[k++] = arr[i++];
        else
            temp[k++] = arr[j++];
    }
    while (i < mid)
        temp[k++] = arr[i++];
    while (j < end)
        temp[k++] = arr[j++];
    for (int idx = start; idx < end; ++idx)
        arr[idx] = temp[idx];
}
int main() {
    int size;
    std::cout << "Enter the size of the array: ";
    std::cin >> size;

    // Allocate memory for the array
    int* arr = new int[size];
    int* carr = new int[size];
    int* temp = new int[size];

    srand(static_cast<unsigned int>(time(nullptr)));
    for (int i = 0; i < size; ++i) {
        arr[i] = rand() % 100;
        carr[i] = arr[i];
    }

    // GPU variables
    int* gpuArr;
    int* gpuTemp;
    int maxThreadsPerBlock = 1024;
    int threadsPerBlock = std::min(1024, size / 2);
    int blocksPerGrid = (size + (2 * threadsPerBlock) - 1) / (2 * threadsPerBlock);
    blocksPerGrid = std::max(blocksPerGrid, 1);

    // Allocate memory on GPU
    cudaMalloc((void**)&gpuArr, size * sizeof(int));
    cudaMalloc((void**)&gpuTemp, size * sizeof(int));

    // Copy the input array to GPU memory
    cudaMemcpy(gpuArr, arr, size * sizeof(int), cudaMemcpyHostToDevice);

    for (int mergeSize = 1; mergeSize < size; mergeSize *= 2) {
        mergeSortGPU<<<blocksPerGrid, threadsPerBlock>>>(gpuArr, gpuTemp, size, mergeSize);
        cudaDeviceSynchronize();
    }

    // Free allocated memory
    delete[] arr;
    delete[] carr;
    delete[] temp;
    cudaFree(gpuArr);
    cudaFree(gpuTemp);
    return 0;
}
I tried changing blocksPerGrid and threadsPerBlock to their maximums, but I have a feeling that is not the right way.
PaulMcKenzie basically nailed it; that is the shortest way to discover a problem. But I'll cover what I did, which was to start a typical debug session from my point of view.
When I run your code with compute-sanitizer and an input size = 131072, I get all sorts of invalid global read errors. (I don't get any errors with size = 65536.) Let's start the debugging there. (Your code is doing illegal things.) If I follow this, the first error report I get is on this line:
if (arr[i] <= arr[j])
So compute-sanitizer tells me that on that line, illegal global reads are being made. That could happen if either the i or j index is out of range. The next thing I did was put a conditional if test before that line to confirm my suspicion:
...
if ((i < 0) || (i >= size) || (j < 0) || (j >= size)) printf("oops\n");
else
if (arr[i] <= arr[j])
...
Compiling that way, sure enough there was plenty of "oops" output when I ran the code.
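As an aside, in case it helps anyone reproduce this: the compute-sanitizer run above is just the stock invocation (memcheck is the default tool). The executable name merge_sort below is only an example of how you might have built the program; compiling with -lineinfo is what lets the tool attribute errors to source lines.

# build with line info so sanitizer reports can point at source lines
nvcc -lineinfo merge_sort.cu -o merge_sort
# memcheck is the default tool; it flags out-of-bounds global reads/writes
compute-sanitizer ./merge_sort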
So something is causing those indexes to go out of bounds in the size = 131072 case.
If start, mid, and end are calculated properly (and i, j, and k are derived from them), it does not seem possible for them to become out of range. So we must inspect the starting point for these calculations.
After a bit more head-scratching, this calculation caught my eye:
int start = tid * mergeSize * 2;
With a size of 131072, you will launch 64 blocks of 1024 threads each, so that is 65536 threads. So tid can range from 0 to 65535. mergeSize will range up to 65536 (for the size = 131072 case). The overall maximum value for that calculation is 65535 * 65536 * 2 = 8589803520, which will not fit in an int quantity (the corresponding calculation for size = 65536 just barely fits). So you have integer overflow. If I change every usage of int in your kernel that is being used as an indexing type to size_t, the compute-sanitizer errors at size = 131072 go away for me.
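To make that concrete, here is roughly what I mean by that change, sketched on your kernel. It widens the index arithmetic to size_t; it is only one way to do it (a 64-bit signed type or explicit casts on the offending products would also work), and as noted below it says nothing about whether the sort itself is correct:

__global__ void mergeSortGPU(int* arr, int* temp, int size, int mergeSize) {
    // use 64-bit indexing so tid * mergeSize * 2 cannot overflow a 32-bit int
    size_t n = size;   // size is assumed positive here
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t start = tid * mergeSize * 2;
    size_t mid = start + mergeSize;
    size_t end = start + 2 * (size_t)mergeSize;
    if (start >= n || mid >= n)
        return;
    if (end > n)
        end = n;
    // the merge indices are widened as well
    size_t i = start;
    size_t j = mid;
    size_t k = start;
    while (i < mid && j < end) {
        if (arr[i] <= arr[j])
            temp[k++] = arr[i++];
        else
            temp[k++] = arr[j++];
    }
    while (i < mid)
        temp[k++] = arr[i++];
    while (j < end)
        temp[k++] = arr[j++];
    for (size_t idx = start; idx < end; ++idx)
        arr[idx] = temp[idx];
}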
For more debug ideas, unit 12 of this online training course may be helpful.
Note that I am not saying that makes your code defect-free, or that it is correctly sorting anything. I didn't check any of that. I just covered how I located a problem and how to resolve that problem.