python-3.x, cuda, pycuda

Parallel QuickSort, can someone help me?


I am trying to parallelize quicksort by writing a kernel that separates the input list into two sublists relative to the pivot. I am having problems with the kernel syntax and with saving the sizes of the two new lists at the end of the kernel. How do I get rid of the syntax errors and retrieve the list sizes once the kernel finishes?


    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda import gpuarray, compiler
    from pycuda.compiler import SourceModule
    import time
    import numpy as np


    def quickSort_paralleloGlobal(listElements: list) -> list:

        if len(listElements) <= 1:  

            return listElements

        else:

            pivo = listElements.pop()
            list1 = []
            list2 = []

            kernel_code_template = """
                    __global__ void separateQuick(int *listElements, int *list1, int *list2, int pivo)
                    {
                        int index1 = 0, index2 = 0;
                        int index = blockIdx.x * blockDim.x + threadIdx.x;
                        int stride = blockDim.x * gridDim.x;
                        for (int i = index; i < %(ARRAY_SIZE)s; i+= stride)
                            if (lista[i] < pivo
                            {
                                list1[index2] = listElements[i];
                                index1++;
                            }
                            else
                            {
                                list2[index2] = listElements[i];
                                index2++;
                            }
                    }
                    """
            SIZE = len(listElements)

            listElements = np.asarray(listElements) 
            listElements = listElements.astype(np.int)  
            lista_gpu = cuda.mem_alloc(listElements.nbytes)
            cuda.memcpy_htod(lista_gpu, listElements)  

            list1_gpu = cuda.mem_alloc(listElements.nbytes) 
            list2_gpu = cuda.mem_alloc(listElements.nbytes)  

            BLOCK_SIZE = 256
            NUM_BLOCKS = (SIZE + BLOCK_SIZE - 1) // BLOCK_SIZE
            kernel_code = kernel_code_template % {
                'ARRAY_SIZE': SIZE
            }

            mod = compiler.SourceModule(kernel_code)
            arraysQuick = mod.get_function("separateQuick")

            arraysQuick(lista_gpu, list1_gpu, list2_gpu, pivo, block=(BLOCK_SIZE, 1, 1), grid=(NUM_BLOCKS, 1))

            list1 = list1_gpu.get()
            list2 = list2_gpu.get()

            np.allclose(list1, list1_gpu.get())
            np.allclose(list2, list2_gpu.get())

            return quickSort_paralleloGlobal(list1) + [pivo] + quickSort_paralleloGlobal(list2)

Here is the error I get when I run it:

Traceback (most recent call last):
  File "C:/Users/mateu/Documents/GitHub/ppc_Sorting_and_Merging/quickSort.py", line 104, in <module>
    print(quickSort_paraleloGlobal([1, 5, 4, 2, 0]))
  File "C:/Users/mateu/Documents/GitHub/ppc_Sorting_and_Merging/quickSort.py", line 60, in quickSort_paraleloGlobal
    mod = compiler.SourceModule(kernel_code)
  File "C:\Users\mateu\Documents\GitHub\ppc_Sorting_and_Merging\venv\lib\site-packages\pycuda\compiler.py", line 291, in __init__
    arch, code, cache_dir, include_dirs)
  File "C:\Users\mateu\Documents\GitHub\ppc_Sorting_and_Merging\venv\lib\site-packages\pycuda\compiler.py", line 254, in compile
    return compile_plain(source, options, keep, nvcc, cache_dir, target)
  File "C:\Users\mateu\Documents\GitHub\ppc_Sorting_and_Merging\venv\lib\site-packages\pycuda\compiler.py", line 137, in compile_plain
    stderr=stderr.decode("utf-8", "replace"))
pycuda.driver.CompileError: nvcc compilation of C:\Users\mateu\AppData\Local\Temp\tmpefxgkfkk\kernel.cu failed
[command: nvcc --cubin -arch sm_61 -m64 -Ic:\users\mateu\documents\github\ppc_sorting_and_merging\venv\lib\site-packages\pycuda\cuda kernel.cu]
[stdout:
kernel.cu
]
[stderr:
kernel.cu(10): error: expected a ")"

kernel.cu(19): warning: parsing restarts here after previous syntax error

kernel.cu(19): error: expected a statement

kernel.cu(5): warning: variable "indexMenor" was declared but never referenced

kernel.cu(5): warning: variable "indexMaior" was declared but never referenced

2 errors detected in the compilation of "C:/Users/mateu/AppData/Local/Temp/tmpxft_00004260_00000000-10_kernel.cpp1.ii".
]

Process finished with exit code 1

Solution

  • There are a number of problems with your code; I don't think I will be able to list them all. However, one of the central problems is that you have attempted a naive conversion of a serial quicksort into a thread-parallel quicksort, and such a simple conversion is not possible.

    Allowing threads to work in parallel while dividing an input list into two separate output lists requires a number of changes to your kernel code: as written, each thread keeps its own local index1 and index2 counters, so multiple threads would all write to the same output slots.

    However, we can address most of the other issues by limiting your kernel launches to one thread each.

    With that idea, the following code appears to sort the given input correctly:

    $ cat t18.py
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda import gpuarray, compiler
    from pycuda.compiler import SourceModule
    import time
    import numpy as np
    
    
    def quickSort_paralleloGlobal(listElements):
    
            if len(listElements) <= 1:
    
                return listElements
    
            else:
    
                pivo = listElements.pop()
                pivo = np.int32(pivo)
    
                kernel_code_template = """
                        __global__ void separateQuick(int *listElements, int *list1, int *list2, int *l1_size, int *l2_size, int pivo)
                        {
                            int index1 = 0, index2 = 0;
                            int index = blockIdx.x * blockDim.x + threadIdx.x;
                            int stride = blockDim.x * gridDim.x;
                            for (int i = index; i < %(ARRAY_SIZE)s; i+= stride)
                                if (listElements[i] < pivo)
                                {
                                    list1[index1] = listElements[i];
                                    index1++;
                                }
                                else
                                {
                                    list2[index2] = listElements[i];
                                    index2++;
                                }
                            *l1_size = index1;
                            *l2_size = index2;
                        }
                        """
                SIZE = len(listElements)
    
                listElements = np.asarray(listElements)
                listElements = listElements.astype(np.int32)
                lista_gpu = cuda.mem_alloc(listElements.nbytes)
                cuda.memcpy_htod(lista_gpu, listElements)
    
                list1_gpu = cuda.mem_alloc(listElements.nbytes)
                list2_gpu = cuda.mem_alloc(listElements.nbytes)
                l1_size   = cuda.mem_alloc(4)
                l2_size   = cuda.mem_alloc(4)
                BLOCK_SIZE = 1
                NUM_BLOCKS = 1
                kernel_code = kernel_code_template % {
                    'ARRAY_SIZE': SIZE
                }
    
                mod = compiler.SourceModule(kernel_code)
                arraysQuick = mod.get_function("separateQuick")
    
                arraysQuick(lista_gpu, list1_gpu, list2_gpu, l1_size, l2_size, pivo, block=(BLOCK_SIZE, 1, 1), grid=(NUM_BLOCKS, 1))
                l1_sh = np.zeros(1, dtype = np.int32)
                l2_sh = np.zeros(1, dtype = np.int32)
                cuda.memcpy_dtoh(l1_sh, l1_size)
                cuda.memcpy_dtoh(l2_sh, l2_size)
                list1 = np.zeros(l1_sh, dtype=np.int32)
                list2 = np.zeros(l2_sh, dtype=np.int32)
                cuda.memcpy_dtoh(list1, list1_gpu)
                cuda.memcpy_dtoh(list2, list2_gpu)
                list1 = list1.tolist()
                list2 = list2.tolist()
                return quickSort_paralleloGlobal(list1) + [pivo] + quickSort_paralleloGlobal(list2)
    
    print(quickSort_paralleloGlobal([1, 5, 4, 2, 0]))
    $ python t18.py
    [0, 1, 2, 4, 5]
    $
    

    The next step in the porting process would be to convert your naive serial kernel to one that could operate in a thread-parallel fashion. One relatively simple approach would be to use atomics to manage all output data (both lists, as well as updates to the sizes of each list).

    Here is one possible approach:

    $ cat t18.py
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda import gpuarray, compiler
    from pycuda.compiler import SourceModule
    import time
    import numpy as np
    
    
    def quickSort_paralleloGlobal(listElements):
    
            if len(listElements) <= 1:
    
                return listElements
    
            else:
    
                pivo = listElements.pop()
                pivo = np.int32(pivo)
    
                kernel_code_template = """
                        __global__ void separateQuick(int *listElements, int *list1, int *list2, int *l1_size, int *l2_size, int pivo)
                        {
                            int index = blockIdx.x * blockDim.x + threadIdx.x;
                            int stride = blockDim.x * gridDim.x;
                            for (int i = index; i < %(ARRAY_SIZE)s; i+= stride)
                                if (listElements[i] < pivo)
                                {
                                    list1[atomicAdd(l1_size, 1)] = listElements[i];
                                }
                                else
                                {
                                    list2[atomicAdd(l2_size, 1)] = listElements[i];
                                }
                        }
                        """
                SIZE = len(listElements)
    
                listElements = np.asarray(listElements)
                listElements = listElements.astype(np.int32)
                lista_gpu = cuda.mem_alloc(listElements.nbytes)
                cuda.memcpy_htod(lista_gpu, listElements)
    
                list1_gpu = cuda.mem_alloc(listElements.nbytes)
                list2_gpu = cuda.mem_alloc(listElements.nbytes)
                l1_size   = cuda.mem_alloc(4)
                l2_size   = cuda.mem_alloc(4)
                BLOCK_SIZE = 256
                NUM_BLOCKS = (SIZE + BLOCK_SIZE - 1) // BLOCK_SIZE
                kernel_code = kernel_code_template % {
                    'ARRAY_SIZE': SIZE
                }
    
                mod = compiler.SourceModule(kernel_code)
                arraysQuick = mod.get_function("separateQuick")
                l1_sh = np.zeros(1, dtype = np.int32)
                l2_sh = np.zeros(1, dtype = np.int32)
                cuda.memcpy_htod(l1_size, l1_sh)
                cuda.memcpy_htod(l2_size, l2_sh)
                arraysQuick(lista_gpu, list1_gpu, list2_gpu, l1_size, l2_size, pivo, block=(BLOCK_SIZE, 1, 1), grid=(NUM_BLOCKS, 1))
                cuda.memcpy_dtoh(l1_sh, l1_size)
                cuda.memcpy_dtoh(l2_sh, l2_size)
                list1 = np.zeros(l1_sh, dtype=np.int32)
                list2 = np.zeros(l2_sh, dtype=np.int32)
                cuda.memcpy_dtoh(list1, list1_gpu)
                cuda.memcpy_dtoh(list2, list2_gpu)
                list1 = list1.tolist()
                list2 = list2.tolist()
                return quickSort_paralleloGlobal(list1) + [pivo] + quickSort_paralleloGlobal(list2)
    
    print(quickSort_paralleloGlobal([1, 5, 4, 2, 0]))
    $ python t18.py
    [0, 1, 2, 4, 5]
    $
    

    I'm not suggesting that the above examples are perfect or defect-free. Also, I have not identified each and every change I made to your code. I suggest you study the differences between these examples and your posted code.

    I should also mention that this isn't a fast or efficient way to sort numbers on the GPU; I assume this is a learning exercise. If you're interested in fast parallel sorting, you are encouraged to use a library implementation. If you want to do this from Python, one possible implementation is provided by CuPy.
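
    For example, a minimal sketch using CuPy's built-in sort might look like this (this assumes CuPy is installed as an extra dependency matching your CUDA toolkit, and a CUDA-capable device is available):

    import numpy as np
    import cupy as cp   # assumed dependency: a CuPy build matching your CUDA toolkit

    # Copy the host array to the GPU, sort it on the device, and copy the result back.
    data = np.array([1, 5, 4, 2, 0], dtype=np.int32)
    d_data = cp.asarray(data)      # host -> device transfer
    d_sorted = cp.sort(d_data)     # device-side sort
    print(cp.asnumpy(d_sorted))    # device -> host; prints [0 1 2 4 5]

    Under the hood such libraries typically rely on highly optimized radix or merge sorts, which will far outperform a recursive quicksort that launches one small kernel per partition.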