Search code examples
c++optimizationquicksortloop-unrolling

Manual loop unrolling within a C++ Introsort Runs Incorrectly


I'm writing a simple in-place introsort in C++, in which I'm trying to manually unroll a loop within the partition function for the sake of optimization. The program, which I'll include below, compiles but isn't able to sort a random list correctly.

This program is being compiled down for RISC-V architecture, and even under -Ofast, (riscv-64-unknown-elf-gcc) gcc doesn't seem to be unrolling the loop on its own, making a manual check every cycle through to ensure the end condition is met. I'd like to space this check out to try and maximize performance - it's my understanding that some compilers wind up doing this by default.

I've tried breaking this loop up into chunks of 5, to prove the concept before I go further (perhaps with multiple segments, e.g. try going through groups of 32 then try going through groups of 16 etc.), then doing the last few elements of the array as I have previously. Before unrolling the program worked fine, but now the sort fails and I'm not sure how to proceed.

Here's the partition function in question:

int* partition(int *startptr, int *endptr) {
    int x = *endptr; // threshold
    int *j, tmp, tmp2, *i = startptr - 1;
    for (j = startptr; j+5 < endptr; j+=5) {

        int pj = *j;
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp;
        }

        pj = j[1];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp; }

        pj = j[2];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp; }

        pj = j[3];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp; }

        pj = j[4];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp; }
        }

    j -= 5; 
    for (int *y = j; y < endptr; y++) {
        int py = y[0];
        if (py <= x) {
            i += 1;
            tmp = *i;
            *i = py;
            *y = tmp;
            } 
        }

    int *incrementedi = i + 1;
    tmp = *incrementedi;   //p[i+1]
    tmp2 = *endptr; //p[end]
    *endptr = tmp;
    *incrementedi = tmp2;
    return i + 1;
 }

At the end of the program, I print out the array and loop through, asserting that's it in ascending order as anticipated. The output appears sorted in small chunks, but it's not fully accurate, and I'm not sure how to proceed. Thank you!


Edit for clarification: I'm verifying that the loop is not in fact unrolling by looking at the output of ...-gcc -S. The partition function is inlining nicely but it still performs the check over every iteration.

It's worth noting that I'm using pointers whenever possible for a similar reason - the compiler isn't optimizing for the instruction savings we get when we don't have to convert array indices to actual pointers.


Solution

  • This example code works, about 11% faster in 64 bit mode (more registers). The compiler optimized the compare and conditional copy of pj[...] via tmp to use a register (and it cycled through registers to allow some overlap).

    int * Partition(int *plo, int *phi)
    {
        int *pi = plo;
        int *pj = plo;
        int pvt = *phi;
        int tmp;
        int *ph8 = phi - 8;
        for (pj = plo; pj < ph8; pj += 8)
        {
            if (pj[0] < pvt)
            {
                tmp = pj[0];
                pj[0] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[1] < pvt)
            {
                tmp = pj[1];
                pj[1] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[2] < pvt)
            {
                tmp = pj[2];
                pj[2] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[3] < pvt)
            {
                tmp = pj[3];
                pj[3] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[4] < pvt)
            {
                tmp = pj[4];
                pj[4] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[5] < pvt)
            {
                tmp = pj[5];
                pj[5] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[6] < pvt)
            {
                tmp = pj[6];
                pj[6] = *pi;
                *pi = tmp;
                ++pi;
            }
            if (pj[7] < pvt)
            {
                tmp = pj[7];
                pj[7] = *pi;
                *pi = tmp;
                ++pi;
            }
        }
        for (; pj < phi; ++pj)
        {
            if (*pj < pvt)
            {
                tmp = *pj;
                *pj = *pi;
                *pi = tmp;
                ++pi;
            }
        }
        tmp  = *phi;
        *phi = *pi;
        *pi  = tmp;
        return pi;
    }
    
    void QuickSort(int *plo, int *phi)
    {
    int *p;
        if (plo < phi)
        {
            p = Partition(plo, phi);
            QuickSort(plo, p-1);
            QuickSort(p+1, phi);
        }
    }