Parallel radix sort with virtual memory and write-combining

I'm attempting to implement the variant of parallel radix sort described in http://arxiv.org/pdf/1008.2849v2.pdf (Algorithm 2), but my C++ implementation (for 4 digits in base 10) contains a bug that I'm unable to locate.

For debugging purposes I'm using no parallelism, but the code should still sort correctly.

For instance the line arr.at(i) = item accesses indices outside its bounds in the following

std::vector<int> v = {4612, 4598};
radix_sort2(v);

My implementation is as follows

#include <set>
#include <array>
#include <vector>

void radix_sort2(std::vector<int>& arr) {
    std::array<std::set<int>, 10> buckets3;

    for (const int item : arr) {
        int d = item / 1000;
        buckets3.at(d).insert(item);
    }

    //Prefix sum
    std::array<int, 10> outputIndices;
    outputIndices.at(0) = 0;
    for (int i = 1; i < 10; ++i) {
        outputIndices.at(i) = outputIndices.at(i - 1) +
            buckets3.at(i - 1).size();
    }

    for (const auto& bucket3 : buckets3) {
        std::array<std::set<int>, 10> buckets0, buckets1;
        std::array<int, 10> histogram2 = {};

        for (const int item : bucket3) {
            int d = item % 10;
            buckets0.at(d).insert(item);
        }
        for (const auto& bucket0 : buckets0) {
            for (const int item : bucket0) {
                int d = (item / 10) % 10;
                buckets1.at(d).insert(item);

                int d2 = (item / 100) % 10;
                ++histogram2.at(d2);
            }
        }

        for (const auto& bucket1 : buckets1) {
            for (const int item : bucket1) {
                int d = (item / 100) % 10;
                int i = outputIndices.at(d) + histogram2.at(d);
                ++histogram2.at(d);
                arr.at(i) = item;
            }
        }
    }
}

Can anyone spot my mistake?

Solution

I took at look at the paper you linked. You haven't made any mistakes, none that I can see. In fact, in my estimation, you corrected a mistake in the algorithm.

I wrote out the algorithm and ended up with the exact same problem as you did. After reviewing Algorithm 2, either I woefully mis-understand how it is supposed to work, or it is flawed. There are at least a couple of problems with the algorithm, specifically revolving around outputIndices, and histogram2.

Looking at the algorithm, the final index of an item is determined by the counting sort stored in outputIndices. (lets ignore the histogram for now). If you had an inital array of numbers {0100, 0103, 0102, 0101} The prefix sum of that would be 4. The algorithm makes no indication I can determine to lag the result by 1. That being said, in order for the algorithm to work the way they intend, it does have to be lagged, so, moving on. Now, the prefix sums are 0, 4, 4.... The algorithm doesn't use the MSD as the index into the outputIndices array, it uses "MSD - 1"; So taking 1 as the index into the array, the starting index for the first item without the histogram is 4! Outside the array on the first try. The outputIndices is built with the MSD, it makes sense for it to be accessed by MSD.

Further, even if you tweak the algorithm to correctly to use the MSD into the outputIndices, it still won't sort correctly. With your initial inputs (swapped) {4598, 4612}, they will stay in that order. They are sorted (locally) as if they are 2 digit numbers. If you increase it to have other numbers not starting with 4, they will be globally, sorted, but the local sort is never finished. According to the paper the goal is to use the histogram to do that, but I don't see that happening.

Ultimately, I'm assuming, what you want is an algorithm that works the way described. I've modified the algorithm, keeping with the overall stated goal of the paper of using the MSD to do a global sort, and the rest of the digits by reverse LSD. I don't think these changes should have any impact on your desire to parallel-ize the function.

void radix_sort2(std::vector<int>& arr)
{
    std::array<std::vector<int>, 10> buckets3;

    for (const int item : arr)
    {
        int d = item / 1000;
        buckets3.at(d).push_back(item);
    }

    //Prefix sum
    std::array<int, 10> outputIndices;
    outputIndices.at(0) = 0;

    for (int i = 1; i < 10; ++i)
    {
        outputIndices.at(i) = outputIndices.at(i - 1) + buckets3.at(i - 1).size();
    }

    for (const auto& bucket3 : buckets3)
    {       
        if (bucket3.size() <= 0)
            continue;

        std::array<std::vector<int>, 10> buckets0, buckets1, buckets2;

        for (const int item : bucket3)
            buckets0.at(item % 10).push_back(item);

        for (const auto& bucket0 : buckets0)
            for (const int item : bucket0)
                buckets1.at((item / 10) % 10).push_back(item);

        for (const auto& bucket1 : buckets1)
            for (const int item : bucket1)
                buckets2.at((item / 100) % 10).push_back(item);

        int count = 0;

        for (const auto& bucket2 : buckets2)
        {
            for (const int item : bucket2)
            {
                int d = (item / 1000) % 10;
                int i = outputIndices.at(d) + count;
                ++count;
                arr.at(i) = item;
            }
        }
    }
}

For extensiblility, it would probably make sense to create a helper function that does the local sorting. You should be able to extend it to handle any number of digit numbers that way.