CUDA: Sorting a vector< vector<int> > on the GPU

I have implemented my own comparator for STL's sort function, that helps sorting a std::vector< std::vector<int> > on the CPU. The user gives as an input a std::vector< std::vector<int> > and also a string variable like for example 021. By having this string the sorting is done first on the first column, then on the third column and then on the second column. Example:

Let's say the string is 10

The output will be

My CPU implementation uses a class called Sorting, this class is implemented with the following two files:

Sorting.h

class Sorting{

   private:

   public:

Sorting();
~Sorting();
std::vector<std::vector<int>> applySort(std::vector<std::vector<int>>
    data,const std::string& attr);

 };

Sorting.cpp

 Sorting::Sorting(){}

 Sorting::~Sorting(){}


 std::vector<std::vector<int>> Sorting::applySort(
      std::vector<std::vector<int>> data, const std::string& attr){

         std::sort(data.begin(), data.begin()+data.size(), Comparator(attr));

         return data;
  }

Comparator.h

   class Comparator{

   private:
     std::string attr;

   public:
     Comparator(const std::string& attr) { this->attr = attr; }

     bool operator()(const std::vector<int>& first, const std::vector<int>&
          second){

    size_t i;
    for(i=0;i<attr.size();i++){
        if(first[attr.at(i) - '0'] < second[attr.at(i) - '0']) return true;
        else if(first[attr.at(i) - '0'] > second[attr.at(i)-'0'])          
             return  false;
    }
    return false;
  }

  };

My implementation has been tested and it works properly. I'm interested in doing a similar CUDA implementation that would take advantage of GPU's capabilities in order to produce the output much more faster.

Initially I thought because my goal is a bit confusing, maybe changing an already known implementation for sorting in the GPU would do my job. However I started searching many implementations like the one described here: http://blogs.nvidia.com/2012/09/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/ and it made me realize that it would be something difficult to achieve.

I'm not sure if this is the best course of action. I started searching for libraries and found Thrust. However, although Thrust allows you to define your own comparator, in a question I asked yesterday I learned that it's not possible to create a host_vector < host_vector<int> >.

And I guess transforming my vector of vectors to a single vector wouldn't help me that much because I have no idea how I would then have to implement my comparator class.

I would like to hear your opinion on this issue:

how should I approach this problem?
Is it possible to achieve it with Thrust?
Would doing it in the GPU give a much better performance on my overall code? Note that the vector of vectors can be huge, millions of rows but only a few(5-10) columns.
Would it be better to design my own sort or change a sort function that is already available? This although sounds like a good idea, in practice I feel like it would take me a lot of effort to achieve. Using a simple comparator and a sort function from a library would be the best for me, however the limitations of Thrust don't allow me to do so.

Thank you in advance

Solution

i see that you are trying to implement a lexicographical sorting technique(but i have did it with single 1D huge vector), well i have been there and i have implemented a function which sorts the vectors but actually it is lagging way behind the lexicographical sorting, anyways i am not sure if i can post the code here, so if you need any help i would be glad to help

PS: look into the implementation of lexicographical_sort.cu in thrust example code (i have tweaked it also but that one also is lagging behind) The comparator function you might need in order to check from two distintive places in 1D vector (which contains all the data) is listed down (by the way, this technique is way slower then CPU) but who know you might come up with idea to improve it or use it better then i do

struct arbitrary_functor
{
    template <typename Tuple>   __host__ __device__
        void operator()(Tuple t)
        {
            if(thrust::get<0>(t)>thrust::get<1>(t))
                thrust::get<2>(t) = 1;
            else
                thrust::get<2>(t) = 0;
        }
};

int checkLexo_1vec(const thrust::device_vector<char> & A, int bv1, int bv2, int N){
    int i;

    thrust::device_vector<char> temp(N);
    thrust::device_vector<char> sum(N);

    thrust::for_each(thrust::make_zip_iterator(
                        thrust::make_tuple(A.begin()+bv2, A.begin()+bv1,temp.begin())),
                    thrust::make_zip_iterator(
                        thrust::make_tuple(A.end()+(bv2+N),A.end()+(bv1+N), temp.end())),
                    arbitrary_functor());


    thrust::inclusive_scan(temp.begin(),temp.end(),sum.begin());

    int a = thrust::lower_bound(sum.begin(),sum.end(),1) - sum.begin();

    thrust::for_each(thrust::make_zip_iterator(
                        thrust::make_tuple(A.begin()+bv1, A.begin()+bv2, temp.begin())),
                    thrust::make_zip_iterator(
                        thrust::make_tuple(A.end()+(bv1+N), A.end()+(bv2+N),temp.end())),
                    arbitrary_functor());

    thrust::inclusive_scan(temp.begin(),temp.end(),sum.begin());

    int b = thrust::lower_bound(sum.begin(),sum.end(),1) - sum.begin();

    if(a<=b)
        return 1;
    else
        return 0;
}