Cuda Thrust - How to optimize a code using sort_by_key, merge_by_key and reduce_by_key

I am using c++ and cuda/thrust to perform a calculation on the GPU, which is a new field for me. Unfortunately, my code (MCVE below) is not very efficient, so I would like to know how to optimize it. The code performs the following operations:

There are two key vector and two value vector. The key vectors contain basically the i and j of an upper triangular matrix (in this example: of size 4x4).

key1 {0, 0, 0, 1, 1, 2} value1: {0.5, 0.5, 0.5, -1.0, -1.0, 2.0}
key2 {1, 2, 3, 2, 3, 3} value2: {-1, 2.0, -3.5, 2.0, -3.5, -3.5}

The task is to sum over all values which have the same key. To achieve that, I sorted the second value vector using sort_by_key. The result is:

key1 {0, 0, 0, 1, 1, 2} value1: {0.5, 0.5, 0.5, -1.0, -1.0, 2.0}
key2 {1, 2, 2, 3, 3, 3} value2: {-1.0, 2.0, 2.0, -3.5, -3.5, -3.5}

After that, I merged both value vector using merge_by_key, which leads to a new key and value vector with a size double as big than before.

key_merge {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3}
value_merge {0.5, 0.5, 0.5, -1.0, -1.0, -1.0, 2.0, 2.0, 2.0, -3.5, -3.5, -3.5}

The last step is to use reduce_by_key to sum over all values with the same key. The result is:

key {0, 1, 2, 3} value: {1.5, -3.0, 6.0, -10.5}

The code below which perform this operations is quiet slowly and I am afraid that the performance for larger size will be bad. How can it be optimized? Is it possible to fusion sort_by_key, merge_by_key and reduce_by_key? Since I know the resulting key vector from sort_by_key in advance, is it possible to transform the value vector "from an old to a new key"? Does it make sense, to merge two vectors before reducing them or is it faster to use reduce_by_key separately for each pair of value/key vector? Is it possible to speed up the reduce_by_key calculation by using the fact, that here the number of different key value is known and the number of equal keys is always the same?

#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/merge.h>

int main(){
   int key_1[6] = {0, 0, 0, 1, 1, 2};
   int key_2[6] = {1, 2, 3, 2, 3, 3};
   thrust::device_vector<double> k1(key_1,key_1+6);
   thrust::device_vector<double> k2(key_2,key_2+6);

   double value_1[6] = {0.5, 0.5, 0.5, -1.0, -1.0, 2.0};
   double value_2[6] = {-1, 2.0, -3.5, 2.0, -3.5, -3.5};
   thrust::device_vector<double> v1(value_1,value_1+6);
   thrust::device_vector<double> v2(value_2,value_2+6);

   thrust::device_vector<double> mk(12);
   thrust::device_vector<double> mv(12);
   thrust::device_vector<double> rk(4);
   thrust::device_vector<double> rv(4);

   thrust::sort_by_key(k2.begin(), k2.end(), v2.begin());
   thrust::merge_by_key(k1.begin(), k1.end(), k2.begin(), k2.end(),v1.begin(), v2.begin(), mk.begin(), mv.begin());
   thrust::reduce_by_key(mk.begin(), mk.end(), mv.begin(), rk.begin(), rv.begin());

   for (unsigned i=0; i<4; i++) {
     double tmp1 = rk[i];
     double tmp2 = rv[i];
     printf("key value %f is related to %f\n", tmp1, tmp2);
   }
   return 0;
}

Result:

key value 0.000000 is related to 1.500000
key value 1.000000 is related to -3.000000
key value 2.000000 is related to 6.000000
key value 3.000000 is related to -10.500000

Solution

Here is one possible approach that I think might be quicker than your sequence. The key idea is that we want to avoid sorting data where we know the order ahead of time. If we can leverage the order knowledge that we have, instead of sorting the data, we can simply reorder it into the desired arrangement.

Let's make some observations about the data. If your key1 and key2 are in fact the i,j indices of an upper triangular matrix, then we can make some observations about the concatenation of these two vectors:

The concatenated vector will contain equal numbers of each key. (I believe you may have pointed this out in your question.) So in your case, the vector will contain three 0 keys, three 1 keys, three 2 keys, and three 3 keys. I believe this pattern should hold for any upper triangular pattern independent of matrix dimension. So a matrix of dimension N that is upper triangular will have N sets of keys in the concatenated index vector, each set consisting of N-1 like elements.
In the concatenated vector, we can discover/establish a consistent ordering of keys (based on matrix dimension N), which allows us to reorder the vector in like-key-grouped order, without resorting to a traditional sort operation.

If we combine the above 2 ideas, then we can probably solve the entire problem with some scatter operations to replace the sort/merge activity, followed by the thrust::reduce_by_key operation. The scatter operations can be accomplished with thrust::copy to an appropriate thrust::permutation_iterator combined with an appropriate index calculation functor. Since we know exactly what the reordered concatenated key vector will look like (in your dimension-4 example: {0,0,0,1,1,1,2,2,2,3,3,3}), we need not perform the reordering explicitly on it. However we must reorder the value vector using the same mapping. So let's develop the arithmetic for that mapping:

dimension (N=)4 example

vector index:          0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11
desired (group) order: 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3
concatenated keys:     0, 0, 0, 1, 1, 2, 1, 2, 3, 2, 3, 3
group start idx:       0, 0, 0, 3, 3, 6, 3, 6, 9, 6, 9, 9
group offset idx:      0, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2
destination idx:       0, 1, 2, 3, 4, 6, 5, 7, 9, 8,10,11


dimension (N=)5 example

vector index:          0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19
desired (group) order: 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
concatenated keys:     0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4
group start idx:       0, 0, 0, 0, 4, 4, 4, 8, 8,12, 4, 8,12,16, 8,12,16,12,16,16
group offset idx:      0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 3, 2, 1, 0, 3, 2, 1, 3, 2, 3 
destination idx:       0, 1, 2, 3, 4, 5, 6,10, 7, 8,11,14, 9,12,15,17,13,16,18,19

We can observe that in each case, the destination index (i.e. the location to move the selected key or value to, for the desired group order) is equal to the group start index plus the group offset index. The group start index is simply the key multiplied by (N-1). The group offset index is a pattern similar to an upper or lower triangular index pattern (in 2 different incarnations, for each half of the concatenated vector). The concatenated keys is simply the concatenation of the key1 and key2 vectors (we will create this concatenation virtually using permutation_iterator). The desired group order is known a-priori, it is simply a sequence of integer groups, with N groups consisting of N-1 elements each. It is equivalent to the sorted version of the concatenated key vector. Therefore we can directly compute the destination index in a functor.

For the creation of the group offset index patterns, we can subtract your two key vectors (and subtract an additional 1):

key2:                  1, 2, 3, 2, 3, 3
key1:                  0, 0, 0, 1, 1, 2
key1+1:                1, 1, 1, 2, 2, 3
p1 = key2-(key1+1):    0, 1, 2, 0, 1, 0
p2 = (N-2)-p1:         2, 1, 0, 2, 1, 2
grp offset idx=p1|p2:  0, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2

Here is a fully-worked example demonstrating the above concepts using your example data:

$ cat t1133.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

// "triangular sort" index generator
struct idx_functor
{
  int n;
  idx_functor(int _n): n(_n) {};
  template <typename T>
  __host__ __device__
  int operator()(const T &t){
    int k1 = thrust::get<0>(t);
    int k2 = thrust::get<1>(t);
    int id = thrust::get<2>(t);
    int go,k;
    if (id < (n*(n-1))/2){ // first half
      go = k2-k1-1;
      k  = k1;
      }
    else { // second half
      go = n-k2+k1-1;
      k  = k2;
      }
    return k*(n-1)+go;
  }
};


const int N = 4;
using namespace thrust::placeholders;

int main(){
   // useful dimensions
   int d1 = N*(N-1);
   int d2 = d1/2;
   // iniitialize keys
   int key_1[] = {0, 0, 0, 1, 1, 2};
   int key_2[] = {1, 2, 3, 2, 3, 3};
   thrust::device_vector<int> k1(key_1, key_1+d2);
   thrust::device_vector<int> k2(key_2, key_2+d2);
   // initialize values
   double value_1[] = {0.5, 0.5, 0.5, -1.0, -1.0, 2.0};
   double value_2[] = {-1, 2.0, -3.5, 2.0, -3.5, -3.5};
   thrust::device_vector<double> v(d1);
   thrust::device_vector<double> vg(d1);
   thrust::copy_n(value_1, d2, v.begin());
   thrust::copy_n(value_2, d2, v.begin()+d2);
   // reorder (group) values by key
   thrust::copy(v.begin(), v.end(), thrust::make_permutation_iterator(vg.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(k1.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%d2)), thrust::make_permutation_iterator(k2.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%d2)), thrust::counting_iterator<int>(0))), idx_functor(N))));
   // sum results
   thrust::device_vector<double> rv(N);
   thrust::device_vector<int> rk(N);
   thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/(N-1)), thrust::make_transform_iterator(thrust::counting_iterator<int>(d1), _1/(N-1)), vg.begin(), rk.begin(), rv.begin());
   // print results
   std::cout << "Keys:" << std::endl;
   thrust::copy_n(rk.begin(), N, std::ostream_iterator<int>(std::cout, ", "));
   std::cout << std::endl <<  "Sums:" << std::endl;
   thrust::copy_n(rv.begin(), N, std::ostream_iterator<double>(std::cout, ", "));
   std::cout << std::endl;
   return 0;
}
$ nvcc -std=c++11 -o t1133 t1133.cu
$ ./t1133
Keys:
0, 1, 2, 3,
Sums:
1.5, -3, 6, -10.5,
$

The net effect is that your thrust::sort_by_key and thrust::merge_by_key operations have been replaced by a single thrust::copy operation which should be more efficient.