
Can Thrust transform_reduce work with 2 arrays?


I found that what Thrust provides is quite limited, as the code below shows: I end up with 9*9*2 (1 multiply + 1 reduce) Thrust calls, which is 162 kernel launches, whereas if I write my own kernel, only 1 kernel launch is needed.

for(i=1;i<=9;i++) 
{
    for(j=i;j<=9;j++)
    {
        ATA[i][j]=0;
        for(m=1;m<=50000;m++)
            ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
    }
}

I ended up with the Thrust implementation below:

for(i=1;i<=dim0;i++)
{
    for(j=i;j<=dim0;j++)
    {
        thrust::transform(t_d_X + (idx0[i]-1)*(1+iNumPaths) + 1,
                          t_d_X + (idx0[i]-1)*(1+iNumPaths) + iNumPaths + 1,
                          t_d_X + (idx0[j]-1)*(1+iNumPaths) + 1,
                          t_d_cdataMulti,
                          thrust::multiplies<double>());
        ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti + iNumPaths,
                                   (double)0, thrust::plus<double>());
    }
}

Some analysis:

  1. transform_reduce: will NOT help, because there is an indirection through idx0[i], and two arrays are really involved: the first is X[idx0[i]], the second is X[idx0[j]].

  2. reduce_by_key: will help. But I would need to store all the interim results in one big array and prepare a huge mapping key table of the same size. I will try it out.

  3. transform_iterator: will NOT help, for the same reason as 1.

So I guess I can't avoid writing my own kernel?


Solution

  • I'll bet @m.s. can provide a more efficient approach, but here is one possibility. In order to reduce the entire computation to a single kernel launch by thrust, everything must be handled with a single thrust algorithm call. At the heart of the operation, we are summing many products together to fill a matrix, so I believe thrust::reduce_by_key is an appropriate algorithm to use. That means all the other transformations must be realized with thrust's various "fancy iterators", which are mostly covered in the thrust getting started guide.

    Doing this (handling everything with a single kernel call) makes the code very dense and hard to read. I don't normally like to present thrust code this way, but since it is the crux of your question, it cannot be avoided. So let's unpack the sequence of operations contained in the call to reduce_by_key, working approximately from the inside out. The general basis of this algorithm is to "flatten" all data into a single long logical vector. For illustration, assume our square matrix dimensions are only 2x2 and the length of our m vector is 3. You can think of the "flattening", or linear-index conversion, like this:

    linear index: 0 1 2 3 4 5 6 7 8 9 10 11
         i index: 0 0 0 0 0 0 1 1 1 1 1  1
         j index: 0 0 0 1 1 1 0 0 0 1 1  1
         m index: 0 1 2 0 1 2 0 1 2 0 1  2
         k index: 0 0 0 1 1 1 2 2 2 3 3  3
    

    The "k index" above gives the keys that reduce_by_key will ultimately use to collect the product terms together for each element of the matrix. Note that the code defines EXT_I, EXT_J, EXT_M, and EXT_K helper macros which, using thrust placeholders, express the operation performed on the linear index (created with a counting_iterator) to produce the various other "indices".

    1. The first thing we need to do is construct a suitable thrust operation to convert the linear index into the permuted value idx0[i] (again, working from the inside out). We can do this with a permutation iterator on the idx0 vector, with a transform_iterator supplying the "map" for the permutation iterator - this transform iterator simply converts the linear index (mb) to an "i" index:

      thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I))
      
    2. Now we need to combine the result from step 1 with the other index - m in this case - to generate a linearized version of the 2D index into X (d_X is the vector-linearized version of X). To do this, we combine the result of step 1 in a zip_iterator with another transform iterator that creates the m index. The zip_iterator is passed to a transform_iterator that takes the two indices and converts them into a linearized index to "look into" the d_X vector:

      thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())
      

      create_Xidx is the functor that takes the two computed indices and converts them into the linear index into d_X.

    3. With the result from step 2, we can then use a permutation iterator to grab the appropriate value from d_X for the first term in the multiplication:

      thrust::make_permutation_iterator(d_X.begin(), {code from step 2})
      
    4. Repeat steps 1, 2 and 3, using EXT_J instead of EXT_I, to create the second term in the multiplication:

      X[idx0[i]][m]*X[idx0[j]][m]
      
    5. Place the terms created in steps 3 and 4 into a zip_iterator, for use by a transform_iterator that multiplies them together (using the my_mult functor) to create the actual product:

      thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple({result from step 3}, {result from step 4})), my_mult())
      
    6. The remainder of the reduce_by_key call is fairly straightforward. We create the key sequence as described previously, and use it to sum together the products belonging to each element of the square matrix.

    Here is a fully worked example:

    $ cat t875.cu
    #include <iostream>
    #include <thrust/reduce.h>
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/discard_iterator.h>
    
    // rows
    #define D1 9
    // cols
    #define D2 9
    // size of m
    #define D3 50
    // helpers to convert linear indices to i,j,m or "key" indices
    #define EXT_I (_1/(D2*D3))
    #define EXT_J ((_1/(D3))%D2)
    #define EXT_M (_1%D3)
    #define EXT_K (_1/D3)
    
    
    void test_cpu(float ATA[][D2], float X[][D3], int idx0[]){
      for(int i=0;i<D1;i++)
      {
        for(int j=0;j<D2;j++)
        {
            ATA[i][j]=0;
            for(int m=0;m<D3;m++)
                ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
        }
      }
    }
    
    using namespace thrust::placeholders;
    
    struct create_Xidx : public thrust::unary_function<thrust::tuple<int, int>, int>{
    __host__ __device__
      int operator()(thrust::tuple<int, int> &my_tuple){
        return (thrust::get<0>(my_tuple) * D3) + thrust::get<1>(my_tuple);
        }
    };
    
    struct my_mult : public thrust::unary_function<thrust::tuple<float, float>, float>{
    __host__ __device__
      float operator()(thrust::tuple<float, float> &my_tuple){
        return thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple);
        }
    };
    
    int main(){
    
      //synthesize data
    
      float ATA[D1][D2];
      float X[D1][D3];
      int idx0[D1];
      thrust::host_vector<float>  h_X(D1*D3);
      thrust::host_vector<int>    h_idx0(D1);
      for (int i = 0; i < D1; i++){
        idx0[i] = (i + 2)%D1; h_idx0[i] = idx0[i];
        for (int j = 0; j < D2; j++) {ATA[i][j] = 0;}
        for (int j = 0; j < D3; j++) {X[i][j] = j%(i+1); h_X[i*D3+j] = X[i][j];}}
    
      thrust::device_vector<float>  d_ATA(D1*D2);
      thrust::device_vector<float>  d_X = h_X;
      thrust::device_vector<int>    d_idx0 = h_idx0;
    
      // helpers
      thrust::counting_iterator<int> mb = thrust::make_counting_iterator(0);
      thrust::counting_iterator<int> me = thrust::make_counting_iterator(D1*D2*D3);
    
      // perform computation
    
      thrust::reduce_by_key(
        thrust::make_transform_iterator(mb, EXT_K),
        thrust::make_transform_iterator(me, EXT_K),
        thrust::make_transform_iterator(
          thrust::make_zip_iterator(thrust::make_tuple(
            thrust::make_permutation_iterator(d_X.begin(),
              thrust::make_transform_iterator(
                thrust::make_zip_iterator(thrust::make_tuple(
                  thrust::make_permutation_iterator(d_idx0.begin(),
                    thrust::make_transform_iterator(mb, EXT_I)),
                  thrust::make_transform_iterator(mb, EXT_M))),
                create_Xidx())),
            thrust::make_permutation_iterator(d_X.begin(),
              thrust::make_transform_iterator(
                thrust::make_zip_iterator(thrust::make_tuple(
                  thrust::make_permutation_iterator(d_idx0.begin(),
                    thrust::make_transform_iterator(mb, EXT_J)),
                  thrust::make_transform_iterator(mb, EXT_M))),
                create_Xidx())))),
          my_mult()),
        thrust::make_discard_iterator(),
        d_ATA.begin());
    
    
      thrust::host_vector<float>  h_ATA = d_ATA;
      test_cpu(ATA, X, idx0);
      std::cout << "GPU:        CPU: " << std::endl;
      for (int i = 0; i < D1*D2; i++)
        std::cout << i/D1 << "," << i%D2 << ":" << h_ATA[i] << "      " << ATA[i/D1][i%D2] << std::endl;
    }
    $ nvcc -o t875 t875.cu
    $ ./t875
    GPU:        CPU:
    0,0:81      81
    0,1:73      73
    0,2:99      99
    0,3:153      153
    0,4:145      145
    0,5:169      169
    0,6:219      219
    0,7:0      0
    0,8:25      25
    1,0:73      73
    1,1:169      169
    1,2:146      146
    1,3:193      193
    1,4:212      212
    1,5:313      313
    1,6:280      280
    1,7:0      0
    1,8:49      49
    2,0:99      99
    2,1:146      146
    2,2:300      300
    2,3:234      234
    2,4:289      289
    2,5:334      334
    2,6:390      390
    2,7:0      0
    2,8:50      50
    3,0:153      153
    3,1:193      193
    3,2:234      234
    3,3:441      441
    3,4:370      370
    3,5:433      433
    3,6:480      480
    3,7:0      0
    3,8:73      73
    4,0:145      145
    4,1:212      212
    4,2:289      289
    4,3:370      370
    4,4:637      637
    4,5:476      476
    4,6:547      547
    4,7:0      0
    4,8:72      72
    5,0:169      169
    5,1:313      313
    5,2:334      334
    5,3:433      433
    5,4:476      476
    5,5:841      841
    5,6:604      604
    5,7:0      0
    5,8:97      97
    6,0:219      219
    6,1:280      280
    6,2:390      390
    6,3:480      480
    6,4:547      547
    6,5:604      604
    6,6:1050      1050
    6,7:0      0
    6,8:94      94
    7,0:0      0
    7,1:0      0
    7,2:0      0
    7,3:0      0
    7,4:0      0
    7,5:0      0
    7,6:0      0
    7,7:0      0
    7,8:0      0
    8,0:25      25
    8,1:49      49
    8,2:50      50
    8,3:73      73
    8,4:72      72
    8,5:97      97
    8,6:94      94
    8,7:0      0
    8,8:25      25
    $
    

    Notes:

    1. If you profile the above code with e.g. nvprof --print-gpu-trace ./t875, you will observe two kernel launches. The first is associated with the device_vector creation; the second handles the entire reduce_by_key operation.

    2. I don't know if all this is slower or faster than your CUDA kernel, since you haven't provided it. Sometimes, expertly written CUDA kernels can be faster than thrust algorithms doing the same operation.

    3. It's quite possible that what I have here is not precisely the algorithm you had in mind. For example, your code suggests you're only filling in a triangular portion of ATA, while your description (9*9*2) suggests you want to populate every position. Either way, my intent is not to give you a black box but to demonstrate how you can use various thrust approaches to achieve whatever it is you want in a single kernel call.