Search code examples
cudathrust

Thrust Complex Transform of 3 different size vectors


Hello I have this loop in C+, and I was trying to convert it to thrust but without getting the same results... Any ideas? thank you

C++ Code

for (i=0;i<n;i++) 
    for (j=0;j<n;j++) 
      values[i]=values[i]+(binv[i*n+j]*d[j]);

Thrust Code

thrust::fill(values.begin(), values.end(), 0);
thrust::transform(make_zip_iterator(make_tuple(
                thrust::make_permutation_iterator(values.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n))),
                binv.begin(),
                thrust::make_permutation_iterator(d.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexModFunctor(n))))),
                make_zip_iterator(make_tuple(
                thrust::make_permutation_iterator(values.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n))) + n,
                binv.end(),
                thrust::make_permutation_iterator(d.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexModFunctor(n))) + n)),
                thrust::make_permutation_iterator(values.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n))),
                function1()
                );

Thrust Functions

struct IndexDivFunctor: thrust::unary_function<int, int>
{
  int n;

  IndexDivFunctor(int n_) : n(n_) {}

  __host__ __device__
  int operator()(int idx)
  {
    return idx / n;
  }
};

struct IndexModFunctor: thrust::unary_function<int, int>
{
  int n;

  IndexModFunctor(int n_) : n(n_) {}

  __host__ __device__
  int operator()(int idx)
  {
    return idx % n;
  }
};


struct function1
{
  template <typename Tuple>
  __host__ __device__
  double operator()(Tuple v)
  {
    return thrust::get<0>(v) + thrust::get<1>(v) * thrust::get<2>(v);
  }
};

Solution

  • To begin with, some general comments. Your loop

    for (i=0;i<n;i++) 
        for (j=0;j<n;j++) 
          v[i]=v[i]+(B[i*n+j]*d[j]);
    

    is the equivalent of the standard BLAS gemv operation

    enter image description here

    where the matrix is stored in row major order. The optimal way to do this on the device would be using CUBLAS, not something constructed out of thrust primitives.

    Having said that, there is absolutely no way the thrust code you posted is ever going to do what your serial code does. The errors you are seeing are not as a result of floating point associativity. Fundamentally thrust::transform applies the functor supplied to every element of the input iterator and stores the result on the output iterator. To yield the same result as the loop you posted, the thrust::transform call would need to perform (n*n) operations of the fmad functor you posted. Clearly it does not. Further, there is no guarantee that thrust::transform would perform the summation/reduction operation in a fashion that would be safe from memory races.

    The correct solution is probably going to be something like:

    1. Use thrust::transform to compute the (n*n) products of the elements of B and d
    2. Use thrust::reduce_by_key to reduce the products into partial sums, yielding Bd
    3. Use thrust::transform to add the resulting matrix-vector product to v to yield the final result.

    In code, firstly define a functor like this:

    struct functor
    {
      template <typename Tuple>
      __host__ __device__
      double operator()(Tuple v)
      {
        return thrust::get<0>(v) * thrust::get<1>(v);
      }
    };
    

    Then do the following to compute the matrix-vector multiplication

      typedef thrust::device_vector<int> iVec;
      typedef thrust::device_vector<double> dVec;
    
      typedef thrust::counting_iterator<int> countIt;
      typedef thrust::transform_iterator<IndexDivFunctor, countIt> columnIt;
      typedef thrust::transform_iterator<IndexModFunctor, countIt> rowIt;
    
      // Assuming the following allocations on the device
      dVec B(n*n), v(n), d(n);
    
      // transformation iterators mapping to vector rows and columns
      columnIt cv_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n));
      columnIt cv_end   = cv_begin + (n*n);
    
      rowIt rv_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexModFunctor(n));
      rowIt rv_end   = rv_begin + (n*n);
    
      dVec temp(n*n);
      thrust::transform(make_zip_iterator(
                          make_tuple(
                            B.begin(),
                            thrust::make_permutation_iterator(d.begin(),rv_begin) ) ),
                        make_zip_iterator(
                          make_tuple(
                            B.end(),
                            thrust::make_permutation_iterator(d.end(),rv_end) ) ),
                        temp.begin(),
                        functor());
    
      iVec outkey(n);
      dVec Bd(n);
      thrust::reduce_by_key(cv_begin, cv_end, temp.begin(), outkey.begin(), Bd.begin());
      thrust::transform(v.begin(), v.end(), Bd.begin(), v.begin(), thrust::plus<double>());
    

    Of course, this is a terribly inefficient way to do the computation compared to using a purpose designed matrix-vector multiplication code like dgemv from CUBLAS.