I'm trying to unroll the implementation of a function to optimize my CUDA code. I have a piece of shared memory that was originally slowing the code down, and by "unrolling" the implementation (halving the total number of threads, with each thread doing twice the work) I got substantial performance gains. I want to see whether more unrolling yields further gains, but I made extensive use of tuples to make this work, and the process produces a lot of code duplication that I'd like to cut down on.
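For context, here is a minimal sketch of the kind of 2x unrolling I mean; the kernel and names are hypothetical, not my actual code:

__global__ void scale_unrolled(float* out, const float* in, int n) {
    // Launched with half as many threads; each thread handles two elements.
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n)     out[i]     = 2.0f * in[i];      // first element
    if (i + 1 < n) out[i + 1] = 2.0f * in[i + 1];  // second element, same thread
}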
Here is an example of the kind of thing that happens frequently in my code:
__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...) {
    // all do the same thing, with very similar args as well
    T value1 = someoperation(thrust::get<0>(choice_arg), ...);
    T value2 = someoperation(thrust::get<1>(choice_arg), ...);
    T value3 = someoperation(thrust::get<2>(choice_arg), ...);
    ...
    return thrust::make_tuple(value1, value2, value3, ...);
}
Instead of writing all the boilerplate myself, I'd like to just have a function like this:
__device__
thrust::tuple<T,T,T,...> foo(thrust::tuple<G,G,G,...> choice_arg, ...) {
    return someoperation<CHOICE_ARG_LENGTH>(choice_arg, ...);
}
I've seen how something like this could help, but a normal template loop won't work if I need to return a thrust::tuple. That solution would work if Thrust had thrust::tuple_cat, but the variadic-template tuple work has yet to be merged, despite being done back in 2014, and I can't even find any discussion of merging the tuple_cat implementation. So is it possible to implement the behavior I'm looking for without a thrust::tuple_cat implementation on the GPU?
Note that I cannot use arrays for this: after originally using arrays, I found that switching away from them gave me a 15% speed improvement for free, observed both in the visual profiler and in the real-world application of the algorithm. The code is very performance critical.
If you can use CUDA 9 and C++14, you can do the following; for details see e.g. the documentation of std::integer_sequence on cppreference.com.
#include <cstdio>   // for printf
#include <utility>  // for std::index_sequence, std::make_index_sequence
#include <thrust/tuple.h>

template <typename T>
__device__ T some_operation(T a) {
    return a + 1;  // do something smart
}

template <typename T, std::size_t... I>
__device__ auto foo_impl(const T& t, std::index_sequence<I...>) {
    // Expand the index pack: calls some_operation on every tuple element.
    return thrust::make_tuple(some_operation(thrust::get<I>(t))...);
}

template <typename Tuple>
__device__ auto foo(const Tuple& t) {
    return foo_impl(
        t, std::make_index_sequence<thrust::tuple_size<Tuple>::value>());
}

__global__ void test_kernel() {
    auto result = foo(thrust::make_tuple(3., 2, 7));
    printf("%f, %d, %d\n", thrust::get<0>(result), thrust::get<1>(result),
           thrust::get<2>(result));
}

int main() {
    test_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}
Compile with nvcc -std=c++14 ...
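With some_operation adding 1 to each element, the kernel should print 4.000000, 3, 8.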
If you cannot use C++14, you need to implement your own index_sequence. Here is a working version. Disclaimer: I wrote the index_sequence as it came to mind; you may want to have a look at an implementation from the std library instead. You can find plenty of tutorials about index_sequence/integer_sequence on the web, e.g. on cppreference.com. The basic idea of the index_sequence is that it allows enumerating your tuple (or array) elements. In foo an index_sequence is made which has 0, ..., thrust::tuple_size<Tuple>::value - 1 as its template parameters; for the 3-tuple below, compat::make_index_sequence<3> yields compat::index_sequence<0, 1, 2>. In foo_impl you capture these indices in the variadic pack I... and expand it to call some_operation for each of the tuple elements.
#include <cstdio>   // for printf
#include <cstddef>  // for size_t
#include <thrust/tuple.h>

namespace compat {

template <size_t... Indices>
struct index_sequence {};

namespace detail {

// Builds index_sequence<0, ..., N-1> by prepending N-1 at each recursion step.
template <size_t N, typename Seq = index_sequence<>>
struct make_index_sequence_impl;

template <size_t N, size_t... Indices>
struct make_index_sequence_impl<N, index_sequence<Indices...>> {
    using type = typename make_index_sequence_impl<
        N - 1, index_sequence<N - 1, Indices...>>::type;
};

// Base case at N == 0, so that empty tuples work as well.
template <size_t... Indices>
struct make_index_sequence_impl<0, index_sequence<Indices...>> {
    using type = index_sequence<Indices...>;
};

}  // namespace detail

template <size_t N>
using make_index_sequence = typename detail::make_index_sequence_impl<N>::type;

}  // namespace compat

template <typename T>
__device__ T some_operation(T a) {
    return a + 1;  // do something smart
}

template <typename T, std::size_t... I>
__device__ auto foo_impl(const T& t, compat::index_sequence<I...>)
    -> decltype(thrust::make_tuple(some_operation(thrust::get<I>(t))...)) {
    return thrust::make_tuple(some_operation(thrust::get<I>(t))...);
}

template <typename Tuple>
__device__ auto foo(const Tuple& t) -> decltype(foo_impl(
    t, compat::make_index_sequence<thrust::tuple_size<Tuple>::value>())) {
    return foo_impl(
        t, compat::make_index_sequence<thrust::tuple_size<Tuple>::value>());
}

__global__ void test_kernel() {
    auto result = foo(thrust::make_tuple(3., 2, 7));
    printf("%f, %d, %d\n", thrust::get<0>(result), thrust::get<1>(result),
           thrust::get<2>(result));
}

int main() {
    test_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}