Tags: c++, cuda, nvidia, thrust

How can I call a kernel function for multiple Thrust object member functions in CUDA?


Please note that I am an absolute beginner with CUDA, and everything below is untested pseudocode. I'm coming from JavaScript, and my C++ is also super rusty, so I apologize for my ignorance :)

I am attempting to use CUDA for backtesting many different forex strategies.

I'm using Thrust, and I have 1000 objects instantiated from a class (pseudocode):

#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 1000

typedef struct dataPoint {
    ...
} dataPoint;

class Strategy {
    public:
        __device__ __host__ Strategy(...) {
            ...
        }

        __device__ __host__ void backtest(dataPoint data) {
            ...
        }
};

int main() {
    dataPoint data[100000];
    thrust::device_ptr<Strategy> strategies[N];
    int i, j;

    // Instantiate 1000 strategies.
    for (i=0; i<N; i++) {
        strategies[i] = thrust::device_new<Strategy>(...);
    }

    // Iterate over all 100000 data points.
    for (j=0; j<100000; j++) {
        // Somehow run .backtest(data[j]) on each strategy here.
        // i.e. Run backtest() in parallel for all 1000
        // strategy objects here.
    }
}

Now let's say I'd like to run the .backtest() method on each object for each item in data. Procedurally I would do the following:

// Iterate over all 100000 data points.
for (j=0; j<100000; j++) {
    // Iterate over all 1000 strategies.
    for (i=0; i<1000; i++) {
        strategies[i].backtest(data[j]);
    }
}

How might I accomplish this using CUDA such that .backtest() runs in parallel for all strategies at each iteration j through the data?

If I have to completely rearchitect everything, so be it -- I'm open to whatever is necessary. If this isn't possible with classes, then so be it.


Solution

  • Typical thrust code makes frequent use of certain C++ idioms (e.g. functors), so if your C++ is rusty you may want to read up on C++ functors. You may also want to review the thrust quick start guide for a discussion of functors as well as the fancy iterators we will use presently.

    In general, at least from an expression standpoint, I think thrust is well-suited to your problem description. Given the flexibility of thrust expressions for these types of problems, there are undoubtedly many ways to skin the cat; I'm going to try to present something that is "close" to your pseudocode.

    First of all, in thrust we generally try to avoid for-loops. These will be quite slow, because each iteration generally involves host/device interaction (e.g. launching CUDA kernels under the hood). We prefer to use thrust algorithms where possible, as these generally "translate" to one or a small number of CUDA kernels.

    One of the most basic algorithms in thrust is the transform. It comes in a variety of flavors, but basically takes input data and applies an arbitrary operation to it, element by element.
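    For orientation, here is a minimal, self-contained sketch (not part of the problem above) showing thrust::transform applying a squaring functor element by element; the functor name square is just an illustration:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // functor applied to each element
    struct square
    {
      __host__ __device__
      float operator()(float x) const { return x * x; }
    };

    int main() {
      thrust::device_vector<float> in(4, 2.0f);   // {2, 2, 2, 2}
      thrust::device_vector<float> out(4);
      // a single algorithm call: out[i] = in[i] * in[i] for every i
      thrust::transform(in.begin(), in.end(), out.begin(), square());
      return 0;
    }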

    Using basic thrust transform operations, we can initialize your data as well as your strategies, without resorting to a for-loop. We'll construct a device vector of appropriate length for each type of object (dataPoint, Strategy) and then we will use thrust::transform to initialize each vector.

    This leaves us with the task of executing each dataPoint against each Strategy. Ideally, we'd like to execute this in parallel also; not just for each of the for-loop iterations you have suggested, but for every Strategy against every dataPoint, all "at once" (i.e. in a single algorithm call).

    In effect, we can think of a matrix, one axis consisting of dataPoint (of dimension 100000 in your example) and the other axis consisting of Strategy (of dimension 1000 in your example). For each point in this matrix, we envision it holding the result of the application of that Strategy against that dataPoint.

    In thrust, we often prefer to realize such 2D concepts as one-dimensional. Therefore our result space is equal to the number of dataPoint times the number of Strategy. We will create a result device_vector of this size (100000*1000 in your example) to hold the result.

    For the sake of demonstration, since you've given little guidance about the type of arithmetic you want to do, we'll assume the following:

    1. the result of applying a Strategy against a dataPoint is a float.
    2. the dataPoint is a struct consisting of an int (dtype - ignored for this example) and a float (dval). dval will contain, for dataPoint(i), 1.0f + i*10.0f.
    3. the Strategy consists of a multiplier and an adder, to be used as follows:

      result(i) = multiplier(i) * dval + adder(i);
      
    4. applying a Strategy against a dataPoint consists of retrieving the dval associated with the dataPoint, and substituting it into the equation given by item 3 above. This equation is captured in the backtest method of the class Strategy. The backtest method takes an object of type dataPoint as its argument, from which it will retrieve the appropriate dval.

    There are a few more concepts we need to cover. The 1D realization of the 2D result matrix requires that we provide an appropriate indexing scheme, so that at each point in the 2D matrix, given its linear index, we can determine which Strategy and which dataPoint will be used to compute the result at that point. In thrust, we can use a combination of fancy iterators to do this.

    In a nutshell, working from the "inside out": we start with a transform iterator that takes an index-mapping functor and a linear sequence provided by a thrust::counting_iterator, in order to create a map for each matrix dimension. The arithmetic in each mapping functor converts the linear index of result into an appropriate repeating index for the rows and columns of the matrix. Given this transform iterator to create the repeating row or column index, we pass that index to a permutation iterator, which selects the appropriate dataPoint or Strategy for each row/column indicated. These two items (dataPoint, Strategy) are then zipped together in a zip_iterator. The zip_iterator is then passed to the run_strat functor, which actually computes the given Strategy applied to the given dataPoint.
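    For readability, here is a step-by-step sketch of that iterator chain (not part of the answer's original code), using C++11 auto and the same names (idx_t, strat_mapper, data_mapper, run_strat, strategies, data, result) that the full example below defines:

    // Sketch only: decomposes the single transform call used in the full example below.
    // It assumes the surrounding definitions from that example are in scope.

    // linear index 0, 1, 2, ... over the flattened result matrix
    thrust::counting_iterator<idx_t> lin(0);

    // repeating "row" index: which Strategy each result element belongs to
    auto strat_idx = thrust::make_transform_iterator(lin, strat_mapper());
    // repeating "column" index: which dataPoint each result element belongs to
    auto data_idx  = thrust::make_transform_iterator(lin, data_mapper());

    // gather the actual Strategy and dataPoint objects for each result element
    auto strat_sel = thrust::make_permutation_iterator(strategies.begin(), strat_idx);
    auto data_sel  = thrust::make_permutation_iterator(data.begin(), data_idx);

    // zip them so each result element sees its (Strategy, dataPoint) pair
    auto zipped = thrust::make_zip_iterator(thrust::make_tuple(strat_sel, data_sel));

    // run_strat then calls Strategy::backtest(dataPoint) for each pair
    thrust::transform(lin, lin + DSIZE*N, zipped, result.begin(), run_strat());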

    Here's a sample code outlining the above concepts:

    #include <iostream>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/transform.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <math.h>
    
    #define TOL 0.00001f
    
    
    // number of strategies
    #define N 1000
    // number of data points
    #define DSIZE 100000
    
    // could use int instead of size_t here, for these problem dimensions
    typedef size_t idx_t;
    
    
    struct dataPoint {
    
      int dtype;
      float dval;
    };
    
    class Strategy {
    
        float multiplier;
        float adder;
        idx_t id;
    
        public:
    
            __device__ __host__ Strategy(){
              id = 0;
              multiplier = 0.0f;
              adder = 0.0f;
              }
            __device__ __host__ Strategy(idx_t _id) {
              id = _id;
              multiplier = 1.0f + ((float)id)/(float)N;
              adder = (float)id;
            }
    
            __device__ __host__ float backtest(dataPoint data) {
              return multiplier*data.dval+adder;
            }
    };
    
    // functor to initialize dataPoint
    struct data_init
    {
      __host__ __device__
      dataPoint operator()(idx_t id){
        dataPoint temp;
        temp.dtype = id;
        temp.dval = 1.0f + id * 10.0f;
        return temp;
        }
    };
    
    // functor to initialize Strategy
    struct strat_init
    {
      __host__ __device__
      Strategy operator()(idx_t id){
        Strategy temp(id);
        return temp;
        }
    };
    
    // functor to "test" a Strategy against a dataPoint, using backtest method
    struct run_strat
    {
      template <typename T>
      __host__ __device__
      float operator()(idx_t id, T t){
        return (thrust::get<0>(t)).backtest(thrust::get<1>(t));
        }
    };
    
    // mapping functor to generate "row" (Strategy) index from linear index
    struct strat_mapper : public thrust::unary_function<idx_t, idx_t>
    {
      __host__ __device__
      idx_t operator()(idx_t id){
        return id/DSIZE;
        }
    };
    
    // mapping functor to generate "column" (dataPoint) index from linear index
    struct data_mapper : public thrust::unary_function<idx_t, idx_t>
    {
      __host__ __device__
      idx_t operator()(idx_t id){
        return id%DSIZE;
        }
    };
    
    
    
    int main() {
        // initialize data
        thrust::device_vector<dataPoint> data(DSIZE);
        thrust::transform(thrust::counting_iterator<idx_t>(0), thrust::counting_iterator<idx_t>(DSIZE), data.begin(), data_init());
    
        // initialize strategies
        thrust::device_vector<Strategy> strategies(N);
        thrust::transform(thrust::counting_iterator<idx_t>(0), thrust::counting_iterator<idx_t>(N), strategies.begin(), strat_init());
    
        // test each data point against each strategy
    
            // Somehow run .backtest(data[j]) on each strategy here.
            // i.e. Run backtest() in parallel for all 1000
            // strategy objects here.
    
        // allocate space for results for each datapoint against each strategy
        thrust::device_vector<float> result(DSIZE*N);
        thrust::transform(thrust::counting_iterator<idx_t>(0),
                          thrust::counting_iterator<idx_t>(DSIZE*N),
                          thrust::make_zip_iterator(thrust::make_tuple(
                            thrust::make_permutation_iterator(strategies.begin(),
                              thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), strat_mapper())),
                            thrust::make_permutation_iterator(data.begin(),
                              thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), data_mapper())))),
                          result.begin(),
                          run_strat());
    
        // validation
        // this would have to be changed if you change initialization of dataPoint
        // or Strategy
        thrust::host_vector<float> h_result = result;
        for (int j = 0; j < N; j++){
          float m =  1.0f + (float)j/(float)N;
          float a = j;
          for (int i = 0; i < DSIZE; i++){
            float d =  1.0f + i*10.0f;
            if (fabsf(h_result[j*DSIZE+i] - (m*d+a))/(m*d+a) > TOL) {
              std::cout << "mismatch at: " << i << "," << j << " was: " << h_result[j*DSIZE+i]
                        << " should be: " << m*d+a << std::endl;
              return 1;
            }
          }
        }
        return 0;
    }
    

    Notes:

    1. as mentioned, this is one possible realization. I think it should be "reasonably" efficient, but there may well be more efficient realizations in thrust. Probably a more complete analysis of your actual strategies and backtest methods would be appropriate before trying to tackle optimizations.

    2. The final transform operation uses a counting_iterator as its first argument (and its second), but this is effectively "dummy" usage: the index it supplies is ignored by the functor and serves only to size the problem appropriately. It could be eliminated with a simpler realization; in my view the easiest way to do that (without further cluttering the code) would be to use C++11 auto to define the zip_iterator, then pass it, along with an offset version of it, to thrust::transform, using the version that takes just one input range instead of two, as sketched below. I don't think this should make much difference in performance, and I felt the version shown was slightly easier to parse, but perhaps not.
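      A rough sketch of that alternative (an assumption on my part, not code from the answer above; it requires a C++11-capable build and a hypothetical unary counterpart of run_strat that takes only the zipped tuple):

      // Hypothetical unary variant of run_strat, for the single-input overload
      // of thrust::transform (the idx_t argument is no longer needed).
      struct run_strat_unary
      {
        template <typename T>
        __host__ __device__
        float operator()(T t){
          return (thrust::get<0>(t)).backtest(thrust::get<1>(t));
        }
      };

      // ... inside main(), after data and strategies have been initialized:
      auto zit = thrust::make_zip_iterator(thrust::make_tuple(
                   thrust::make_permutation_iterator(strategies.begin(),
                     thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), strat_mapper())),
                   thrust::make_permutation_iterator(data.begin(),
                     thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), data_mapper()))));
      thrust::transform(zit, zit + DSIZE*N, result.begin(), run_strat_unary());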