
Use Thrust OMP to parallelize a Monte Carlo on CPU


The goal is to parallelize a Monte Carlo process using thrust::omp

int main()
{
  unsigned Nsimulations = 1000;
  // construct some objects here that will be required for Monte Carlo
  A a;
  B b;
  C c;

  /*
   * use thrust::omp to run the simulation Nsimulation times
   * and write a custom reduction function (a sum for objects of type R)
   */
}

// this is the Monte Carlo function - it needs the global variables a, b, c
// passed by reference because they are very large; the return type is R
R monteCarlo(A& a, B& b, C& c)
{
   // something supercomplicated here
   return r;
}

I need to know:

  1. whether/how I can access the global variables a, b, c (read-only, so there are no race-condition problems here)
  2. how to set the number of threads (Nsimulations is in the thousands, maybe more, so I wouldn't want overkill there)
  3. how to run the monteCarlo function Nsimulations times, possibly storing the results in a vector, followed by either a thrust reduction or a serial reduction, as this is not the time-consuming part

Solution

  • As I said in your previous question, you're going to have to learn more about thrust in order for answers to questions like this to have any meaning for you.

    You could do:

    thrust::generate(...);
    

    to generate your initial random numbers. The length of the vector here would be the number of simulations you want.
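    As a stand-in sketch of that step (using std::generate and std::mt19937 rather than the Thrust call, so it runs without Thrust headers; the function name is hypothetical):

    ```cpp
    #include <algorithm>
    #include <random>
    #include <vector>

    // Fill a vector with one random draw per simulation; in the Thrust
    // version this would be thrust::generate over a thrust::omp::vector.
    std::vector<unsigned> make_rands(std::size_t n_simulations, unsigned seed) {
      std::vector<unsigned> rands(n_simulations);
      std::mt19937 gen(seed);  // deterministic for a given seed
      std::generate(rands.begin(), rands.end(), [&] { return gen(); });
      return rands;
    }
    ```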

    You could then do:

    thrust::for_each(...);
    

    passing the previously generated random number vector as input, possibly zipped with a result vector. The for_each operation would use a custom functor to perform your monteCarlo routine as a separate invocation on each random number element in the input vector. The output of the monteCarlo routine for each input element would go in the corresponding location in the result vector. The parameter/data in A,B,C globals could be passed by reference as initialization parameters to the functor used by for_each.
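    The functor-with-references idea can be sketched in plain C++ (using stand-ins for the question's A, B, C types; the Thrust machinery is omitted so the pattern stands alone). With the OMP backend the functor runs on the host, so holding references to large host objects is safe; on a GPU backend you would copy the data instead:

    ```cpp
    // Stand-ins for the question's large parameter objects
    struct A { unsigned a; };
    struct B { int b; };
    struct C { float c; };

    // Functor carrying read-only references to the shared parameters,
    // passed in as initialization (constructor) arguments. Each thread
    // calls operator() independently; since the references are only
    // read, there are no race conditions.
    struct monte_carlo_functor {
      const A &a_; const B &b_; const C &c_;
      monte_carlo_functor(const A &a, const B &b, const C &c)
        : a_(a), b_(b), c_(c) {}

      float operator()(int rn) const {
        // same toy computation as the worked example below
        return ((rn % a_.a) + b_.b) / c_.c;
      }
    };
    ```

    An instance would then be handed to the algorithm, e.g. thrust::for_each(..., monte_carlo_functor(a, b, c)).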

    You could then do:

    thrust::reduce(...);
    

    on the previously generated result vector, to create your final reduction.

    I wouldn't be concerned about the mapping of the simulations you want to run onto OMP threads. Thrust will handle that, just as a #pragma omp parallel for would handle it in the plain OMP case.
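    That said, if you do want to cap the thread count, the standard OpenMP controls apply, since Thrust's OMP backend parallelizes with ordinary OpenMP pragmas (t536 here is the example program from the answer):

    ```shell
    # OMP_NUM_THREADS is the standard OpenMP environment variable;
    # Thrust's OMP backend honors it like any OpenMP program would.
    OMP_NUM_THREADS=8 ./t536 10000
    ```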

    Here's a fully worked example:

    $ cat t536.cpp
    #include <iostream>
    #include <stdlib.h>
    #include <thrust/system/omp/execution_policy.h>
    #include <thrust/system/omp/vector.h>
    #include <thrust/reduce.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/zip_iterator.h>
    
    
    struct A {
      unsigned a;
    };
    
    struct B {
    
      int b;
    };
    
    struct C {
    
      float c;
    };
    
    A a;
    B b;
    C c;
    
    float monteCarlo(int rn){
    
      return ((rn % a.a)+ b.b)/c.c;
    }
    
    struct my_functor
    {
      template <typename Tuple>
      void operator()(const Tuple &data) const{
    
        thrust::get<1>(data) = monteCarlo(thrust::get<0>(data));
       }
    };
    
    
    int main(int argc, char *argv[]){
      a.a = 10;
      b.b = 2;
      c.c = 4.5f;
      unsigned N = 10;
      if (argc > 1) N = atoi(argv[1]);
      thrust::omp::vector<unsigned> rands(N);
      thrust::omp::vector<float> result(N);
      thrust::generate(thrust::omp::par, rands.begin(), rands.end(), rand);
      thrust::for_each(thrust::omp::par, thrust::make_zip_iterator(thrust::make_tuple(rands.begin(), result.begin())), thrust::make_zip_iterator(thrust::make_tuple(rands.end(), result.end())), my_functor());
      float answer = thrust::reduce(thrust::omp::par, result.begin(), result.end());
      std::cout << answer << std::endl;
      return 0;
    }
    $ g++ -O2 -I/usr/local/cuda/include -o t536 t536.cpp -fopenmp -lgomp
    $ ./t536 10
    14.8889
    $
    

    (for these choices of a, b, c, the answer should converge toward 1.444*N as N gets large, within the limits of the datatype)

    And as @JaredHoberock mentioned on your previous question, the Thrust Monte Carlo example may be of interest. It demonstrates "fusion" of operations, so that a similar activity can be reduced to a single thrust call. The above program could probably be reduced to a single thrust call as well.
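
    For illustration, that fusion idea can be sketched with C++17's std::transform_reduce, which plays the role a single thrust::transform_reduce call would play in a fused Thrust version (the parameter values and per-element computation reuse the worked example above; the function name is hypothetical):

    ```cpp
    #include <functional>
    #include <numeric>   // std::transform_reduce (C++17)
    #include <vector>

    // Fuses the per-element monteCarlo step and the sum into one pass,
    // mirroring what a single fused thrust::transform_reduce would do
    // in place of the separate for_each + reduce calls.
    float fused_monte_carlo_sum(const std::vector<int> &rands) {
      const int a = 10, b = 2;
      const float c = 4.5f;  // the worked example's parameters
      return std::transform_reduce(
          rands.begin(), rands.end(), 0.0f,
          std::plus<float>(),                           // reduction step
          [=](int rn) { return ((rn % a) + b) / c; });  // transform (monteCarlo) step
    }
    ```

    Calling fused_monte_carlo_sum({3, 7, 12, 25}) sums (rn % 10 + 2)/4.5 over the four inputs in a single pass, with no intermediate result vector.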