Tags: c++, multithreading, parallel-processing, cilk, cilk-plus

How to organize a pool of non-thread-safe resources in Cilk Plus (one resource per worker)?


I have a serial code that I would like to parallelize using Cilk Plus. The main loop calls a processing function repeatedly on different sets of data, so the iterations are independent of each other, except for the use of a non-thread-safe resource. The resource is encapsulated in a class (say, nts) provided by an external library; it takes a filename and does I/O on it.

If I were using OpenMP, I would create a pool of resources that contains as many resources as I have threads, and access these resources according to the thread ID:

std::vector<nts> nts_pool;
for (int i = 0; i < omp_get_max_threads(); ++i)  // sized before the parallel region starts
    nts_pool.push_back(nts{});

nts_pool[omp_get_thread_num()].do_stuff();  // from inside the task

Using Cilk Plus, I could do the same using the __cilkrts_get_nworkers() and __cilkrts_get_worker_number() APIs, but from multiple posts on the Intel forums I gather that this is considered a wrong solution to the problem, and that the right solution is to use a holder hyperobject.
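
Roughly, the direct translation of the OpenMP snippet would be something like the following (a sketch only; do_stuff() is the same placeholder as above, and the pool is built before any work is spawned):

std::vector<nts> nts_pool;
for (int i = 0; i < __cilkrts_get_nworkers(); ++i)
    nts_pool.push_back(nts{});

nts_pool[__cilkrts_get_worker_number()].do_stuff();  // from inside the spawned task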

Now, the holder solution looks nice indeed, except that I really want only as many views created as I have worker threads. That is, for 3 worker threads, I would like to have 3 objects and no more. The justification is that, as I said, the resource is provided by a third-party library, is very expensive to construct, and I will have to deal with the resulting files afterwards, so the fewer the better.

Unfortunately, I have found that instead of creating one view per worker and keeping it until a sync, holders create and destroy views according to logic that I don't understand, and there doesn't seem to be a way to influence this behavior.

Is it possible to make holders behave the way I want, and if not, what would be an idiomatic Cilk Plus solution to my problem?

Here is the program I used to investigate holders. Note that it creates up to 50 views on my test machine during one run, which are allocated and destroyed seemingly at random:

#include <iostream>
#include <atomic>
#include <string>

#include <cilk/cilk.h>
#include <cilk/holder.h>
#include <cilk/reducer_ostream.h>
#include <cilk/cilk_api.h>

cilk::reducer_ostream *hyper_cout;

class nts {
public:
    nts() : tag_{std::to_string(++id_)} {
        *hyper_cout << "NTS constructor: " << tag_ << std::endl;
    }
    ~nts() {
        *hyper_cout << "NTS destructor: " << tag_ << std::endl;
    }
    void print_tag() {
        *hyper_cout << "NTS tag: " << tag_ << std::endl;
    }
    static void is_lock_free() {
        *hyper_cout << "Atomic is lockfree: " << id_.is_lock_free() << std::endl;
    }
private:
    const std::string tag_;
    static std::atomic_size_t id_;
};

std::atomic_size_t nts::id_{0};

class nts_holder {
public:
    void print_tag() { nts_().print_tag(); }
private:
    cilk::holder<nts> nts_;
};

int main() {

    // Must be called before the Cilk runtime starts for "nworkers" to take effect.
    __cilkrts_set_param("nworkers", "4");

    cilk::reducer_ostream cout{std::cout};
    hyper_cout = &cout;

    *hyper_cout << "Workers: " <<  __cilkrts_get_nworkers() << std::endl;
    nts::is_lock_free();

    nts_holder ntsh;
    ntsh.print_tag();

    for (std::size_t i{0}; i < 1000; ++i) {
        cilk_spawn [&] () {
            ntsh.print_tag();
        } ();
    }

    cilk_sync;

    return 0;

}

Solution

  • You are correct that holders are a tempting but inefficient solution to this particular problem. If your program is correct using an array of slots with one slot per worker, there is really nothing wrong with using the __cilkrts_get_nworkers() and __cilkrts_get_worker_number() APIs in this case. We do discourage their use in general, preferring Cilk Plus code that is oblivious to the number of workers, because such code usually scales better. However, there are cases, including this one, where creating a slot per worker is the best strategy; a rough sketch follows below.
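
For illustration only (a sketch applied to the test program above, not a library facility): build the pool once after setting nworkers, then index it by __cilkrts_get_worker_number() inside each spawned task. The indexing is safe here because each lambda contains no spawns, so it runs from start to finish on a single worker and no other task can touch its slot concurrently.

std::vector<nts> nts_pool(__cilkrts_get_nworkers());  // exactly one nts per worker

for (std::size_t i{0}; i < 1000; ++i) {
    cilk_spawn [&] () {
        // No spawns inside, so this body is a single strand pinned to one worker.
        nts_pool[__cilkrts_get_worker_number()].print_tag();
    } ();
}

cilk_sync;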