c++ r rcpp rcppparallel

Undefined reference to a custom function in a worker (C++ and RcppParallel)


I'm new to C++ programming and am experimenting with Rcpp from R. I wrote a function that produces all possible k-mers from a string. The serial version works well:

#include <Rcpp.h>
#include <string>
#include <iostream>
#include <ctime>
// using namespace Rcpp;

// [[Rcpp::export]]
std::vector< std::string > cpp_kmer( std::string s, int k ){
  std::vector< std::string > kmers;
  int seq_loop_size = s.length() - k+1;
  for ( int z=0; z < seq_loop_size; z++ ) {
    std::string  kmer;
    kmer = s.substr( z, k );
    kmers.push_back( kmer ) ;
  }
  return kmers;
}

However, when I try to use this function in a parallel implementation (using RcppParallel), with the code below:

#include <Rcpp.h>
#include <string>
#include <iostream>
#include <ctime>
using namespace Rcpp;

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
using namespace RcppParallel;

struct p_cpp_kmer : public Worker {
  // input string
  std::vector< std::string > seqs;
  int k;
  std::vector< std::string > cpp_kmer( std::string s, int k );
  // destination list
  List output;
  std::string
    sub_s;
  // initialize with source and destination
  p_cpp_kmer(std::vector< std::string > seqs, int k, List output) 
    : seqs(seqs), k(k), output(output) {}

  // calculate k-mers for the range of sequences requested
  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++)
      sub_s = seqs[i];
      cpp_kmer(sub_s, k);
  }
};

// [[Rcpp::export]]
List par_cpp_kmer(std::vector< std::string > seqs, int k, bool v){
  // allocate output list 
  List outpar(num_seqs);
  int num_seqs = seqs.size();
  // p_cpp_kmer functor (pass input and output matrixes)
  p_cpp_kmer par_kmer(seqs, k, outpar);
  parallelFor(0, num_seqs, par_kmer);
  return wrap(outpar);
}

std::vector< std::string > cpp_kmer( std::string s, int k ){
  std::vector< std::string > kmers;
  int seq_loop_size = s.length() - k+1;
  for ( int z=0; z < seq_loop_size; z++ ) {
    std::string  kmer;
    kmer = s.substr( z, k );
    kmers.push_back( kmer ) ;
  }
  return kmers;
}

It fails to compile with the error: undefined reference to `p_cpp_kmer::cpp_kmer(std::string, int)'.

I know it has to do with declaring/referencing the cpp_kmer, but I just can't figure out where/how to do so appropriately (due to my lack of knowledge in C++).

Thank you very much in advance.


Solution

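  • Your p_cpp_kmer struct declares a cpp_kmer method, but that method is never defined. What is defined later is the free function cpp_kmer, which is a different entity. Inside operator(), the unqualified call cpp_kmer(sub_s, k) resolves to the member, so the linker looks for p_cpp_kmer::cpp_kmer and does not find it.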

    You declare this method

    std::vector< std::string > cpp_kmer( std::string s, int k );
    

    You seem to want to use it:

    void operator()(std::size_t begin, std::size_t end) {
      for (std::size_t i = begin; i < end; i++)
        sub_s = seqs[i];
        cpp_kmer(sub_s, k);
    }
    

    But instead you define the free function cpp_kmer here:

    std::vector< std::string > cpp_kmer( std::string s, int k ){
      std::vector< std::string > kmers;
      int seq_loop_size = s.length() - k+1;
      for ( int z=0; z < seq_loop_size; z++ ) {
        std::string  kmer;
        kmer = s.substr( z, k );
        kmers.push_back( kmer ) ;
      }
      return kmers;
    }
    

    You could either remove the declaration of the cpp_kmer method from the struct so that the free function is used (the free function must then be declared before the struct), or actually define the method.
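
    The first option can be sketched without Rcpp. In the minimal example below, the struct has no cpp_kmer member, so the unqualified call inside operator() resolves to the free function declared above it. The struct name kmer_caller is hypothetical and only stands in for the worker, and the guard against non-positive k is an addition that is not in the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Free function, declared before the struct so the call below can see it.
std::vector<std::string> cpp_kmer(const std::string& s, int k);

// Hypothetical stand-in for the worker. It declares no cpp_kmer member,
// so the unqualified call in operator() resolves to the free function.
struct kmer_caller {
  std::vector<std::string> operator()(const std::string& s, int k) const {
    return cpp_kmer(s, k);  // writing ::cpp_kmer(s, k) would force this lookup
  }
};

// Definition of the free function: the loop from the question, plus a
// guard for non-positive k (an addition, not in the original).
std::vector<std::string> cpp_kmer(const std::string& s, int k) {
  std::vector<std::string> kmers;
  if (k <= 0) return kmers;
  int n = static_cast<int>(s.length()) - k + 1;
  for (int z = 0; z < n; z++) {
    kmers.push_back(s.substr(z, k));
  }
  return kmers;
}
```

    The second option keeps the declaration in the struct and defines the member outside it, with the qualified name p_cpp_kmer::cpp_kmer in place of the free function's name.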

    There are additional problems with the code:

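    • In your operator() you discard the result. I guess you meant to write this instead: output[i] = cpp_kmer(sub_s, k); Note also that the for loop has no braces, so only sub_s = seqs[i]; is repeated and the call runs once after the loop. The assignment has to go inside a braced loop body for i to be in scope.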

    • Even if you do something like the above, the code is unsafe, because output[i] = cpp_kmer(sub_s, k); allocates R objects (each individual R string and the string vector), and that must not happen in a separate thread.

    If you really want to do this in parallel, you need to make sure that you don't allocate any R object in the workers.
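
    One way to follow that rule is to have the worker write into plain C++ containers and convert to R objects only after the parallel section has finished. The sketch below is an illustration under assumptions, not RcppParallel code: kmer_worker is a hypothetical struct that mimics the operator()(begin, end) interface of RcppParallel::Worker but uses only the standard library, so it compiles on its own. In real code it would derive from Worker and be passed to parallelFor.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical worker: same call interface as RcppParallel::Worker,
// but it only touches standard C++ containers, never R objects.
struct kmer_worker {
  const std::vector<std::string>& seqs;           // input, read-only
  int k;
  std::vector<std::vector<std::string>>& output;  // preallocated by the caller

  kmer_worker(const std::vector<std::string>& seqs, int k,
              std::vector<std::vector<std::string>>& output)
    : seqs(seqs), k(k), output(output) {}

  // Process the sequences with index in [begin, end).
  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++) {
      // Local reference instead of a shared member such as sub_s,
      // so concurrent ranges do not write to the same variable.
      const std::string& s = seqs[i];
      std::vector<std::string> kmers;
      // Number of k-mers; the loop is skipped if k <= 0 or s is shorter than k.
      int n = (k > 0) ? static_cast<int>(s.length()) - k + 1 : 0;
      for (int z = 0; z < n; z++) {
        kmers.push_back(s.substr(z, k));
      }
      output[i] = kmers;  // each index is written by exactly one range
    }
  }
};
```

    A caller would preallocate the output with seqs.size() elements, run the worker over disjoint index ranges, and only then, back on the main thread, convert the result to an R list.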

    Furthermore, writing parallel code is much easier with C++11 and the tbb library that underlies RcppParallel. For example:

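    #include <Rcpp.h>
    #include <RcppParallel.h>
    #include <string>
    #include <vector>
    
    using namespace Rcpp;
    using namespace RcppParallel;
    
    // [[Rcpp::depends(RcppParallel)]]
    // [[Rcpp::plugins(cpp11)]]
    
    using string_vector = std::vector< std::string > ; 
    using list_string_vector = std::vector<string_vector> ;
    
    // [[Rcpp::export]]
    list_string_vector par_cpp_kmer( string_vector  seqs, int k, bool v){
      // v is unused; it is kept only to match the signature in the question
      int num_seqs = seqs.size() ;
    
      // plain C++ containers, so no R object is allocated inside the tasks
      list_string_vector out(num_seqs) ;
    
      tbb::parallel_for( 0, num_seqs, 1, [&seqs,k,&out](int i){
        const std::string& s = seqs[i] ;
        // number of k-mers; 0 when k is not positive or s is shorter than k
        int len = static_cast<int>( s.length() ) ;
        int seq_loop_size = ( k > 0 && len >= k ) ? len - k + 1 : 0 ;
    
        std::vector<std::string> vec(seq_loop_size) ;
        for ( int z=0; z < seq_loop_size; z++ ) {
          vec[z] = s.substr( z, k );
        }
        // each task writes only to its own slot, so no locking is needed
        out[i] = vec ;
    
      }) ;
      // the conversion to an R list happens here, after the parallel loop
      return out ;
    }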
    

    This assumes that std::string can be allocated in separate threads. Calling it from R gives:

    > par_cpp_kmer( c("foobar", "blabla"), 3, FALSE )
    [[1]]
    [1] "foo" "oob" "oba" "bar"
    
    [[2]]
    [1] "bla" "lab" "abl" "bla"