
Rcpp function for subsetting strings


I was wondering whether there is an Rcpp function that takes an Rcpp::String as input and returns the character at a given index, for example the character at index 0. This would be equivalent to the std::string::at method from the <string> header in C++. I have written this:

#include <vector>
#include <string>
#include <Rcpp.h>

using namespace Rcpp;

typedef std::vector<std::string> stringList;

int SplitGenotypesA(std::string s) {
    char a = s.at(0);
    int b = a - '0';
    return b;
}

However, I would prefer not to have to convert between the Rcpp::String and std::string types.
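
For reference, the bounds-checking behaviour of std::string::at mentioned above can be seen in a small standalone sketch. It uses plain standard C++ rather than Rcpp so that it compiles without R, and `char_at_or` is a hypothetical helper name chosen only for illustration:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Rcpp or the standard library):
// returns the character at `index`, or `fallback` if the index is out of range.
char char_at_or(const std::string& s, std::size_t index, char fallback) {
    try {
        // at() throws std::out_of_range when index >= s.size()
        return s.at(index);
    } catch (const std::out_of_range&) {
        return fallback;
    }
}
```

With this sketch, `char_at_or("9C", 0, '?')` gives `'9'`, while an empty string or an index past the end gives the fallback character instead of throwing.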


Solution

  • You can pass an R character vector directly to C++ using Rcpp::StringVector. Because a single R string is just a character vector of length one, this handles single elements too.

    Getting the nth character of the ith element of your vector is as simple as vector[i][n] (both indices are zero-based).

    So, without using std::string you can do this:

    #include <Rcpp.h>
    
    // [[Rcpp::export]]
    Rcpp::NumericVector SplitGenotypesA(Rcpp::StringVector R_character_vector)
    {
      int number_of_strings = R_character_vector.size();
      Rcpp::NumericVector result(number_of_strings);
      for(int i = 0; i < number_of_strings; ++i)
      {
        // First character of the ith string
        char a = R_character_vector[i][0];
        // Convert a digit character to its numeric value, e.g. '9' -> 9
        result[i] = a - '0';
      }
      return result;
    }
    

    Now in R you can do:

    SplitGenotypesA("9C")
    # [1] 9
    

    or, better yet, pass a whole vector:

    SplitGenotypesA(c("1A", "2B", "9C"))
    # [1] 1 2 9
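
Note that the `a - '0'` conversion only yields a meaningful number when the first character is a digit. The following is a minimal sketch of a guard, written in plain standard C++ so that it compiles without R; `leading_digit` is a hypothetical helper name, and returning -1 for invalid input is an assumption made for illustration. The same check could presumably be applied to the character read from each StringVector element:

```cpp
#include <string>

// Hypothetical helper (not part of Rcpp): returns the numeric value of the
// first character if it is a decimal digit, or -1 for empty strings and
// strings that start with any other character.
int leading_digit(const std::string& s) {
    if (s.empty()) {
        return -1;  // nothing to read
    }
    const char c = s[0];
    if (c < '0' || c > '9') {
        return -1;  // first character is not a digit
    }
    return c - '0';  // '0'..'9' map to 0..9
}
```

Here `leading_digit("9C")` returns 9, while `leading_digit("C9")` and `leading_digit("")` both return the -1 sentinel.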
    

    In this microbenchmark, it is also marginally faster than the native R way of doing the same thing (a median of about 3.8 versus 4.1 microseconds):

    microbenchmark::microbenchmark(
      R_method    = as.numeric(substr(c("1A", "2B", "9C"), 1, 1)), 
      Rcpp_method = SplitGenotypesA(c("1A", "2B", "9C")),
      times = 1000)
    
    # Unit: microseconds
    #         expr   min    lq     mean median    uq    max neval
    #     R_method 3.422 3.765 4.076722  4.107 4.108 46.881  1000
    #  Rcpp_method 3.080 3.423 3.718779  3.765 3.765 32.509  1000