Confused about concatenation of strings in Rcpp


I am trying to loop through a data frame in Rcpp and concatenate word blocks that are separated by a space.

I read some answers on Stack Overflow (e.g., Concatenate StringVector with Rcpp), but I am still thoroughly confused about how strings are concatenated in Rcpp.

I know in C++ you can just use the + operator to add strings.

My Rcpp function is below:

cppFunction('
Rcpp::StringVector formTextBlocks(DataFrame frame) {
#include <string> 
using namespace Rcpp;
 NumericVector frame_x = as<NumericVector>(frame["x"]);

   LogicalVector space = as<LogicalVector>(frame["space"]);
   Rcpp::StringVector text=as<StringVector>(frame["text"]);
  if (text.size() == 0) {
    return text;
  }
  int dfSize = text.size();

  for(int i = 0;  i < dfSize; ++i) {
    if ( i !=dfSize  ) {
     if (space[i]==true) {

     text[i]=text[i] + text[i+1]  ;

    }
  }

  }
  return text;
}
')

The error is along the lines of: error: no match for 'operator+'

How can strings be concatenated inside a loop?


Solution

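  • Indexing an Rcpp::StringVector yields a proxy object rather than a std::string, and there is no operator+ for two such proxies, hence the error. Since operator+ is defined for std::string, the easiest fix is to convert the text column to std::vector<std::string> instead of Rcpp::StringVector: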

    Rcpp::cppFunction('
    std::vector<std::string> formTextBlocks(DataFrame frame) {
      LogicalVector space = as<LogicalVector>(frame["space"]);
      std::vector<std::string> text=as<std::vector<std::string>>(frame["text"]);
      if (text.size() == 0) {
        return text;
      }
      int dfSize = text.size();
    
      for(int i = 0;  i < dfSize - 1; ++i) {
        if (space[i]==true) {
          text[i]=text[i] + text[i+1];
        }
      }
      return text;
    }
    ')
    
    set.seed(20191129)
    textBlock <- data.frame(space = sample(c(TRUE, FALSE), 100, replace = TRUE),
                            text = sample(LETTERS, 100, replace = TRUE),
                            stringsAsFactors = FALSE)
    formTextBlocks(textBlock)
    #>   [1] "B"  "N"  "G"  "BM" "M"  "O"  "C"  "F"  "OQ" "Q"  "FH" "H"  "D"  "HK" "KH"
    #>  [16] "H"  "S"  "LX" "XO" "OY" "Y"  "E"  "VD" "D"  "TN" "N"  "LL" "LQ" "Q"  "F" 
    #>  [31] "XX" "X"  "S"  "R"  "P"  "L"  "M"  "GK" "KD" "DD" "D"  "H"  "M"  "M"  "K" 
    #>  [46] "N"  "GP" "PG" "G"  "P"  "G"  "O"  "N"  "NY" "Y"  "OX" "X"  "LX" "XF" "FS"
    #>  [61] "SE" "E"  "PS" "S"  "YD" "D"  "F"  "Z"  "H"  "ZN" "N"  "OM" "M"  "XH" "HV"
    #>  [76] "V"  "OX" "X"  "J"  "BZ" "Z"  "FZ" "ZE" "E"  "SV" "V"  "G"  "F"  "DZ" "ZF"
    #>  [91] "F"  "PB" "B"  "K"  "N"  "U"  "B"  "PV" "V"  "C"
    

    Created on 2019-11-29 by the reprex package (v0.3.0)

    Notes:

    • I have removed the #include and using lines. They are not necessary and do not belong inside the function definition.
    • I have removed the i != dfSize test, which is always true inside the loop because the loop condition already guarantees i < dfSize.
    • The loop now stops one element early (i < dfSize - 1), since the body reads element i+1 and would otherwise run past the end of the vector.
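
    The same loop can also be tried outside of R. The following is only a minimal sketch in plain C++, under the assumption that a std::vector<std::string> and a std::vector<bool> stand in for the text and space columns; the name form_text_blocks is hypothetical and not part of Rcpp:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone counterpart of formTextBlocks: plain C++
// containers stand in for the data frame columns "text" and "space".
std::vector<std::string> form_text_blocks(std::vector<std::string> text,
                                          const std::vector<bool>& space) {
    if (text.empty()) {
        return text;
    }
    // Stop one element early because the body reads text[i + 1].
    // The extra bound on space guards against columns of unequal length.
    for (std::size_t i = 0; i + 1 < text.size() && i < space.size(); ++i) {
        if (space[i]) {
            // operator+ is defined for std::string, so this compiles.
            text[i] = text[i] + text[i + 1];
        }
    }
    return text;
}
```

    Calling form_text_blocks({"A", "B", "C", "D"}, {true, false, true, false}) should give {"AB", "B", "CD", "D"}. Each element is joined with its unmodified right-hand neighbour, because element i+1 has not yet been processed when element i is updated.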