Search code examples
c++rstringrcpptoupper

Homemade toupper: looks the same but not identical


For my optimizations, i would like to get a decent toupper in Rcpp. I'm very new to C++, and as for know I've done that:

#include <Rcpp.h>
using namespace Rcpp;

void C_toupper_String(const String& s) {
  for (char *p =(char *)s.get_cstring();*p!=0;p++) *p = toupper(*p);
}
// [[Rcpp::export]]
StringVector C_toupper(StringVector const& vecteur) {
  StringVector res=clone(vecteur);
  for (int i(0); i < res.size(); ++i) {
    C_toupper_String(res[i]);
  }
  return res;
}

/*** R
teststring <- "HeY I arNaud"
C_toupper(teststring)
toupper(teststring)
identical(C_toupper(teststring),toupper(teststring))
*/

However, it doesn't work as it should.

> C_toupper(teststring)
[1] "HEY I ARNAUD"

> toupper(teststring)
[1] "HEY I ARNAUD"

> identical(C_toupper(teststring),toupper(teststring))
[1] FALSE

What's the problem? If possible, I would like not to convert String into std::string, because I would like to understand what's happening: the point of going into C++ is to be able to avoid copies and conversions.

Thanks,

Arnaud


Solution

  • The question why the two strings don’t test identical is difficult to explain — the two strings certainly look identical when inspecting their raw bytes (via charToRaw), they do not carry attributes, and they don’t have an encoding set. So, really, they should be identical.

    To solve the mystery, we need to understand what your C++ code is actually doing. More specifically, what the C-style cast in C_toupper_String is doing. Because of their danger, you should never use C-style casts. Your code is running into problems purely because of that cast.

    Why? Because String::get_cstring returns char const*. You are casting it to char* and thereby casting away its constness. This can be safe, but only if the underlying storage isn’t const. Otherwise it’s undefined behaviour (UB). The effects of UB are difficult to predict due to code rewriting (e.g. optimisations). In this case, it seems that it produces code that messes up the R string internals.


    You fundamentally cannot modify Rcpp::String objects in place, they don’t allow it. But if you merely want to avoid copying then your code is failing its aim anyway, since your C_toupper function explicitly copies the input in the first step.

    As Dirk said, the correct way of solving this problem is to use the available Rcpp API. And in the case of string modifications, this means converting your input to std::string, performing the modifications, and then converting back. This does copy, but so does your current code. Here’s one good way of writing this code:

    #include <Rcpp.h>
    #include <cctype>
    #include <string>
    
    // [[Rcpp::export]]
    Rcpp::StringVector C_toupper(Rcpp::StringVector const& vec) {
        std::vector<std::string> res(vec.begin(), vec.end());
        for (std::string& str : res) {
            for (char& c : str) c = std::toupper(c);
        }
    
        return Rcpp::wrap(res);
    }
    

    Do note that this will sometimes produce wrong results, because std::toupper fundamentally can’t deal with certain Unicode characteristics. R’s toupper does better but also has some problems. A proper solution is using the {stringr} or {stringi} packages.