For my optimizations, i would like to get a decent toupper
in Rcpp. I'm very new to C++, and as for know I've done that:
#include <Rcpp.h>
using namespace Rcpp;
void C_toupper_String(const String& s) {
for (char *p =(char *)s.get_cstring();*p!=0;p++) *p = toupper(*p);
}
// [[Rcpp::export]]
StringVector C_toupper(StringVector const& vecteur) {
StringVector res=clone(vecteur);
for (int i(0); i < res.size(); ++i) {
C_toupper_String(res[i]);
}
return res;
}
/*** R
teststring <- "HeY I arNaud"
C_toupper(teststring)
toupper(teststring)
identical(C_toupper(teststring),toupper(teststring))
*/
However, it doesn't work as it should.
> C_toupper(teststring)
[1] "HEY I ARNAUD"
> toupper(teststring)
[1] "HEY I ARNAUD"
> identical(C_toupper(teststring),toupper(teststring))
[1] FALSE
What's the problem? If possible, I would like not to convert String
into std::string
, because I would like to understand what's happening: the point of going into C++ is to be able to avoid copies and conversions.
Thanks,
Arnaud
The question why the two strings don’t test identical
is difficult to explain — the two strings certainly look identical when inspecting their raw bytes (via charToRaw
), they do not carry attributes, and they don’t have an encoding set. So, really, they should be identical.
To solve the mystery, we need to understand what your C++ code is actually doing. More specifically, what the C-style cast in C_toupper_String
is doing. Because of their danger, you should never use C-style casts. Your code is running into problems purely because of that cast.
Why? Because String::get_cstring
returns char const*
. You are casting it to char*
and thereby casting away its const
ness. This can be safe, but only if the underlying storage isn’t const
. Otherwise it’s undefined behaviour (UB). The effects of UB are difficult to predict due to code rewriting (e.g. optimisations). In this case, it seems that it produces code that messes up the R string internals.
You fundamentally cannot modify Rcpp::String
objects in place, they don’t allow it. But if you merely want to avoid copying then your code is failing its aim anyway, since your C_toupper
function explicitly copies the input in the first step.
As Dirk said, the correct way of solving this problem is to use the available Rcpp API. And in the case of string modifications, this means converting your input to std::string
, performing the modifications, and then converting back. This does copy, but so does your current code. Here’s one good way of writing this code:
#include <Rcpp.h>
#include <cctype>
#include <string>
// [[Rcpp::export]]
Rcpp::StringVector C_toupper(Rcpp::StringVector const& vec) {
std::vector<std::string> res(vec.begin(), vec.end());
for (std::string& str : res) {
for (char& c : str) c = std::toupper(c);
}
return Rcpp::wrap(res);
}
Do note that this will sometimes produce wrong results, because std::toupper
fundamentally can’t deal with certain Unicode characteristics. R’s toupper
does better but also has some problems. A proper solution is using the {stringr} or {stringi} packages.