
R-friendly Greek characters


Edit: years later, with R 4.4.0 on a MacBook, β and ß are treated as different names and none of the assignments fail. I'm not sure whether this is due to the system or the R version.

I noticed that I can use some Greek letters as names, while others are either illegal or just aliases for letters from the Latin alphabet.

Basically, I can use β or µ (though β is changed to ß when printing, and ß and β act as aliases):

list(β = 1)
# $ß
# [1] 1
list(μ = 1)
# $µ
# [1] 1

α, Γ, δ, ε, Θ, π, Σ, σ, τ, Φ, φ and Ω are allowed but act as aliases for Latin letters.

list(α = 1)
# $a
# [1] 1

αa <- 42
aa
# [1] 42

GG <- 33
ΓΓ 
# [1] 33

Other letters I've tested just don't "work":

ι <- 1
# Error: unexpected input in "\"
Λ <- 1
# Error: unexpected input in "\"
λ <- 1
# Error: unexpected input in "\"

I was surprised about λ as it's defined by the package wrapr's define_lambda, so I assume this depends on the system.

I know that similar or identical-looking characters can have different encodings, and that some of them don't survive copy/paste between apps. The code in this question does return the described output when pasted back into RStudio.
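To make the point about look-alike characters concrete, here is a minimal C sketch (C being the language R's internals are written in) that decodes UTF-8 bytes into Unicode code points. The function names utf8_decode, fits_in_latin1 and print_code_points are made up for this illustration, and the decoder is deliberately simplified: it does not reject overlong or otherwise ill-formed sequences.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s and store
   the code point in *cp. Returns the number of bytes consumed, or 0 at the
   end of the string or on an invalid/truncated sequence. Simplified sketch:
   overlong encodings are not rejected. */
size_t utf8_decode(const char *s, unsigned long *cp)
{
    const unsigned char *u = (const unsigned char *) s;
    size_t len, i;

    if (u[0] < 0x80) { *cp = u[0]; return u[0] ? 1 : 0; }
    else if ((u[0] & 0xE0) == 0xC0) { *cp = u[0] & 0x1F; len = 2; }
    else if ((u[0] & 0xF0) == 0xE0) { *cp = u[0] & 0x0F; len = 3; }
    else if ((u[0] & 0xF8) == 0xF0) { *cp = u[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; i++) {
        /* every continuation byte must look like 10xxxxxx */
        if ((u[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (u[i] & 0x3F);
    }
    return len;
}

/* Hypothetical helper: code points up to U+00FF have a single-byte slot
   in Latin-1; anything above does not. */
int fits_in_latin1(unsigned long cp)
{
    return cp <= 0xFF;
}

/* Print every code point of a UTF-8 string in U+XXXX notation. */
void print_code_points(const char *s)
{
    unsigned long cp;
    size_t n;

    while ((n = utf8_decode(s, &cp)) > 0) {
        printf("U+%04lX ", cp);
        s += n;
    }
    printf("\n");
}
```

Called from a main() with the bytes "\xC2\xB5" (µ) and "\xCE\xBC" (μ), the decoder yields U+00B5 and U+03BC: two different characters that look almost the same. Likewise ß is U+00DF while β is U+03B2. Only U+00B5 and U+00DF fit in a single Latin-1 byte, which is at least consistent with the names printed in the output above.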

?make.names says:

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number

So part of the question is: what's a letter? And what's going on here?

More specifically:

  • Are there Greek characters that will safely work on all R installations? In particular, are µ and β (or ß) safe to use in a package?
  • Why isn't λ (intToUtf8(955)) usable on my system, when it seems to be commonly used by wrapr's users?
  • Are there other non-Latin letters, Greek or not, that I could safely use in my code? (For instance, Norwegian ø looks cool and seems to work on my system.)

This was all prompted by the fact that I'm looking for a one- (or two-) character function name that wouldn't conflict with an existing or commonly used name and would look a bit funky. . is already used a lot, and I already use .. as well.

From sessionInfo():

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252  

Solution

  • I'm not an expert by any means, but let's try to analyze the problem. In the end, your R code needs to be understood by the R interpreter, so the source code of make.names() may be helpful:

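    make.names <- function(names, unique = FALSE, allow_ = TRUE)
    {
        names <- as.character(names)
        # The real work (prefixing and character replacement) is done in C
        names2 <- .Internal(make.names(names, allow_))
        if (unique) {
            # Names that were already valid sort first, so they keep their spelling
            o <- order(names != names2)
            names2[o] <- make.unique(names2[o])
        }
        names2
    }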
    

    Now, .Internal() hands the call over to the R interpreter's internals, which are written in C, so we need to go a little deeper. The C code responsible for handling the make.names() request can be found here: https://github.com/wch/r-source/blob/0dccb93e114b00b2fcbe75e8721f11a8f2ffdff4/src/main/character.c

    A short snippet:

    SEXP attribute_hidden do_makenames(SEXP call, SEXP op, SEXP args, SEXP env)
    {
        SEXP arg, ans;
        R_xlen_t i, n;
        int l, allow_;
        char *p, *tmp = NULL, *cbuf;
        const char *This;
        Rboolean need_prefix;
        const void *vmax;

        checkArity(op ,args);
        arg = CAR(args);
        if (!isString(arg))
            error(_("non-character names"));
        n = XLENGTH(arg);
        allow_ = asLogical(CADR(args));
        if (allow_ == NA_LOGICAL)
            error(_("invalid '%s' value"), "allow_");
        PROTECT(ans = allocVector(STRSXP, n));
        vmax = vmaxget();
        for (i = 0 ; i < n ; i++) {
            This = translateChar(STRING_ELT(arg, i));
            l = (int) strlen(This);
            /* need to prefix names not beginning with alpha or ., as
               well as . followed by a number */
            need_prefix = FALSE;
            if (mbcslocale && This[0]) {
                int nc = l, used;
                wchar_t wc;
                mbstate_t mb_st;
                const char *pp = This;
                mbs_init(&mb_st);
                used = (int) Mbrtowc(&wc, pp, MB_CUR_MAX, &mb_st);
                pp += used; nc -= used;
                if (wc == L'.') {
                    if (nc > 0) {
                        Mbrtowc(&wc, pp, MB_CUR_MAX, &mb_st);
                        if (iswdigit(wc))  need_prefix = TRUE;
                    }
                } else if (!iswalpha(wc)) need_prefix = TRUE;
            } else {
                if (This[0] == '.') {
                    if (l >= 1 && isdigit(0xff & (int) This[1])) need_prefix = TRUE;
                } else if (!isalpha(0xff & (int) This[0])) need_prefix = TRUE;
            }
            if (need_prefix) {
                tmp = Calloc(l+2, char);
                strcpy(tmp, "X");
                strcat(tmp, translateChar(STRING_ELT(arg, i)));
            } else {
                tmp = Calloc(l+1, char);
                strcpy(tmp, translateChar(STRING_ELT(arg, i)));
            }
            if (mbcslocale) {
                /* This cannot lengthen the string, so safe to overwrite it. */
                int nc = (int) mbstowcs(NULL, tmp, 0);
                if (nc >= 0) {
                    wchar_t *wstr = Calloc(nc+1, wchar_t);
                    mbstowcs(wstr, tmp, nc+1);
                    for (wchar_t * wc = wstr; *wc; wc++) {
                        if (*wc == L'.' || (allow_ && *wc == L'_'))
                            /* leave alone */;
                        else if (!iswalnum((int)*wc)) *wc = L'.';
                    }
                    wcstombs(tmp, wstr, strlen(tmp)+1);
                    Free(wstr);
                } else error(_("invalid multibyte string %d"), i+1);
            } else {
                for (p = tmp; *p; p++) {
                    if (*p == '.' || (allow_ && *p == '_')) /* leave alone */;
                    else if (!isalnum(0xff & (int)*p)) *p = '.';
                    /* else leave alone */
                }
            }
            //  l = (int) strlen(tmp);        /* needed? */
            SET_STRING_ELT(ans, i, mkChar(tmp));
            /* do we have a reserved word?  If so the name is invalid */
            if (!isValidName(tmp)) {
                /* FIXME: could use R_Realloc instead */
                cbuf = CallocCharBuf(strlen(tmp) + 1);
                strcpy(cbuf, tmp);
                strcat(cbuf, ".");
                SET_STRING_ELT(ans, i, mkChar(cbuf));
                Free(cbuf);
            }
            Free(tmp);
            vmaxset(vmax);
        }
        UNPROTECT(1);
        return ans;
    }
    

    As we can see, the code relies on the platform-dependent type wchar_t (http://icu-project.org/docs/papers/unicode_wchar_t.html) and on locale-sensitive C library functions such as iswalpha() and iswalnum(). This means that the behavior of make.names() depends on the C compiler and C library used to build the R interpreter itself, and on the locale it runs in. The C standard leaves the width and encoding of wchar_t, and which characters count as alphabetic in a given locale, up to the implementation, so no assumption about how non-ASCII characters are treated can be made. The operating system, the C runtime, the locale and so on can all change this behavior.
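The locale dependence can be probed directly from C. The following is a minimal sketch, not part of R: is_alpha_in_locale is a made-up helper that switches LC_CTYPE, asks iswalpha() about one wide character, and restores the previous setting. It assumes that wchar_t values are Unicode code points for the characters tested, which the C standard does not guarantee.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if wc is alphabetic according to
   iswalpha() under the named LC_CTYPE locale, 0 if it is not, and -1 if
   the locale could not be selected. The previous locale is restored. */
int is_alpha_in_locale(const char *locale_name, wint_t wc)
{
    char saved[256];
    const char *old = setlocale(LC_CTYPE, NULL);  /* query only */
    int result;

    /* setlocale() may reuse its return buffer, so copy the old name */
    if (old == NULL || strlen(old) >= sizeof saved) return -1;
    strcpy(saved, old);

    if (setlocale(LC_CTYPE, locale_name) == NULL) return -1;
    result = iswalpha(wc) ? 1 : 0;

    setlocale(LC_CTYPE, saved);
    return result;
}
```

Comparing, say, is_alpha_in_locale("C", 0x3BB) with is_alpha_in_locale("en_US.UTF-8", 0x3BB) (0x3BB being λ, and the second locale name being an assumption about what is installed) shows what a given system thinks of the character. The results are deliberately not stated here, because they differ between C libraries and locales.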

    In conclusion, I would stick to ASCII characters if you want to be safe, especially when sharing your code between different operating systems.
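One conservative reading of "stick to ASCII" can be sketched as a small checker that applies the rule quoted from ?make.names using ASCII ranges only, so that no locale is involved. The names ascii_alpha, ascii_digit and is_portable_name are hypothetical, and the sketch does not check for reserved words such as if or TRUE.

```c
#include <stddef.h>

/* ASCII-only classification: no locale, no wchar_t */
int ascii_alpha(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

int ascii_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical helper: returns 1 if name consists only of ASCII letters,
   digits, '.' and '_', and starts with a letter or with a dot that is not
   followed by a digit; returns 0 otherwise. Reserved words are NOT checked. */
int is_portable_name(const char *name)
{
    const unsigned char *p = (const unsigned char *) name;

    if (p[0] == '\0') return 0;
    if (p[0] == '.') {
        if (ascii_digit(p[1])) return 0;    /* e.g. ".2way" */
    } else if (!ascii_alpha(p[0])) {
        return 0;                           /* e.g. "_x", "2x", or any non-ASCII byte */
    }
    for (; *p; p++) {
        if (!ascii_alpha(*p) && !ascii_digit(*p) && *p != '.' && *p != '_')
            return 0;
    }
    return 1;
}
```

With this rule, "beta" and ".x" pass, while ".2way", "_x" and the UTF-8 bytes of λ ("\xCE\xBB") are rejected on every platform, because the result never depends on what the C library considers a letter.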