What do backslash-escaped numbers '\1' to '\7' mean in R strings, and why do they compare wrongly?


Backslash-escaped numbers from 1 to 7 don't seem to do anything when printed.

I'm curious how R interprets them, partly because they seem to obey some strange comparison rules:

'\1' == '\2' # FALSE
'\1' <  '\2' # FALSE
'\1' >  '\2' # FALSE
'\1' <= '\2' # TRUE
'\1' >= '\2' # TRUE

EDIT: Behavior seems to be platform-dependent, so here's my sessionInfo():

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS: /usr/lib/libopenblasp-r0.3.5.so
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.2

Solution

  • They're octal constants/literals for ASCII characters: "Backslashes followed by up to three numbers are interpreted as octal notation for ASCII characters"

    \1 means \001, \2 means \002, and so on; both are unprintable control characters (SOH and STX in current ASCII; SOM and EOA were their names in the original 1963 version). They are not equivalent to the strings '1' and '2', which I think you're assuming they are, or should be.

    You can see their actual raw numeric values with:

    > charToRaw('\1')
    [1] 01
    > charToRaw('\2')
    [1] 02
    > charToRaw('1')
    [1] 31
    > charToRaw('2')
    [1] 32
    > charToRaw('\001')
    [1] 01
    > charToRaw('\002')
    [1] 02
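
The same point can be checked without reading raw bytes. This is only a small sketch using base R functions; the variable names are illustrative and not part of the original answer.

```r
# '\1' and '\001' denote the same one-character string.
same_string <- identical('\1', '\001')   # TRUE
n_chars     <- nchar('\1')               # 1

# utf8ToInt() returns the code point of a single character.
code_ctrl <- utf8ToInt('\1')             # 1, a control character
code_one  <- utf8ToInt('1')              # 49, the digit "1"

# Up to three octal digits are read: octal 101 is decimal 65, i.e. "A".
letter     <- '\101'
code_A     <- utf8ToInt(letter)            # 65
from_octal <- strtoi("101", base = 8L)     # 65, parsing the digits as octal
```

So the escape selects a character by its octal code, which is why '\1' has nothing to do with the digit '1'.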
    
    • Don't say "shell-escaped" if you mean "backslash-escaped". Do not assume R treats escapes the same way a Unix shell does; they're different.
    • Yes, I agree the </==/> comparison behavior you found is weird and inconsistent. I can confirm I got the same results in R 3.5.1 on macOS in locale en_US.UTF-8.
    • But I don't know whether the R language guarantees string-order comparisons on unprintable ASCII characters below 32 decimal (the "collation order" or "collating sequence"; this has been a thing since Fortran back in the 1960s). Maybe worthy of a minor bug report at most. Most language specs warn you that messing with any ASCII value below 32 can give strange or undefined behavior.
    • For more, type ?base::Quotes or see the R Language Definition, section 10.3.1 (Constants), on octal characters.
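
If a well-defined ordering of such strings is needed, one option is to sidestep locale collation. The sketch below rests on an assumption not stated in the answer: that `<` and `>` on strings use the LC_COLLATE setting of the session (as ?Comparison describes), so the result can change with the locale. It compares the byte values directly, then repeats the comparisons under the "C" collation locale, where ordering follows byte values. Variable names are illustrative.

```r
# Locale-independent: compare the underlying byte values.
byte_lt <- as.integer(charToRaw('\1')) < as.integer(charToRaw('\2'))

# Remember the current collation locale so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Assumption: under "C" collation, strings are ordered by byte value.
Sys.setlocale("LC_COLLATE", "C")
c_lt <- '\1' <  '\2'
c_gt <- '\1' >  '\2'
c_eq <- '\1' == '\2'

# Restore the original setting.
Sys.setlocale("LC_COLLATE", old_collate)

print(c(byte_lt = byte_lt, c_lt = c_lt, c_gt = c_gt, c_eq = c_eq))
```

Under "C" collation the three comparisons should be mutually consistent ('\1' sorts before '\2'), which suggests the odd results in the question come from how the en_US.UTF-8 collation treats control characters rather than from the octal escapes themselves.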