r, lexicographic, lexicographic-ordering

Why does "a" < "A" return TRUE in R?


My understanding is that the lexicographic comparison of two characters reflects the numerical comparison of their Unicode code points. For example:

> utf8ToInt("a")
[1] 97
> utf8ToInt("b")
[1] 98
> # Because 97 < 98, the following lexicographic comparison returns TRUE
> "a" < "b"
[1] TRUE

Yet there is one particular case that I cannot explain:

> utf8ToInt("a")
[1] 97
> utf8ToInt("A")
[1] 65
> # Because 97 > 65, the following lexicographic comparison
> # should return FALSE, yet to my surprise it returns TRUE!
> "a" < "A"
[1] TRUE

I checked with Python 3.10.4 and DuckDB 0.7.0, and both return FALSE for the comparison 'a' < 'A'. PostgreSQL 15.3, however, returns TRUE, like R. I'm confused by this difference in behaviour.

Why do R and PostgreSQL return TRUE for the comparison "a" < "A"?


Solution

  • R has no notion of a single atomic char of the kind found in languages like C; the closest equivalent is a length-1 character vector whose one string contains a single character. Comparison operators therefore treat single characters exactly like any other strings: they compare them lexicographically using the collation rules, not as 8-bit numbers as in C. That ordering is locale-dependent. It is determined by the Scollate function in the underlying C code, which follows the LC_COLLATE locale category (normally initialised from the LC_COLLATE environment variable or a related one). Many non-C locales, such as typical English ones, collate a lowercase letter before its uppercase counterpart, which is why "a" < "A" returns TRUE there.

    If you want ASCII-based lexicographic ordering, you can set LC_COLLATE to "C":

    Sys.setlocale("LC_COLLATE", "C")
    

    Then we have:

    "a" < "A"
    #> [1] FALSE
    

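    Just be aware that code relying on this setting is not portable: it changes collation for the whole R session, and the result of a comparison differs on any system where the locale is set differently. If you want to compare ASCII values in R, it is more robust to convert the characters to numbers explicitly with utf8ToInt or similar.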
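
As a minimal sketch of the last point in the answer above, the comparison can be done on code points directly, so the result does not depend on LC_COLLATE. The helper name codepoint_less is hypothetical (it is not part of base R), and it assumes each argument is a single valid UTF-8 string. The sketch also shows how to switch the collation locale only temporarily and restore it afterwards, and how sort() with method = "radix" orders strings as in the C locale regardless of the session setting.

```r
# Hypothetical helper (not part of base R): TRUE if x sorts before y
# when the two strings are compared code point by code point.
codepoint_less <- function(x, y) {
  a <- utf8ToInt(x)
  b <- utf8ToInt(y)
  n <- min(length(a), length(b))
  # Positions (within the common length) where the code points differ
  diff_at <- which(a[seq_len(n)] != b[seq_len(n)])
  if (length(diff_at) > 0) {
    return(a[diff_at[1]] < b[diff_at[1]])
  }
  # One string is a prefix of the other: the shorter one sorts first
  length(a) < length(b)
}

codepoint_less("a", "A")  # FALSE, because 97 is not less than 65
codepoint_less("A", "a")  # TRUE

# Switch collation only for one comparison, then restore the old setting
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
c_result <- "a" < "A"     # FALSE under the C locale
Sys.setlocale("LC_COLLATE", old_collate)

# Radix sorting of character vectors uses C-locale ordering
radix_sorted <- sort(c("b", "A", "a"), method = "radix")
radix_sorted              # "A" "a" "b"
```

Restoring the saved locale keeps the change from leaking into the rest of the session, which addresses the portability concern for a single comparison without altering global behaviour.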