Language dependent sorting with R


1) How to sort correctly?

The task is to sort the abbreviated names of US states according to the English alphabet. However, I noticed that R sorts based on the operating system's language or regional settings. For example, in my language (Lithuanian), even the order of the basic Latin (non-Lithuanian) letters differs from their order in the English alphabet. Compare the order of only the non-Lithuanian letters in both alphabets:

"ABCDEFGHI Y JKLMNOPRSTUVZ"

sort(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "Y" "J" "K" "L" "M" "N"
[16] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Z"

vs.

"ABCDEFGHIJKLMNOPQRSTUVWX Y Z"

LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

So the order of the sorted state abbreviations also differs (notice the last two; they should be "WV" and then "WY"):

sort(state.abb)
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA"
[13] "ID" "IL" "IN" "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO"
[25] "MS" "MT" "NC" "ND" "NE" "NH" "NY" "NJ" "NM" "NV" "OH" "OK"
[37] "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI"
[49] "WY" "WV"

I tried Sys.setlocale("LC_TIME","English_United States.1252"). It helped to get English names of weekdays in plots, graphs and figures.

Now I need help sorting in the correct "English" way.

2) What other important language-dependent settings in R should a beginner pay attention to?

If you know of places where R behaves language-dependently and how to deal with them, please list them.


Solution

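  • LC_TIME only controls how dates and times are formatted; sort order is governed by LC_COLLATE. For your purposes, setting LC_ALL (which covers LC_COLLATE as well) should do the trick:

    # Sys.setlocale() changes the locale of the running session
    Sys.setlocale('LC_ALL', 'English_United States.1252')
    sort(letters)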
    

    However, beware that these settings are operating system specific. The above would, for instance, not work on a typical Unix system. There, the string 'en_US.UTF-8' is generally a good setting, but under Windows that itself may pose problems, as R’s Unicode support on Windows is sketchy.
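
  • If the locale names above are not available on your system, a locale-independent alternative may be enough here. The following is a minimal sketch, not the only way to do it: it relies on sort(method = "radix"), which is documented to use C-locale collation for character vectors, and on the "C" locale, which should exist on every platform. The helper name with_collate is hypothetical and introduced only for illustration. Note that C-locale order is byte order, so it matches English alphabetical order for all-uppercase strings such as state abbreviations, but uppercase letters sort before lowercase ones in mixed-case data.

```r
# Sort in C-locale (byte) order, regardless of the session's locale.
sorted_abb <- sort(state.abb, method = "radix")
tail(sorted_abb, 2)
# [1] "WV" "WY"

# Hypothetical helper: evaluate an expression under a temporary LC_COLLATE
# setting and restore the previous setting afterwards.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  force(expr)  # expr is evaluated lazily, i.e. only after the locale switch
}

sorted_letters <- with_collate("C", sort(LETTERS))
```

    To see which locale categories are currently active in your session, Sys.getlocale() with no arguments lists all of them.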