
Different output when running Rscript vs. source()


I have the following script:

city <- c("Екатеринбург", NA, "Курск", "Псков",
          "березники", "Челябинск", NA, "москва",
          "москва", "Петергоф/Санкт-Петербург",
          "Петергоф/Санкт-Петербург", "Волгоград",
          "Олегегорск", "СПб", "Москва", "Москва",
          "Москва ", "Санкт-Петербург")
city[grep("^(москва|мск|msk)", city, ignore.case = TRUE)] <- "Москва"
city[grep("питер|спб|spb|петербург", city, ignore.case = TRUE)] <- "Санкт-Петербург"
city[grep("Москва|Санкт-Петербург", city, invert = TRUE)] <- "Другие города"
print(city)

When I run Rscript test.R, I get the following results:

% Rscript test.R
 [1] "Другие города"   "Другие города"   "Другие города"   "Другие города"
 [5] "Другие города"   "Другие города"   "Другие города"   "Москва"
 [9] "Москва"          "Санкт-Петербург" "Санкт-Петербург" "Другие города"
[13] "Другие города"   "Санкт-Петербург" "Москва"          "Москва"
[17] "Москва"          "Санкт-Петербург"

When I run source("test.R"), I get different results:

% Rscript -e 'source("test.R")'
 [1] "Другие города"            "Другие города"
 [3] "Другие города"            "Другие города"
 [5] "Другие города"            "Другие города"
 [7] "Другие города"            "Москва"
 [9] "Москва"                   "Петергоф/Санкт-Петербург"
[11] "Петергоф/Санкт-Петербург" "Другие города"
[13] "Другие города"            "Другие города"
[15] "Москва"                   "Москва"
[17] "Москва "                  "Санкт-Петербург"

I get correct results when I:

  • run the script with Rscript: Rscript test.R
  • type the commands into an R session line by line

With source() I get incorrect results (both with Rscript -e and inside an R session).

System info, in case it helps:

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Arch Linux

locale:
[1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=C
[5] LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=ru_RU.UTF-8    LC_PAPER=ru_RU.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.2.1

Solution

  • This is a file-encoding issue. Pass the following options to source(): encoding = "UTF-8", verbose = TRUE

    If you leave off the encoding option (keeping verbose = TRUE), you will see at the top of the output that the default is encoding = "native.enc", which is not what you want for the Cyrillic characters in this script.
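
    As a rough illustration of the suggestion above, the sketch below writes a shortened version of the question's script to a temporary file and sources it with an explicit encoding. The temporary file, the three-element city vector, and the use of enc2utf8() with useBytes = TRUE to force UTF-8 on disk are assumptions made for this example, not part of the original script. It assumes the session runs in a UTF-8 locale, as in the question.

```r
# Hypothetical stand-in for test.R: a shortened version of the script,
# written to a temporary file as UTF-8 bytes regardless of the locale.
script <- tempfile(fileext = ".R")
code <- enc2utf8(c(
  'city <- c("москва", "СПб", "Курск")',
  'city[grep("^(москва|мск|msk)", city, ignore.case = TRUE)] <- "Москва"',
  'city[grep("питер|спб|spb|петербург", city, ignore.case = TRUE)] <- "Санкт-Петербург"'
))
writeLines(code, script, useBytes = TRUE)

# State the file encoding explicitly instead of relying on the default
# getOption("encoding"), which is "native.enc".
source(script, encoding = "UTF-8")

# Encoding() shows how R has marked each resulting string.
print(Encoding(city))
print(city)
```

    For the original file, the equivalent call would be source("test.R", encoding = "UTF-8"), provided test.R is actually saved as UTF-8.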