Search code examples
rcsvencodingutf-8non-ascii-characters

R on Windows: character encoding hell


I am trying to import a CSV encoded as OEM-866 (Cyrillic charset) into R on Windows. I also have a copy that has been converted into UTF-8 w/o BOM. Both of these files are readable by all other applications on my system, once the encoding is specified.

Furthermore, on Linux, R can read these particular files with the specified encodings just fine. I can also read the CSV on Windows IF I do not specify the "fileEncoding" parameter, but this results in unreadable text. When I specify the file encoding on Windows, I always get the following errors, for both the OEM and the Unicode file:

Original OEM file import:

> oem.csv <- read.table("~/csv1.csv", sep=";", dec=",", quote="",fileEncoding="cp866")   #result:  failure to import all rows
Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection '~/Revolution/RProject1/csv1.csv'
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  number of items read is not a multiple of the number of columns

UTF-8 w/o BOM file import:

> unicode.csv <- read.table("~/csv1a.csv", sep=";", dec=",", quote="",fileEncoding="UTF-8") #result:    failure to import all row
Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection '~/Revolution/RProject1/csv1a.csv'
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  number of items read is not a multiple of the number of columns

Locale info:

> Sys.getlocale()
   [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

What is it about R on Windows that is responsible for this? I've pretty much tried everything I could by this point, besides ditching windows.

Thank You

(Additional failed attempts):

>Sys.setlocale("LC_ALL", "en_US.UTF-8") #OS reports request to set locale to "en_US.UTF-8" cannot be honored
>options(encoding="UTF-8") #now nothing can be imported  
> noarg.unicode.csv <- read.table("~/Revolution/RProject1/csv1a.csv", sep=";", dec=",", quote="")   #result: mangled cyrillic
> encarg.unicode.csv <- read.table("~/Revolution/RProject1/csv1a.csv", sep=";", dec=",", quote="",encoding="UTF-8") #result: mangled cyrillic

Solution

  • Simple answer.

    Sys.setlocale(locale = "Russian")

    if you just want the russian language (not formats, currency):

    'Sys.setlocale(category = "LC_COLLATE", locale = "Russian")'

    'Sys.setlocale(category = "LC_CTYPE", locale = "Russian")'

    If happen to be using Revolution R Open 3.2.2, you may need to set the locale in the Control Panel as well: otherwise - if you have RStudio - you'll see Cyrillic text in the Viewer and garbage in the console. So for example if you type a random cyrillic string and hit enter you'll get garbage output. Interestingly, Revolution R does not have the same problem with say Arabic text. If you use regular R, it seems that Sys.setlocale() is enough.

    'Sys.setlocale()' was suggested by user G. Grothendieck's here: R, Windows and foreign language characters