Tags: r, csv, encoding, utf-8, windows-11

"invalid multibyte string 8" error popping up for read.csv in R version 4.2.0


I installed the brand-new R version 4.2.0 and tried to run my code written with version 4.1.x.

When reading in data with read.csv this new error popped up:

Error in make.names(col.names, unique = TRUE) : invalid multibyte string 8

I figure that this has to do with the new native UTF-8 support?

I am running R under Windows 11 with English language support. I am not aware of any special characters in the CSV file, but I cannot rule them out either, because the file is very large.

What can I do to switch back to the old encoding which ran without any errors?


Solution

  • The default behaviour for R for versions < 4.2 has been:

    If you don't set a default encoding, files will be opened using UTF-8 (on Mac desktop, Linux desktop, and server) or the system's default encoding (on Windows).

    This behaviour has changed in R 4.2:

    R 4.2 for Windows will support UTF-8 as native encoding

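    To find out the default (ANSI) encoding on Windows 10, run the following command in Windows PowerShell 5.1 (PowerShell 7 and later report UTF-8 here regardless of the system code page):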

    [System.Text.Encoding]::Default
    

    The output for this on my Windows 10 machine is:

    IsSingleByte      : True
    BodyName          : iso-8859-1
    EncodingName      : Western European (Windows)
    HeaderName        : Windows-1252
    WebName           : Windows-1252
    WindowsCodePage   : 1252
    IsBrowserDisplay  : True
    IsBrowserSave     : True
    IsMailNewsDisplay : True
    IsMailNewsSave    : True
    EncoderFallback   : System.Text.InternalEncoderBestFitFallback
    DecoderFallback   : System.Text.InternalDecoderBestFitFallback
    IsReadOnly        : True
    CodePage          : 1252
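The code page can also be queried from inside R, which avoids switching to PowerShell. The following is a minimal sketch, assuming R 4.1 or later on Windows, where `l10n_info()` reports a `system.codepage` element; on other platforms that element is absent and the sketch only reports whether the session is UTF-8. The `windows-<number>` name it builds is a guess that fits code pages such as 1252 and may need adjusting for others.

```r
# Inspect the locale of the running R session.
info <- l10n_info()

# TRUE when R itself runs in UTF-8 (the default for R >= 4.2 on recent Windows).
is_utf8 <- isTRUE(info[["UTF-8"]])
cat("Session is UTF-8:", is_utf8, "\n")

# system.codepage is only reported on Windows (assumption: R >= 4.1).
cp <- info[["system.codepage"]]
if (!is.null(cp)) {
  # Assumed naming scheme; verify the name against iconvlist() if it fails.
  cat("Candidate fileEncoding:", paste0("windows-", cp), "\n")
} else {
  cat("No Windows system code page reported on this platform.\n")
}
```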
    

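    This can be passed to read.csv through the fileEncoding argument, which re-encodes the file while reading it (the encoding argument only marks strings as Latin-1 or UTF-8 and does not convert them):

    read.csv(path_to_file, fileEncoding = "windows-1252")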
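To see the effect on a small case, here is a hedged, self-contained sketch. It writes a temporary CSV whose header contains the single Windows-1252 byte 0xE9 ("é") and reads it back with `fileEncoding`, the `read.csv` argument that converts the input from the named encoding. The file contents and the column names are made up for illustration, and the sketch assumes the session's native encoding can represent "é" (for example UTF-8).

```r
# Write a tiny CSV as raw bytes so the header holds a Windows-1252 "é" (0xE9).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("caf"), as.raw(0xE9), charToRaw(",n\nx,1\n")), con)
close(con)

# In a UTF-8 session, a plain read may fail on the header; capture any error.
naive <- tryCatch(read.csv(tmp), error = function(e) conditionMessage(e))
if (is.character(naive)) cat("Plain read.csv failed:", naive, "\n")

# Declare the file's encoding so R converts it while reading.
df <- read.csv(tmp, fileEncoding = "windows-1252")
print(names(df))
print(df)
```

If the file was produced on a machine with a different code page, the same call should work with that code page's name in place of "windows-1252".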
    

    If you are unsure how to translate the PowerShell output into the relevant encoding name, you can search the list of all encodings with the stringi package:

    # Replace "1252" with the relevant output from the PowerShell command
    cat(grep("1252", stringi::stri_enc_list(simplify = FALSE), value = TRUE, ignore.case = TRUE))
    

    You can take your pick from any of the options in the output:

    # c("ibm-1252", "ibm-1252_P100-2000", "windows-1252") c("cp1252", "ibm-5348", "ibm-5348_P100-1997", "windows-1252")
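For a large file where it is unclear whether any special characters are present, it can help to locate the offending lines first. The sketch below is one possible approach, not part of the original answer: it reads the file as raw lines, flags those that are not valid UTF-8 with base R's `validUTF8()`, and converts the flagged lines for inspection. The sample file is fabricated for the demonstration, and "windows-1252" as the source encoding is an assumption taken from the example above.

```r
# Fabricated sample: the third line contains a Windows-1252 "é" (byte 0xE9).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("id,name\n1,abc\n2,caf"), as.raw(0xE9), charToRaw("\n")), con)
close(con)

# Read the lines without re-encoding them.
lines <- readLines(tmp, warn = FALSE)

# Indices of lines whose bytes are not valid UTF-8.
bad <- which(!validUTF8(lines))
print(bad)

# Show the flagged lines converted from the assumed source encoding.
print(iconv(lines[bad], from = "windows-1252", to = "UTF-8"))
```

Note that `readLines()` loads the whole file into memory, so for very large files it may be preferable to read in chunks through an open connection.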