Search code examples
rxlsxopenxlsx

Importing -100 as NA


I'm dealing with a huge Excel file (thousands of columns), and the researchers have used various ways to encodeNA, icluding -100. How would I import this? I've tried

library("openxlsx")
df <- read.xlsx("file.xlsx", sheet = 1, colNames = TRUE, detectDates=TRUE, skipEmptyRows=TRUE, na.strings=c("NA", "N/A", "-100", "-"))

However, -100 still appears as, well, -100, not NA.


Solution

  • This appears to be a bug in openxlsx::read.xlsx. I created a small .xlsx document with two columns:

    enter image description here

    Then tried reading it with read.xlsx. The na.strings argument doesn't seem to work very well. It omits the last row with two "N/A" values (not desired) and keeps the "-99" values as-is rather than replacing them with NA as desired:

    library(openxlsx)
    read.xlsx("test.xlsx", na.strings = c("N/A", "-99"))
    #   num  char
    # 1   1 hello
    # 2 -99   -99
    # 3   3     3
    
    # for comparison, without na.strings
    read.xlsx("test.xlsx")
    #   num  char
    # 1   1 hello
    # 2 -99   -99
    # 3   3     3
    # 4 N/A   N/A
    

    The readxl package does much better:

    library(readxl)
    read_excel("test.xlsx", na = "-99")
    # # A tibble: 4 x 2
    #     num char 
    #   <dbl> <chr>
    # 1     1 hello
    # 2    NA NA   
    # 3     3 3    
    # 4    NA NA   
    

    This was using a freshly installed openxlsx version 4.1.0, and readxl version 1.2.0 (current version is 1.3.0).


    The openxlsx github page has an open issue regarding na.strings. I added this example. You can track/comment on the issue here.