Search code examples
rdatexlsx

How to import xlsx with mixed date types in same column?


I am trying to import an Excel spreadsheet in to R (via read.xlsx2()). The Excel data has a date column. That date column contains mixed types of date formats e.g. some rows are 42669, and some are in date format e.g. 26/10/2016.

read.xlsx2() reads it in as a factor, so I converted it to as.Date using the code below. This works for all the dates in numeric form (e.g. 42669) but R warns me that it added some NAs (for the ones in format 26/10/2016). My question is how can I import the excel data with proper dates for all the variable i.e. tell R that there is mixed data?

library(xlsx)
#Import excel file
df <- read.xlsx2(mydata, 1, header=true)

#Output = recd_date : Factor w/ 590 levels "", "26/10/2016", "42669" ...
levels(df$recd_date)

#Output = [1] "" "26/10/2016" "42669" ...

#This works for numeric dates: 

df$recd_date <- as.Date( as.numeric (as.character(df$recd_date) ),origin="1899-12-30")

#Output = recd_date : Date, format "2016-10-26" ...

#but it doesn't work for dd/mm/yyyy dates, R just replaces these with NA

Solution

  • We could apply a function to clean date if necessary, basically like this:

    cleanDate <- function(x) {
      if (all(nchar(df2$date.mix) < 10)) {
        cd <- as.Date(x)
      } else {
        cd <- do.call(c, 
                      lapply(x, function(i)
                        if (nchar(i) < 10)  
                          as.Date(as.numeric(i), origin="1970-01-01")
                        else as.Date(i)))
      }
      return(cd)
    }
    

    Example

    # generate test df
    df1 <- data.frame(date.chr=as.character(as.Date(1:3, origin=Sys.Date())), 
                      date.num=as.numeric(as.Date(1:3, origin=Sys.Date())),
                      date.mix=as.character(as.Date(1:3, origin=Sys.Date())),
                      stringsAsFactors=FALSE)
    
    df1[2, 3] <- as.character(as.numeric(as.Date(df1[2, 1])))
    > df1
        date.chr date.num   date.mix
    1 2019-02-01    17928 2019-02-01
    2 2019-02-02    17929      17929
    3 2019-02-03    17930 2019-02-03
    
    # write it to working directory
    library(xlsx)
    write.xlsx2(df1, "df1.xlsx")
    
    # read it
    # we use opt. `stringsAsFactors=FALSE` to prevent generation of factors
    df2 <- read.xlsx2("df1.xlsx", 1, stringsAsFactors=FALSE)
    > df2
      X.   date.chr date.num   date.mix
    1  1 2019-02-01    17928 2019-02-01
    2  2 2019-02-02    17929      17929
    3  3 2019-02-03    17930 2019-02-03
    

    Now we apply the function using lapply().

    date.cols <- c("date.chr", "date.num", "date.mix")  # select date columns
    df2[date.cols] <- lapply(df2[date.cols], cleanDate)
    

    Result

    > df2
      X.   date.chr   date.num   date.mix
    1  1 2019-02-01 2019-02-01 2019-02-01
    2  2 2019-02-02 2019-02-02 2019-02-02
    3  3 2019-02-03 2019-02-03 2019-02-03