Search code examples
rdatelubridate

Extract year, month and day when dates are non-standard format


I have a column of dates, I want to extract the year, month and day into separate columns. Unfortunately there are inconsistent entries in the dates column, so the normal solution of using format(as.Date(),"%Y") or lubridate::year() doesn't work.

Here is an example dataframe:

dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"))

And here is the desired result:

       dates year month  day
1 1985-03-23 1985     3   23
2            <NA>  <NA> <NA>
3       1983 1983  <NA> <NA>
4    1984-01 1984     1 <NA>

I can achieve the desired result with the following code, but it is very slow on large datasets (>100,000 rows):

dates_df$year <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[1])
dates_df$month <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[2])
dates_df$day <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[3])

My question:

Is there a more efficient (fast) way to extract year, month, day columns from messy date data?


Solution

  • Using strsplit and adapting the lengths.

    cbind(dates_df, t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)))
    #        dates    1    2    3
    # 1 1985-03-23 1985   03   23
    # 2            <NA> <NA> <NA>
    # 3       1983 1983 <NA> <NA>
    # 4    1984-01 1984   01 <NA>
    

    With nice names:

    cbind(dates_df, `colnames<-`(
      t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)), c('year', 'month', 'day')))
    #        dates year month  day
    # 1 1985-03-23 1985    03   23
    # 2            <NA>  <NA> <NA>
    # 3       1983 1983  <NA> <NA>
    # 4    1984-01 1984    01 <NA>