Search code examples
rdatedate-range

In R, convert "Oct 29 - Nov 1" to "20201029" and "20201101"


I am working with a messy table scraped from a website, and in order to make the date column more useful, I need to clean what was scraped. Our data looks something like this:

mydata <- structure(list(Dates = c("Sep\r\n            \r\n            10 - 13", 
"Oct\r\n            \r\n            8 - 11", "Oct 29 - Nov 1", 
"Nov\r\n            \r\n            19 - 22", "Jan\r\n            \r\n            21 - 24", 
"Mar\r\n            \r\n            4 - 7", "Apr 29 - May 2"), 
    points = c("500", "500", "500", "500", "500", "550", "500"
    )), row.names = c(1L, 5L, 8L, 11L, 16L, 23L, 32L), class = "data.frame")


> mydata
                                        Dates points
1  Sep\r\n            \r\n            10 - 13    500
5   Oct\r\n            \r\n            8 - 11    500
8                              Oct 29 - Nov 1    500
11 Nov\r\n            \r\n            19 - 22    500
16 Jan\r\n            \r\n            21 - 24    500
23   Mar\r\n            \r\n            4 - 7    550
32                             Apr 29 - May 2    500

Each date in Dates is a date range that should really be a startDate and endDate. What we're trying to create then is:

> newdata
     StartDate   EndDate  points
1     20200910  20200913     500
1     20201008  20201011     500
1     20201029  20201101     500
1     20201119  20201122     500
1     20210121  20210124     500
1     20210304  20210307     500
1     20210429  20210502     500

We can assume that all dates in the months sept - dec are for 2020, and all dates in the months jan - aug are for 2021.

Edit 1

It's not perhaps the cleanest code, but I was successful at splitting up the Dates column into 2 columns

  cleaning_dates_df <- do.call('rbind', strsplit(mydata$Dates, '-')) %>% as.data.frame()
  colnames(cleaning_dates_df) <- c('start', 'end')
  cleaning_dates_df$start <- as.character(cleaning_dates_df$start)
  cleaning_dates_df$end <- as.character(cleaning_dates_df$end)
  cleaning_dates_df <- cleaning_dates_df %>%
    dplyr::mutate(end = ifelse(nchar(end) > 4, end, paste0(trimws(sub("\r\n.*", "", start)), end))) %>%
    dplyr::mutate(start = ifelse(nchar(start) < 8, start, paste0(trimws(sub("\r\n.*", "", start)), sub(".*\\s", "", start)))) %>%
    dplyr::mutate(end = trimws(end)) %>% dplyr::mutate(start = trimws(start))

  head(cleaning_dates_df, 8)

... still need to convert into YYYYMMDD


Solution

  • You can try:

    library(lubridate)
    library(dplyr)
    
    d <- do.call(rbind, lapply(str_split(gsub("[\v-]", " ", mydata$Dates), "\\s+"), function(x) if (length(x) == 3) append(x, x[1], after = 2) else x) )
    
    start_date <- as.Date(paste(d[,1], d[,2], "2020", sep = "-"), format = "%b-%d-%Y")
    end_date <- as.Date(paste(d[,3], d[,4], "2020", sep = "-"), format = "%b-%d-%Y")
    
    start_date <- if_else(month(start_date) < 9, start_date + years(1), start_date)
    end_date <- if_else(month(end_date) < 9, end_date + years(1), end_date)
    
    data.frame(start_date, end_date,mydata$points)
    
      start_date   end_date mydata.points
    1 2020-09-10 2020-09-13           500
    2 2020-10-08 2020-10-11           500
    3 2020-10-29 2020-11-01           500
    4 2020-11-19 2020-11-22           500
    5 2021-01-21 2021-01-24           500
    6 2021-03-04 2021-03-07           550
    7 2021-04-29 2021-05-02           500
    

    Unless you have reason not to it's better to keep the data in a date format. But if you need them as the presented character string you can use format(), e.g.:

    format(start_date, "%Y%m%d")