I am working with a messy table scraped from a website, and in order to make the date column more useful, I need to clean what was scraped. Our data looks something like this:
mydata <- structure(list(Dates = c("Sep\r\n \r\n 10 - 13",
"Oct\r\n \r\n 8 - 11", "Oct 29 - Nov 1",
"Nov\r\n \r\n 19 - 22", "Jan\r\n \r\n 21 - 24",
"Mar\r\n \r\n 4 - 7", "Apr 29 - May 2"),
points = c("500", "500", "500", "500", "500", "550", "500"
)), row.names = c(1L, 5L, 8L, 11L, 16L, 23L, 32L), class = "data.frame")
> mydata
Dates points
1 Sep\r\n \r\n 10 - 13 500
5 Oct\r\n \r\n 8 - 11 500
8 Oct 29 - Nov 1 500
11 Nov\r\n \r\n 19 - 22 500
16 Jan\r\n \r\n 21 - 24 500
23 Mar\r\n \r\n 4 - 7 550
32 Apr 29 - May 2 500
Each date in Dates
is a date range that should really be a startDate
and endDate
. What we're trying to create then is:
> newdata
StartDate EndDate points
1 20200910 20200913 500
1 20201008 20201011 500
1 20201029 20201101 500
1 20201119 20201122 500
1 20210121 20210124 500
1 20210304 20210307 500
1 20210429 20210502 500
We can assume that all dates in the months sept - dec are for 2020, and all dates in the months jan - aug are for 2021.
It's not perhaps the cleanest code, but I was successful at splitting up the Dates column into 2 columns
cleaning_dates_df <- do.call('rbind', strsplit(mydata$Dates, '-')) %>% as.data.frame()
colnames(cleaning_dates_df) <- c('start', 'end')
cleaning_dates_df$start <- as.character(cleaning_dates_df$start)
cleaning_dates_df$end <- as.character(cleaning_dates_df$end)
cleaning_dates_df <- cleaning_dates_df %>%
dplyr::mutate(end = ifelse(nchar(end) > 4, end, paste0(trimws(sub("\r\n.*", "", start)), end))) %>%
dplyr::mutate(start = ifelse(nchar(start) < 8, start, paste0(trimws(sub("\r\n.*", "", start)), sub(".*\\s", "", start)))) %>%
dplyr::mutate(end = trimws(end)) %>% dplyr::mutate(start = trimws(start))
head(cleaning_dates_df, 8)
... still need to convert into YYYYMMDD
You can try:
library(lubridate)
library(dplyr)
d <- do.call(rbind, lapply(str_split(gsub("[\v-]", " ", mydata$Dates), "\\s+"), function(x) if (length(x) == 3) append(x, x[1], after = 2) else x) )
start_date <- as.Date(paste(d[,1], d[,2], "2020", sep = "-"), format = "%b-%d-%Y")
end_date <- as.Date(paste(d[,3], d[,4], "2020", sep = "-"), format = "%b-%d-%Y")
start_date <- if_else(month(start_date) < 9, start_date + years(1), start_date)
end_date <- if_else(month(end_date) < 9, end_date + years(1), end_date)
data.frame(start_date, end_date,mydata$points)
start_date end_date mydata.points
1 2020-09-10 2020-09-13 500
2 2020-10-08 2020-10-11 500
3 2020-10-29 2020-11-01 500
4 2020-11-19 2020-11-22 500
5 2021-01-21 2021-01-24 500
6 2021-03-04 2021-03-07 550
7 2021-04-29 2021-05-02 500
Unless you have reason not to it's better to keep the data in a date format. But if you need them as the presented character string you can use format()
, e.g.:
format(start_date, "%Y%m%d")