Search code examples
rdplyrpanel-datatidy

Expand rows between begin date and exit date and count the number of days in R


I have a dataset in which each row is a minister ID, for each row I have columns with the date (day, month, and year) the minister entered the government (begin) and exit the government (exit). This is a sample on how it looks like right now: Original

data <- structure(list(id_min = c("1030015", "1030028"), begin_day = c("29", 
"4"), begin_month = c("12", "1"), begin_year = c("2019", "2020"
), exit_day = c("3", "10"), exit_month = c("1", "1"), exit_year = c("2020", 
"2020")), row.names = c(NA, -2L), class = c("data.frame"))

I want to expand the rows between the "begin date" and "exit date", and also create a new column (number_days) counting the number of days the minister was in the government. The other variables in the dataset (omitted here) should just be repeated in the expanded version of the data). This is the data frame output I am looking for: Output dataframe

I was able to put the dates together in the format %Y-%m-%d, and create the new column "number_days" with the following codes

data$begin_date <- as.Date(with(data, paste(begin_year, begin_month, begin_day,sep="-")), "%Y-%m-%d")
data$exit_date <- as.Date(with(data, paste(exit_year, exit_month, exit_day,sep="-")), "%Y-%m-%d")
data$number_days <- data$exit_date - data$begin_date

But I am not having success expanding the rows between the two dates (begin_date and exit_date). I was trying to do it using tidyr::complete().


Solution

  • One approach is with the uncount function:

    library(dplyr)
    library(tidyr)
    library(lubridate)
    data %>% 
      mutate(start_date = mdy(paste(begin_month,begin_day,begin_year,sep="-")),
             end_date = mdy(paste(exit_month,exit_day,exit_year,sep="-")),
             number_days = as.integer(end_date-start_date + 1)) %>%
      uncount(as.integer(number_days)) %>%
      group_by(id_min) %>%
      mutate(begin_day = day(seq(start_date[1], end_date[1], by = "days")),
             begin_month = month(seq(start_date[1], end_date[1], by = "days")),
             begin_year = year(seq(start_date[1], end_date[1], by = "days"))) %>%
      dplyr::select(-start_date, -end_date)
    # A tibble: 13 x 8
    # Groups:   id_min [2]
       id_min  begin_day begin_month begin_year exit_day exit_month exit_year number_days
       <chr>       <int>       <int>      <int> <chr>    <chr>      <chr>           <int>
     1 1030015        29          12       2019 3        1          2020                6
     2 1030015        30          12       2019 3        1          2020                6
     3 1030015        31          12       2019 3        1          2020                6
     4 1030015         1           1       2020 3        1          2020                6
     5 1030015         2           1       2020 3        1          2020                6
     6 1030015         3           1       2020 3        1          2020                6
     7 1030028         4           1       2020 10       1          2020                7
     8 1030028         5           1       2020 10       1          2020                7
     9 1030028         6           1       2020 10       1          2020                7
    10 1030028         7           1       2020 10       1          2020                7
    11 1030028         8           1       2020 10       1          2020                7
    12 1030028         9           1       2020 10       1          2020                7
    13 1030028        10           1       2020 10       1          2020                7