Search code examples
rdataframedatedplyrtime

data frame breakdown by year, rounding up by duration of each period


I have a data frame of county executives and the first and last years they served.

I am running a panel study with county-year as the unit of analysis. The date range is 2000 to 2009.

I will like to expand the df such that it lists the period served by each executive in long format. Some executives serve until December 31st but others are inaugurated after the beginning of the calendar year.

The catch? If two or more executives served terms during the same calendar year, I want the data frame to attribute the county-year to the executive that was in office the longest.

Also note: caretaker executives that only served a few weeks should not be included in the data frame at all (ie. Tollson and Edwards).

My starting data looks like this:

df.a1 <- data.frame(executive.name= rep(c("Johnson", "Alleghany", "Clarke",  "Roland", "Tollson", "Richards", "Peters", "Harrison", "Burr", "Diamond", "Edwards", "Gorman"),each=2),
                 date= rep(c("from 01-Jan-2000", "to 31-Dec-2002", "from 01-Jan-2003", "to 03-Mar-2004", "from 04-Mar-2004", "to 05-Nov-2005", "from 06-Nov-2005", "to 31-Dec-2007", "from 01-Jan-2008", "to 03-Mar-2008", "from 04-Mar-2008", "to 30-Nov-2009"), times=2),
                  district= c(rep(1001:1002, each=12)))

> df.a1
   executive.name             date district
1         Johnson from 01-Jan-2000     1001
2         Johnson   to 31-Dec-2002     1001
3       Alleghany from 01-Jan-2003     1001
4       Alleghany   to 03-Mar-2004     1001
5          Clarke from 04-Mar-2004     1001
6          Clarke   to 05-Nov-2005     1001
7          Roland from 06-Nov-2005     1001
8          Roland   to 31-Dec-2007     1001
9         Tollson from 01-Jan-2008     1001
10        Tollson   to 03-Mar-2008     1001
11       Richards from 04-Mar-2008     1001
12       Richards   to 30-Nov-2009     1001
13         Peters from 01-Jan-2000     1002
14         Peters   to 31-Dec-2002     1002
15       Harrison from 01-Jan-2003     1002
16       Harrison   to 03-Mar-2004     1002
17           Burr from 04-Mar-2004     1002
18           Burr   to 05-Nov-2005     1002
19        Diamond from 06-Nov-2005     1002
20        Diamond   to 31-Dec-2007     1002
21        Edwards from 01-Jan-2008     1002
22        Edwards   to 03-Mar-2008     1002
23         Gorman from 04-Mar-2008     1002
24         Gorman   to 30-Nov-2009     1002

I will like it to look like this:

df.a1.neat <- data.frame(executive.name= c("Johnson", "Johnson", "Alleghany", "Clarke", "Clarke", "Roland", "Roland", "Richards", "Richards", "Peters", "Peters", "Harrison", "Burr", "Burr", "Diamond", "Diamond", "Gorman", "Gorman"),
                 date= rep(c(2000, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009), times=2),
                  district= c(rep(1001:1002, each=9)))

   executive.name date district
1         Johnson 2000     1001
2         Johnson 2002     1001
3       Alleghany 2003     1001
4          Clarke 2004     1001
5          Clarke 2005     1001
6          Roland 2006     1001
7          Roland 2007     1001
8        Richards 2008     1001
9        Richards 2009     1001
10         Peters 2000     1002
11         Peters 2002     1002
12       Harrison 2003     1002
13           Burr 2004     1002
14           Burr 2005     1002
15        Diamond 2006     1002
16        Diamond 2007     1002
17         Gorman 2008     1002
18         Gorman 2009     1002

Solution

  • This is including all years, because the logic is easier to handle programmatically - if you have a reason that you wish some years removed, I'll let you handle that aspect.

    library(dplyr)
    library(tidyr)
    library(lubridate)
    library(stringr)
    
    df.a1 |> 
      # Set up date_from and date_to vars for period calculation
      mutate(date_from = if_else(str_detect(date, "from"), date, lag(date)),
             date_to = if_else(str_detect(date, "to"), date, lead(date))) |> 
      select(-date) |> 
      # Remove duplicates
      distinct() |> 
      # Further preparing for period calculation
      mutate(date_from = str_remove(date_from, "from "),
             date_to = str_remove(date_to, "to "),
             across(date_from:date_to, dmy),
             year_from = year(date_from),
             year_to = year(date_to)) |>
      # Setting data to long format in order to vectorize period calculation
      pivot_longer(year_from:year_to, names_to = "start_end", values_to = "year") |> 
      # Filling in all years and then resorting the rows back to original sort
      group_by(executive.name, district) |> 
      complete(year = full_seq(min(year):max(year), 1)) |> 
      ungroup() |> 
      arrange(district, year) |> 
      # Calculating the period for each year, district, and name
      mutate(period = case_when(year(date_from) == year(date_to) ~ date_to - date_from, 
                                start_end == "year_from" ~ as.Date(paste0(year, "-12-31")) - date_from,
                                start_end == "year_to" ~ date_to - as.Date(paste0(year, "-01-01")),
                                TRUE ~ as.Date(paste0(year, "-12-31")) - as.Date(paste0(year, "-01-01"))), .by = year) |> 
      # Remove unwanted names for a given district and year
      filter(period == max(period), .by = c(district, year)) |> 
      # Set data frame to desired output
      select(executive.name, year, district)
    
    # A tibble: 20 × 3
       executive.name  year district
       <chr>          <dbl>    <int>
     1 Johnson         2000     1001
     2 Johnson         2001     1001
     3 Johnson         2002     1001
     4 Alleghany       2003     1001
     5 Clarke          2004     1001
     6 Clarke          2005     1001
     7 Roland          2006     1001
     8 Roland          2007     1001
     9 Richards        2008     1001
    10 Richards        2009     1001
    11 Peters          2000     1002
    12 Peters          2001     1002
    13 Peters          2002     1002
    14 Harrison        2003     1002
    15 Burr            2004     1002
    16 Burr            2005     1002
    17 Diamond         2006     1002
    18 Diamond         2007     1002
    19 Gorman          2008     1002
    20 Gorman          2009     1002