I have a data frame of county executives and the first and last years they served.
I am running a panel study with county-year as the unit of analysis. The date range is 2000 to 2009.
I will like to expand the df such that it lists the period served by each executive in long format. Some executives serve until December 31st but others are inaugurated after the beginning of the calendar year.
The catch? If two or more executives served terms during the same calendar year, I want the data frame to attribute the county-year to the executive that was in office the longest.
Also note: caretaker executives that only served a few weeks should not be included in the data frame at all (ie. Tollson and Edwards).
My starting data looks like this:
df.a1 <- data.frame(executive.name= rep(c("Johnson", "Alleghany", "Clarke", "Roland", "Tollson", "Richards", "Peters", "Harrison", "Burr", "Diamond", "Edwards", "Gorman"),each=2),
date= rep(c("from 01-Jan-2000", "to 31-Dec-2002", "from 01-Jan-2003", "to 03-Mar-2004", "from 04-Mar-2004", "to 05-Nov-2005", "from 06-Nov-2005", "to 31-Dec-2007", "from 01-Jan-2008", "to 03-Mar-2008", "from 04-Mar-2008", "to 30-Nov-2009"), times=2),
district= c(rep(1001:1002, each=12)))
> df.a1
executive.name date district
1 Johnson from 01-Jan-2000 1001
2 Johnson to 31-Dec-2002 1001
3 Alleghany from 01-Jan-2003 1001
4 Alleghany to 03-Mar-2004 1001
5 Clarke from 04-Mar-2004 1001
6 Clarke to 05-Nov-2005 1001
7 Roland from 06-Nov-2005 1001
8 Roland to 31-Dec-2007 1001
9 Tollson from 01-Jan-2008 1001
10 Tollson to 03-Mar-2008 1001
11 Richards from 04-Mar-2008 1001
12 Richards to 30-Nov-2009 1001
13 Peters from 01-Jan-2000 1002
14 Peters to 31-Dec-2002 1002
15 Harrison from 01-Jan-2003 1002
16 Harrison to 03-Mar-2004 1002
17 Burr from 04-Mar-2004 1002
18 Burr to 05-Nov-2005 1002
19 Diamond from 06-Nov-2005 1002
20 Diamond to 31-Dec-2007 1002
21 Edwards from 01-Jan-2008 1002
22 Edwards to 03-Mar-2008 1002
23 Gorman from 04-Mar-2008 1002
24 Gorman to 30-Nov-2009 1002
I will like it to look like this:
df.a1.neat <- data.frame(executive.name= c("Johnson", "Johnson", "Alleghany", "Clarke", "Clarke", "Roland", "Roland", "Richards", "Richards", "Peters", "Peters", "Harrison", "Burr", "Burr", "Diamond", "Diamond", "Gorman", "Gorman"),
date= rep(c(2000, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009), times=2),
district= c(rep(1001:1002, each=9)))
executive.name date district
1 Johnson 2000 1001
2 Johnson 2002 1001
3 Alleghany 2003 1001
4 Clarke 2004 1001
5 Clarke 2005 1001
6 Roland 2006 1001
7 Roland 2007 1001
8 Richards 2008 1001
9 Richards 2009 1001
10 Peters 2000 1002
11 Peters 2002 1002
12 Harrison 2003 1002
13 Burr 2004 1002
14 Burr 2005 1002
15 Diamond 2006 1002
16 Diamond 2007 1002
17 Gorman 2008 1002
18 Gorman 2009 1002
This is including all years, because the logic is easier to handle programmatically - if you have a reason that you wish some years removed, I'll let you handle that aspect.
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
df.a1 |>
# Set up date_from and date_to vars for period calculation
mutate(date_from = if_else(str_detect(date, "from"), date, lag(date)),
date_to = if_else(str_detect(date, "to"), date, lead(date))) |>
select(-date) |>
# Remove duplicates
distinct() |>
# Further preparing for period calculation
mutate(date_from = str_remove(date_from, "from "),
date_to = str_remove(date_to, "to "),
across(date_from:date_to, dmy),
year_from = year(date_from),
year_to = year(date_to)) |>
# Setting data to long format in order to vectorize period calculation
pivot_longer(year_from:year_to, names_to = "start_end", values_to = "year") |>
# Filling in all years and then resorting the rows back to original sort
group_by(executive.name, district) |>
complete(year = full_seq(min(year):max(year), 1)) |>
ungroup() |>
arrange(district, year) |>
# Calculating the period for each year, district, and name
mutate(period = case_when(year(date_from) == year(date_to) ~ date_to - date_from,
start_end == "year_from" ~ as.Date(paste0(year, "-12-31")) - date_from,
start_end == "year_to" ~ date_to - as.Date(paste0(year, "-01-01")),
TRUE ~ as.Date(paste0(year, "-12-31")) - as.Date(paste0(year, "-01-01"))), .by = year) |>
# Remove unwanted names for a given district and year
filter(period == max(period), .by = c(district, year)) |>
# Set data frame to desired output
select(executive.name, year, district)
# A tibble: 20 × 3
executive.name year district
<chr> <dbl> <int>
1 Johnson 2000 1001
2 Johnson 2001 1001
3 Johnson 2002 1001
4 Alleghany 2003 1001
5 Clarke 2004 1001
6 Clarke 2005 1001
7 Roland 2006 1001
8 Roland 2007 1001
9 Richards 2008 1001
10 Richards 2009 1001
11 Peters 2000 1002
12 Peters 2001 1002
13 Peters 2002 1002
14 Harrison 2003 1002
15 Burr 2004 1002
16 Burr 2005 1002
17 Diamond 2006 1002
18 Diamond 2007 1002
19 Gorman 2008 1002
20 Gorman 2009 1002