I am working with a dataframe on county executives. I want to run a panel study where the unit of analysis is the county-year.
The problem is that sometimes two or more county executives serve during the same year, which gives me several rows for one county-year. I want to remove these semi-duplicate rows. I ALWAYS want to keep the county executive who is listed first.
If my initial df is:
df <- data.frame(year = c(2000, 2001, 2001, 2002, 2000, 2001, 2002, 2002, 2002),
                 executive.name = c("Johnson", "Smith", "Peters", "Alleghany", "Roberts", "Clarke", "Tollson", "Brown", "Taschen"),
                 district = c(1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002))
I want to make it look like this:
df.neat <- data.frame(year = c(2000, 2001, 2002, 2000, 2001, 2002),
                      executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                      district = rep(c(1001, 1002), each = 3))
You can group by district and year, then take the first row from each group. Rows keep their original order within a group, so the first row is the executive listed first.
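library(dplyr)

df_neat <- df %>%
  group_by(district, year) %>%  # one group per district-year
  slice_head(n = 1) %>%         # keep the first-listed executive in each group
  ungroup()                     # drop the grouping so later steps see a plain tibble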
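If you would rather not load dplyr, the same idea can probably be written in base R. This is only a sketch, assuming the row order of df already reflects who is "listed first"; the name df_base is just illustrative and not from the question.

```r
# Rebuild the example data from the question so the snippet runs on its own.
df <- data.frame(
  year = c(2000, 2001, 2001, 2002, 2000, 2001, 2002, 2002, 2002),
  executive.name = c("Johnson", "Smith", "Peters", "Alleghany", "Roberts",
                     "Clarke", "Tollson", "Brown", "Taschen"),
  district = c(1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002)
)

# duplicated() flags every row whose (district, year) pair has already
# appeared further up, so negating it keeps the first row for each pair.
df_base <- df[!duplicated(df[, c("district", "year")]), ]

# Reset the row names, which otherwise keep the original row positions.
rownames(df_base) <- NULL
```

Unlike the grouped version, this keeps the rows in their original order instead of sorting them by district and year, which makes no difference for the example data but may matter if your real data is not already sorted.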