
Remove duplicate rows, keep first row


I am working with a dataframe on county executives. I want to run a panel study where the unit of analysis is the county-year.

The problem is that sometimes two or more county executives serve during the same year. I want to remove these semi-duplicate rows. I ALWAYS want to keep the county executive that is listed first.

If my initial df is:

df <- data.frame(year = c(2000, 2001, 2001, 2002, 2000, 2001, 2002, 2002, 2002),
                 executive.name = c("Johnson", "Smith", "Peters", "Alleghany", "Roberts", "Clarke", "Tollson", "Brown", "Taschen"),
                 district = c(1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002))

I want it to look like this:

df.neat <- data.frame(year = c(2000, 2001, 2002, 2000, 2001, 2002),
                      executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                      district = rep(c(1001, 1002), each = 3))

Solution

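  • Group by district and year, then keep the first row of each group. slice_head(n = 1) takes the first row of each group in its existing order, so the executive listed first for each district-year is the one retained.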

    library(dplyr)
    df_neat <- df %>%
      group_by(district, year) %>%
      slice_head(n = 1) %>%
      ungroup()
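
If you would rather avoid a dplyr dependency, the same result can probably be reached in base R with duplicated(), which flags every row whose district-year pair has already appeared earlier in the data frame. The sketch below is an alternative to the answer above, not part of it; the name df_neat_base is made up for illustration, and it assumes the rows are already in the order in which you want "first" to be decided.

```r
# Rebuild the example data from the question so the snippet is self-contained.
df <- data.frame(
  year = c(2000, 2001, 2001, 2002, 2000, 2001, 2002, 2002, 2002),
  executive.name = c("Johnson", "Smith", "Peters", "Alleghany", "Roberts",
                     "Clarke", "Tollson", "Brown", "Taschen"),
  district = c(1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002)
)

# duplicated() returns TRUE for each row whose (district, year) combination
# has already occurred above it, so negating it keeps the first occurrence.
keep <- !duplicated(df[c("district", "year")])
df_neat_base <- df[keep, ]

# Reset the row names so they run 1..n instead of keeping the original indices.
rownames(df_neat_base) <- NULL

print(df_neat_base)
```

Within dplyr, distinct(df, district, year, .keep_all = TRUE) is another option that keeps the first row per combination. Unlike the grouped slice_head() approach, both of these leave the rows in their original order rather than sorting them by the grouping columns.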