Search code examples
rfilterduplicatesdrop

How to remove or drop duplicates based on two column (ID and quarter)? R Studio


My dataframe2 looks like this

dataframe2 = structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), 
                            Quarterly = c(2006.1, 2006.1, 2006.1, 2006.2, 2006.3, 2006.4, 2006.1, 2006.2, 2006.3, 
                                          2006.3, 2006.4, 2006.1, 2006.1, 2006.1, 2006.2, 2006.2, 2006.3, 2006.4), 
                            Status = c("Employed", "Employed", "Employed", "Employed", "Null", "Employed", "Employed",
                                       "Employed", "Employed", "Employed", "Employed", "Null", "Null", "Employed", 
                                       "Employed", "Employed", "Employed", "Employed")), 
                       class = "data.frame", row.names = c(NA, -18L))
ID   Quarterly    Status 
1    2006.1     Employed 
1    2006.1     Employed 
1    2006.1     Employed 
1    2006.2     Employed 
1    2006.3     Null 
1    2006.4     Employed 
2    2006.1     Employed 
2    2006.2     Employed 
2    2006.3     Employed 
2    2006.3     Employed 
2    2006.4     Employed 
3    2006.1     Null 
3    2006.1     Null 
3    2006.1     Employed 
3    2006.2     Employed 
3    2006.2     Employed 
3    2006.3     Employed 
3    2006.4     Employed 

I'm keen for it to look like this, so that there is only one observation per ID

ID   Quarterly    Status 
1    2006.1.    Employed 
1    2006.2     Employed 
1    2006.3     Null 
1    2006.4     Employed 
2    2006.1     Employed 
2    2006.2     Employed 
2    2006.3     Employed 
2    2006.4     Employed 
3    2006.1     Null 
3    2006.2     Employed 
3    2006.3     Employed 
3    2006.4     Employed 

I've tried a couple of options that I had found on this website, and none have worked as hoped. I did: (a) group_by(ID,Quarterly) %>% filer(n()>1) and (b) group_by(ID,Quarterly) %>% distinct(ID, keep.all = TRUE). Option (b) just removed all the Quarterly values except 2023.1 (the latest my dates go to).


Solution

  • The way to do it with dplyr is using summarise

    result <- dataframe2  |> summarise(.by = c("ID", "Quarterly"), first(Status))
    

    For simplicity, I took the function first to summarize the group over ID and Quaterly