r dplyr duplicates

Keep last occurrence when using dplyr::distinct(), not the first (can't use arrange() here)


I have a dataframe like this:

# A tibble: 4 x 5
  category month   comment             score email
  <chr>    <chr>   <chr>               <dbl> <chr>
1 neutro   2020-01 ""                      8 xxx  
2 promotor 2020-04 "ok"                    9 xxx  
3 promotor 2020-04 "very cool"             9 xxx  
4 promotor 2020-05 "i really liked it"     9 xxx

Unfortunately, the survey had a flaw: a client could answer more than once.
So now I'm trying to keep only the last answer within each group.
When I use dplyr::distinct(), it keeps the first occurrence:

df %>% 
   distinct(category, month, score, email, .keep_all = TRUE)

# A tibble: 3 x 5
  category month   comment             score email
  <chr>    <chr>   <chr>               <dbl> <chr>
1 neutro   2020-01 ""                      8 xxx  
2 promotor 2020-04 "ok"                    9 xxx  
3 promotor 2020-05 "i really liked it"     9 xxx

But I would like to keep the last one, so this is my desired result:

# A tibble: 3 x 5
  category month   comment             score email
  <chr>    <chr>   <chr>               <dbl> <chr>
1 neutro   2020-01 ""                      8 xxx  
2 promotor 2020-04 "very cool"             9 xxx  
3 promotor 2020-05 "i really liked it"     9 xxx

Note: As mentioned in the title, I can't use arrange() on the grouping columns.


Solution

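  • Could you use group_by()? Grouping by the key columns and taking the last row of each group with slice_tail() keeps the last occurrence: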

    library(dplyr)
    
    df %>%
      group_by(category, month, score, email) %>% # group_by(across(-comment)) would also work for this example
      slice_tail() %>%
      ungroup()
    

    Output:

    # A tibble: 3 x 5
      category month   comment             score email
      <fct>    <fct>   <fct>               <int> <fct>
    1 neutro   2020-01 ""                      8 xxx  
    2 promotor 2020-04 "very cool"             9 xxx  
    3 promotor 2020-05 "i really liked it"     9 xxx
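
If grouping is not an option either, here are two other ways that might work, shown as a minimal sketch. The data frame below is rebuilt by hand from the printout in the question, and the names keys, last_base and last_distinct are only illustrative. Neither variant sorts the data, so the surviving rows stay in their original relative order.

```r
library(dplyr)

# Toy data rebuilt from the question's printout (an assumption, not the real data).
df <- tibble(
  category = c("neutro", "promotor", "promotor", "promotor"),
  month    = c("2020-01", "2020-04", "2020-04", "2020-05"),
  comment  = c("", "ok", "very cool", "i really liked it"),
  score    = c(8, 9, 9, 9),
  email    = "xxx"
)

# Columns that define a duplicate (illustrative helper name).
keys <- c("category", "month", "score", "email")

# Option 1: base R. duplicated(fromLast = TRUE) flags every row that has a
# later duplicate, so negating it keeps the last row of each key combination.
last_base <- df[!duplicated(df[keys], fromLast = TRUE), ]

# Option 2: reverse the rows, let distinct() keep what is now the first
# occurrence, then reverse again to restore the original order.
last_distinct <- df %>%
  slice(rev(seq_len(n()))) %>%
  distinct(category, month, score, email, .keep_all = TRUE) %>%
  slice(rev(seq_len(n())))
```

Both objects should hold the "very cool" row for 2020-04 rather than "ok". The base R variant needs no dplyr verbs at all, which may help if the rest of the pipeline cannot be changed.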