Search code examples
rdplyruniquedistinct

R dplyr, distinct, unique combination of variables, with maximum value of third


I'm close but don't have the syntax correct. I'm trying to select all columns of a data table based on selection of unique combinations of two variables (columns) based on the maximum value of a third. MWE of progress thus far. Thx. J

library(dplyr)

dt1 <- tibble (var1 = c("num1", "num2", "num3", "num4", "num5"),
               var2 =  rep("A", 5),
               var3 = c(rep("B", 2), rep("C", 3)),
               var4 = c(5, 10, 3, 7, 19))

dt1 %>% distinct(var2, var3, max(var4),  .keep_all =  TRUE)

# A tibble: 2 x 5
  var1  var2  var3   var4 `max(var4)`
  <chr> <chr> <chr> <dbl>       <dbl>
1 num1  A     B         5          19
2 num3  A     C         3          19

which is close, but I want the row where the value of var4 is the max value, within the unique combination of var2 and var3. I'm attempting to get:

# A tibble: 2 x 5
  var1  var2  var3   var4 `max(var4)`
  <chr> <chr> <chr> <dbl>       <dbl>
1 num2  A     B         5          10
2 num5  A     C         3          19

Do I need a formula for the third argument of the distinct function?


Solution

  • We can add an arrange statement before the distinct

    library(dplyr)
    dt1 %>%
         arrange(var2, var3, desc(var4)) %>%
         distinct(var2, var3, .keep_all = TRUE)
    

    -output

    # A tibble: 2 x 4
      var1  var2  var3   var4
      <chr> <chr> <chr> <dbl>
    1 num2  A     B        10
    2 num5  A     C        19
    

    Or another option is slice_max

    dt1 %>%
        group_by(var2, var3) %>%
        mutate(var4new = first(var4)) %>% 
        slice_max(order_by= var4, n = 1) %>% 
        ungroup
    

    -output

    # A tibble: 2 x 5
      var1  var2  var3   var4 var4new
      <chr> <chr> <chr> <dbl>   <dbl>
    1 num2  A     B        10       5
    2 num5  A     C        19       3