Search code examples
rdplyrgreatest-n-per-group

dplyr: max value in a group, excluding the value in each row?


I have a data frame that looks as follows:

> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]

  g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4

I know how to add a column with the maximum x value for each group g:

> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g

  g x x_max
1 A 7     7
2 A 3     7
3 B 5     9
4 B 9     9
5 B 2     9
6 C 4     4

But what I would like is to get is the maximum x value for each group g, excluding the x value in each row.

For the given example, the desired output would look like this:

Source: local data frame [6 x 3]
Groups: g

  g x x_max x_max_exclude
1 A 7     7             3
2 A 3     7             7
3 B 5     9             9
4 B 9     9             5
5 B 2     9             9
6 C 4     4            NA

I thought I might be able to use row_number() to remove particular elements and take the max of what remained, but hit warning messages and got incorrect -Inf output:

> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g

  g x x_max r x_max_exclude
1 A 7     7 1          -Inf
2 A 3     7 2          -Inf
3 B 5     9 1          -Inf
4 B 9     9 2          -Inf
5 B 2     9 3          -Inf
6 C 4     4 1          -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf

What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for the help.


Solution

  • You could try:

    df %>% 
      group_by(g) %>% 
      arrange(desc(x)) %>% 
      mutate(max = ifelse(x == max(x), x[2], max(x)))
    

    Which gives:

    #Source: local data frame [6 x 3]
    #Groups: g
    #
    #  g x max
    #1 A 7   3
    #2 A 3   7
    #3 B 9   5
    #4 B 5   9
    #5 B 2   9
    #6 C 4  NA
    

    Benchmark

    I've tried the solutions so far on the benchmark:

    df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                     x = sample(1:10, 10e5, replace = TRUE))
    
    library(microbenchmark)
    
    mbm <- microbenchmark(
      steven = df %>% 
        group_by(g) %>% 
        arrange(desc(x)) %>% 
        mutate(max = ifelse(x == max(x), x[2], max(x))),
      eric = df %>% 
        group_by(g) %>% 
        mutate(x_max = max(x), 
               x_max2 = sort(x, decreasing = TRUE)[2], 
               x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
        select(-x_max2),
      arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
      times = 50
    )
    

    @Arun's data.table solution is the fastest:

    # Unit: milliseconds
    #    expr       min        lq      mean    median       uq      max neval cld
    #  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50  b 
    #    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
    #    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50 a   
    

    enter image description here