Search code examples
rdplyrtop-n

How to summarize the top 3 highest values in a dataset when there are ties


I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.

A tibble: 15 x 3
   city      month number
   <chr>     <chr>  <dbl>
 1 Lund      jan       12
 2 Lund      feb       12
 3 Lund      mar       18
 4 Lund      apr       28
 5 Lund      may       28
 6 Stockholm jan       15
 7 Stockholm feb       15
 8 Stockholm mar       30
 9 Stockholm apr       30
10 Stockholm may       10
11 Uppsala   jan       22
12 Uppsala   feb       30
13 Uppsala   mar       40
14 Uppsala   apr       60
15 Uppsala   may       30

This is the code I have tried:

# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))

The expected (wanted) output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          86
2 Stockholm     75
3 Uppsala      130

but the actual R output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          86
2 Stockholm     90
3 Uppsala      160

It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.

Any help would be much appreciated! :)


Solution

  • Life might be way simpler without top_n():

    dat %>%
      group_by(city) %>%
      summarize(
        top_nr = sum(tail(sort(number), 3))
        )