I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr
.
A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 90
3 Uppsala 160
It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.
Any help would be much appreciated! :)
This tidyverse
(actually, dplyr
) solution is almost equal to akrun's, but filter
s the dataframe instead of getting the top_n
.
library(tidyverse)
my_data %>%
group_by(city) %>%
arrange(desc(number), .by_group = TRUE) %>%
filter(row_number() %in% 1:3) %>%
summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130