I have a table like this and would like to find the genes that appear most often (let's say the top 10 genes) and then compute the mean of tail_len for those top 10 genes.
| | gene | tail_len |
---|---|---|
1 | SPAC20G4.06c | 3 |
2 | SPCC613.06 | 5 |
3 | SPAC6F6.03c | 2 |
4 | SPAC20G4.06c | 3 |
5 | SPBC23G7.15c | 5 |
6 | SPAC589.10c | 2 |
7 | SPBC23G7.15c | 3 |
8 | SPAC22H12.04c | 1 |
9 | SPAC22H12.04c | 12 |
10 | SPAC6G10.11c | 8 |
11 | SPAC589.10c | 31 |
12 | SPBC18E5.06 | 16 |
Here is a way with `slice_max`. I have defined two variables, `ties_ok` and `max_n`. The latter is set to 3 to test the code; for the top 10 genes you want `max_n <- 10`. The former can be set to `FALSE` if you want to discard ties and keep only the first rows found.
df1 <- " gene tail_len
1 SPAC20G4.06c 3
2 SPCC613.06 5
3 SPAC6F6.03c 2
4 SPAC20G4.06c 3
5 SPBC23G7.15c 5
6 SPAC589.10c 2
7 SPBC23G7.15c 3
8 SPAC22H12.04c 1
9 SPAC22H12.04c 12
10 SPAC6G10.11c 8
11 SPAC589.10c 31
12 SPBC18E5.06 16"
df1 <- read.table(text = df1, header = TRUE)
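suppressPackageStartupMessages(
  library(dplyr)
)

ties_ok <- TRUE    # keep every gene tied at the cutoff count
#ties_ok <- FALSE
max_n <- 3L        # number of top genes to keep

df1 %>%
  group_by(gene) %>%
  # occurrences and mean tail length per gene
  summarise(count = n(), mean_tail_len = mean(tail_len)) %>%
  # keep the max_n most frequent genes
  slice_max(count, n = max_n, with_ties = ties_ok) %>%
  select(-count)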
#> # A tibble: 4 × 2
#> gene mean_tail_len
#> <chr> <dbl>
#> 1 SPAC20G4.06c 3
#> 2 SPAC22H12.04c 6.5
#> 3 SPAC589.10c 16.5
#> 4 SPBC23G7.15c 4
Created on 2023-01-20 with reprex v2.0.2
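
To make the effect of `with_ties` concrete, here is a minimal sketch on the same toy data. The names `per_gene`, `top_with_ties` and `top_no_ties` are only illustrative and are not part of the code above. Four genes occur twice in this sample, so asking for the top 3 returns four rows when ties are kept and exactly three when they are not.

```r
suppressPackageStartupMessages(library(dplyr))

# Same toy data as in the question, rebuilt so this block runs on its own
df1 <- data.frame(
  gene = c("SPAC20G4.06c", "SPCC613.06", "SPAC6F6.03c", "SPAC20G4.06c",
           "SPBC23G7.15c", "SPAC589.10c", "SPBC23G7.15c", "SPAC22H12.04c",
           "SPAC22H12.04c", "SPAC6G10.11c", "SPAC589.10c", "SPBC18E5.06"),
  tail_len = c(3, 5, 2, 3, 5, 2, 3, 1, 12, 8, 31, 16)
)

# One row per gene: how often it occurs and its mean tail length
per_gene <- df1 %>%
  group_by(gene) %>%
  summarise(count = n(), mean_tail_len = mean(tail_len))

# Ties kept: every gene tied at the cutoff count is returned
top_with_ties <- per_gene %>% slice_max(count, n = 3, with_ties = TRUE)

# Ties dropped: exactly n rows, the first ones found among the tied genes
top_no_ties <- per_gene %>% slice_max(count, n = 3, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_no_ties)
```

With `with_ties = FALSE`, which of the tied genes gets dropped depends on row order, so on real data it may be worth checking the counts before deciding between the two settings.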