
Create new column based on mean of values that are found in specific rows in R


I have a table like this and would like to find the genes that appear most often (let's say the top 10 genes) and then compute the mean of tail_len for those top 10 genes.

                gene tail_len
    1   SPAC20G4.06c        3
    2     SPCC613.06        5
    3    SPAC6F6.03c        2
    4   SPAC20G4.06c        3
    5   SPBC23G7.15c        5
    6    SPAC589.10c        2
    7   SPBC23G7.15c        3
    8  SPAC22H12.04c        1
    9  SPAC22H12.04c       12
    10  SPAC6G10.11c        8
    11   SPAC589.10c       31
    12   SPBC18E5.06       16

Solution

  • Here is a way with slice_max. I have defined two variables, ties_ok and max_n. The latter is set to 3 to test the code; for the top 10 genes you want max_n <- 10. The former can be set to FALSE if you want to discard ties and keep only the first rows found.

    df1 <- "    gene    tail_len
    1   SPAC20G4.06c    3
    2   SPCC613.06  5
    3   SPAC6F6.03c     2
    4   SPAC20G4.06c    3
    5   SPBC23G7.15c    5
    6   SPAC589.10c     2
    7   SPBC23G7.15c    3
    8   SPAC22H12.04c   1
    9   SPAC22H12.04c   12
    10  SPAC6G10.11c    8
    11  SPAC589.10c     31
    12  SPBC18E5.06     16"
    df1 <- read.table(text = df1, header = TRUE)
    
    suppressPackageStartupMessages(
      library(dplyr)
    )
    
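    # with_ties = TRUE keeps every gene tied at the cutoff count;
    # set ties_ok <- FALSE to return exactly max_n rows.
    ties_ok <- TRUE
    #ties_ok <- FALSE
    max_n <- 3L  # 3 for testing; use 10L for the top 10 genes
    df1 %>%
      group_by(gene) %>%
      # count the rows per gene and average tail_len within each gene
      summarise(count = n(), mean_tail_len = mean(tail_len)) %>%
      # keep the max_n most frequent genes
      slice_max(count, n = max_n, with_ties = ties_ok) %>%
      select(-count)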
    #> # A tibble: 4 × 2
    #>   gene          mean_tail_len
    #>   <chr>                 <dbl>
    #> 1 SPAC20G4.06c            3  
    #> 2 SPAC22H12.04c           6.5
    #> 3 SPAC589.10c            16.5
    #> 4 SPBC23G7.15c            4
    

    Created on 2023-01-20 with reprex v2.0.2
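
The title asks for a new column, while the answer above returns a summary table. As a rough sketch only, the same steps (count per gene, keep the most frequent genes with ties, average tail_len per gene) could be done in base R and the per-gene mean written back onto the original rows. The names counts, cutoff, top_genes and mean_by_gene are made up for illustration, and filling NA for genes outside the top set is an assumption, not something the question specifies.

```r
# Same example data as in the question, rebuilt so this block runs on its own.
df1 <- data.frame(
  gene = c("SPAC20G4.06c", "SPCC613.06", "SPAC6F6.03c", "SPAC20G4.06c",
           "SPBC23G7.15c", "SPAC589.10c", "SPBC23G7.15c", "SPAC22H12.04c",
           "SPAC22H12.04c", "SPAC6G10.11c", "SPAC589.10c", "SPBC18E5.06"),
  tail_len = c(3, 5, 2, 3, 5, 2, 3, 1, 12, 8, 31, 16),
  stringsAsFactors = FALSE
)

max_n <- 3L  # 3 for testing, as in the answer; use 10L for the top 10 genes

# Number of rows per gene, as a plain named integer vector.
counts <- vapply(split(df1$tail_len, df1$gene), length, integer(1))

# Count of the max_n-th most frequent gene; genes tied at this count are kept.
cutoff <- sort(counts, decreasing = TRUE)[min(max_n, length(counts))]
top_genes <- names(counts)[counts >= cutoff]

# Mean tail_len per gene, computed only over rows of the top genes.
top_rows <- df1[df1$gene %in% top_genes, ]
mean_by_gene <- vapply(split(top_rows$tail_len, top_rows$gene), mean, numeric(1))

# New column: the gene's mean for top genes, NA elsewhere (assumption).
df1$mean_tail_len <- unname(mean_by_gene[as.character(df1$gene)])

print(df1)
```

This keeps ties in the same spirit as with_ties = TRUE, so with max_n <- 3L four genes qualify, as in the dplyr output above.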