Search code examples
rdataframecountsubset

Count the non NA value in every 5th column and assign the total count to a new column


I have a dataframe (tibble) with over 3000 rows and over 100 cols. I want to count every 5th columns (e.g., StoZ.x, StoZ.y, StoZ.x.x, StoZ.y.y like that) if any data presence or not. If data is present on 5th column, then total count would be added by creating a new column.

A tibble: 1,500 × 23
chr     start   end.x   end.y StoZ.x Tier.x Gene.x cohort.x    end.x.x StoZ.y Tier.y Gene.y cohort.y
> >    <chr>   <dbl>   <dbl>   <dbl>  <dbl> <chr>  <chr>  <chr>         <dbl>  <dbl> <chr>  <chr>  <chr>   
1 chr1  2220001 2230000 2230000  -2.68 T2     SKI    cr19naBile…      NA  NA    NA     NA     NA      
2 chr1  2230001 2240000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA      
3 chr1  2240001 2250000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA      
4 chr1  2250001 2260000 2260000  -2.24 T2     SKI    cr19naBile… 2260000  -2.90 T2     SKI    ji15pnH…
5 chr1  2270001 2280000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA      
6 chr1  2280001 2290000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA      
7 chr1  2290001 2300000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA      
8 chr1  2550001 2560000      NA  NA    NA     NA     NA          2560000  -2.13 T1     TNFRS… ji15pnH…

The expected table would be like that, a addition column Count will refer the presence of gene in particular window.

A tibble: 1,500 × 23
chr     start   end.x   end.y StoZ.x Tier.x Gene.x cohort.x    end.x.x StoZ.y Tier.y Gene.y cohort.y   Count
> > >    <chr>   <dbl>   <dbl>   <dbl>  <dbl> <chr>  <chr>  <chr>         <dbl>  <dbl> <chr>  <chr>  <chr>   <num>
1 chr1  2220001 2230000 2230000  -2.68 T2     SKI    cr19naBile…      NA  NA    NA     NA     NA 1          
2 chr1  2230001 2240000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA 0     
3 chr1  2240001 2250000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA 0      
4 chr1  2250001 2260000 2260000  -2.24 T2     SKI    cr19naBile… 2260000  -2.90 T2     SKI    ji15pnH…. 2
5 chr1  2270001 2280000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA 0     
6 chr1  2280001 2290000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA 0     
7 chr1  2290001 2300000      NA  NA    NA     NA     NA               NA  NA    NA     NA     NA 0     
8 chr1  2550001 2560000      NA  NA    NA     NA     NA          2560000  -2.13 T1     TNFRS… ji15pnH… 1

Solution

  • You can use pick from the dplyr package to select the columns whose name starts with "StoZ" and sum the number of non-NA values by row.

    library(dplyr) # packageVersion("dplyr") >= 1.1.0
    
    ndf |> 
      mutate(count = rowSums(!is.na(pick(starts_with("StoZ")))))
    
    #>    chr   start   end.x   end.y StoZ.x Tier.x Gene.x    cohort.x end.x.x StoZ.y Tier.y Gene.y cohort.y count
    #> 1 chr1 2220001 2230000 2230000  -2.68     T2    SKI cr19naBile…      NA     NA   <NA>   <NA>     <NA>     1
    #> 2 chr1 2230001 2240000      NA     NA   <NA>   <NA>        <NA>      NA     NA   <NA>   <NA>     <NA>     0
    #> 3 chr1 2240001 2250000      NA     NA   <NA>   <NA>        <NA>      NA     NA   <NA>   <NA>     <NA>     0
    #> 4 chr1 2250001 2260000 2260000  -2.24     T2    SKI cr19naBile… 2260000  -2.90     T2    SKI ji15pnH…     2
    #> 5 chr1 2270001 2280000      NA     NA   <NA>   <NA>        <NA>      NA     NA   <NA>   <NA>     <NA>     0
    #> 6 chr1 2280001 2290000      NA     NA   <NA>   <NA>        <NA>      NA     NA   <NA>   <NA>     <NA>     0
    #> 7 chr1 2290001 2300000      NA     NA   <NA>   <NA>        <NA>      NA     NA   <NA>   <NA>     <NA>     0
    #> 8 chr1 2550001 2560000      NA     NA   <NA>   <NA>        <NA> 2560000  -2.13     T1 TNFRS… ji15pnH…     1