Search code examples
rdataframedplyrsubset

How can I subset a dataframe in multiple dataframes by values?


I have a very long dataframe structured as follows:

df <- data.frame(chr = c("chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1"), start = c(100,300,500,700,900,1100,1300,1500,1900,2100), end = c(200,400,600,800,1000,1200,1400,1600,1800,2000), sv = c("si", "si", "si", "si","si", "si","si", "si","si", "si"))

How can I calculate how many "si" there are every 500. So, from 0 (start) to 500 (end) then from 501 (start) to 1001 (end) etc etc.

I tried creating vector of start and end coordinates like this:

start <- c(1,501,1002,1503) 
end <- c(500, 1001, 1502, 2003)

And tried with this:

calculate <- function(df,start,end) {
subset(df, start >= start & end <= end)
table(df$sv)
}

But it doesn't give me how many "si" there were for every 500. It just tells me the total count of "si"

Any suggestions?


Solution

  • Using cut and consecutive_id (>= dplyr 1.1.0)

    library(dplyr)
    
    df %>% 
      group_by(grp = consecutive_id(cut(start, 
                       seq(0, start[nrow(df)], 500), right = F))) %>% 
      mutate(Count = sum(sv == "si")) %>% 
      ungroup() %>% 
      select(-grp)
    # A tibble: 10 × 5
       chr   start   end sv    Count
       <chr> <dbl> <dbl> <chr> <int>
     1 chr1    100   200 si        2
     2 chr1    300   400 si        2
     3 chr1    500   600 si        3
     4 chr1    700   800 si        3
     5 chr1    900  1000 si        3
     6 chr1   1100  1200 si        2
     7 chr1   1300  1400 si        2
     8 chr1   1500  1600 si        2
     9 chr1   1900  1800 si        2
    10 chr1   2100  2000 si        1