How can I subset a dataframe in multiple dataframes by values?

I have a very long dataframe structured as follows:

df <- data.frame(chr = c("chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1"), start = c(100,300,500,700,900,1100,1300,1500,1900,2100), end = c(200,400,600,800,1000,1200,1400,1600,1800,2000), sv = c("si", "si", "si", "si","si", "si","si", "si","si", "si"))

How can I calculate how many "si" there are every 500. So, from 0 (start) to 500 (end) then from 501 (start) to 1001 (end) etc etc.

I tried creating vector of start and end coordinates like this:

start <- c(1,501,1002,1503) 
end <- c(500, 1001, 1502, 2003)

And tried with this:

calculate <- function(df,start,end) {
subset(df, start >= start & end <= end)
table(df$sv)
}

But it doesn't give me how many "si" there were for every 500. It just tells me the total count of "si"

Any suggestions?

Solution

Using cut and consecutive_id (>= dplyr 1.1.0)

library(dplyr)

df %>% 
  group_by(grp = consecutive_id(cut(start, 
                   seq(0, start[nrow(df)], 500), right = F))) %>% 
  mutate(Count = sum(sv == "si")) %>% 
  ungroup() %>% 
  select(-grp)
# A tibble: 10 × 5
   chr   start   end sv    Count
   <chr> <dbl> <dbl> <chr> <int>
 1 chr1    100   200 si        2
 2 chr1    300   400 si        2
 3 chr1    500   600 si        3
 4 chr1    700   800 si        3
 5 chr1    900  1000 si        3
 6 chr1   1100  1200 si        2
 7 chr1   1300  1400 si        2
 8 chr1   1500  1600 si        2
 9 chr1   1900  1800 si        2
10 chr1   2100  2000 si        1