I have a very long dataframe structured as follows:
df <- data.frame(chr = c("chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1","chr1"), start = c(100,300,500,700,900,1100,1300,1500,1900,2100), end = c(200,400,600,800,1000,1200,1400,1600,1800,2000), sv = c("si", "si", "si", "si","si", "si","si", "si","si", "si"))
How can I count how many "si" entries fall within each window of 500? That is, from 0 (start) to 500 (end), then from 501 (start) to 1001 (end), and so on.
I tried creating vectors of start and end coordinates like this:
start <- c(1,501,1002,1503)
end <- c(500, 1001, 1502, 2003)
And tried with this:
calculate <- function(df,start,end) {
  subset(df, start >= start & end <= end)
  table(df$sv)
}
But it doesn't give me the number of "si" in each window of 500. It just gives me the total count of "si".
Any suggestions?
Using cut and consecutive_id (dplyr >= 1.1.0):
library(dplyr)
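df %>%
  # bin start into left-closed windows [0,500), [500,1000), ...;
  # breaks extend past max(start) so the last row gets a real bin, not NA
  group_by(grp = consecutive_id(cut(start,
    seq(0, max(start) + 500, 500), right = FALSE))) %>%
  mutate(Count = sum(sv == "si")) %>%
  ungroup() %>%
  select(-grp)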
# A tibble: 10 × 5
   chr   start   end sv    Count
   <chr> <dbl> <dbl> <chr> <int>
 1 chr1    100   200 si        2
 2 chr1    300   400 si        2
 3 chr1    500   600 si        3
 4 chr1    700   800 si        3
 5 chr1    900  1000 si        3
 6 chr1   1100  1200 si        2
 7 chr1   1300  1400 si        2
 8 chr1   1500  1600 si        2
 9 chr1   1900  1800 si        2
10 chr1   2100  2000 si        1
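If one row per window is more convenient than a Count column repeated on every row, a minimal sketch along the same lines might look like the following. It assumes left-closed windows ([0, 500), [500, 1000), ...), matching the right = F setting in the cut call above, and it assigns each row to a window by its start only. The names bin_width, bin_start and n_si are chosen here purely for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  chr   = rep("chr1", 10),
  start = c(100, 300, 500, 700, 900, 1100, 1300, 1500, 1900, 2100),
  end   = c(200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000),
  sv    = rep("si", 10)
)

bin_width <- 500  # assumed window size

counts <- df %>%
  # lower edge of the window that contains each start coordinate
  mutate(bin_start = floor(start / bin_width) * bin_width) %>%
  # keep only the "si" rows, then count them per chromosome and window
  filter(sv == "si") %>%
  count(chr, bin_start, name = "n_si")

counts
```

Unlike consecutive_id, this does not rely on the rows being sorted by start, and windows that contain no "si" rows simply do not appear in the result.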