I am trying to write a loop in R to collapse rows. Let's say I'm using a data frame like this one:
r1 <- c(1, 1,1000,2)
r2 <- c(1, 1001,2000, 2)
r3 <- c(1, 2001,3000, 2)
r4 <- c(1, 3001,4000, 1)
r5 <- c(1, 4001,5000, 3)
r6 <- c(1, 5001,6000, 3)
r7 <- c(2, 1,1000,2 )
r8 <- c(2, 1001,2000, 1)
r9 <- c(2, 2001,3000, 2)
r10 <- c(2, 3001,4000, 1)
r11 <- c(2, 4001,5000, 1)
test <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11)
test <- as.data.frame(test)
colnames(test) <- c("chr", "start","end", "abs.sum")
rownames(test) <- NULL
This gives me a data frame that looks like this:
   chr start  end abs.sum
1    1     1 1000       2
2    1  1001 2000       2
3    1  2001 3000       2
4    1  3001 4000       1
5    1  4001 5000       3
6    1  5001 6000       3
7    2     1 1000       2
8    2  1001 2000       1
9    2  2001 3000       2
10   2  3001 4000       1
11   2  4001 5000       1
For each chr value, I want to collapse rows with identical consecutive abs.sum values, keeping the lowest value in start and the highest value in end. So, for example, I would like my final data frame to look like this:
  chr start  end abs.sum
1   1     1 3000       2
2   1  3001 4000       1
3   1  4001 6000       3
4   2     1 1000       2
5   2  1001 2000       1
6   2  2001 3000       2
7   2  3001 5000       1
I tried writing a for loop:
for (i in 1:nrow(test)) {
  if (test$abs.sum[i] == test$abs.sum[i + 1]) {
    test$end[i] <- test$end[i+1]
    test <- test[-i + 1]
    test <- test[-(i + 1),]
  }
}
Which returns the error:
Error in if (test$abs.sum[i] == test$abs.sum[i + 1]) { : argument is of length zero
I know this isn't correct, but it's what I have been able to piece together so far. I think this may require some combination of "while" and "for" loops, but I am stuck. Maybe a package already exists with a function that can do this?
You can do this with a run-length id, which numbers each run of identical consecutive values:
library(data.table)
setDT(test)[
  , .(start = min(start), end = max(end), abs.sum = min(abs.sum)),
  by = .(chr, rleid(abs.sum))
][, -2]  # drop the helper rleid column
Output:
     chr start   end abs.sum
   <num> <num> <num>   <num>
1:     1     1  3000       2
2:     1  3001  4000       1
3:     1  4001  6000       3
4:     2     1  1000       2
5:     2  1001  2000       1
6:     2  2001  3000       2
7:     2  3001  5000       1
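To make the run-length id less of a black box, here is a minimal base R sketch of the same idea, applied to the abs.sum values of chr 1 from the example. The names x and run_id are only illustrative, and this is not how rleid is implemented internally; it just reproduces the numbering.

```r
# abs.sum values for chr 1 in the example data
x <- c(2, 2, 2, 1, 3, 3)

# A new run starts wherever a value differs from the one before it.
# The first element always starts a run, hence the leading TRUE.
starts_run <- c(TRUE, x[-1] != x[-length(x)])

# The cumulative sum of the run starts gives each run its own id.
run_id <- cumsum(starts_run)
run_id
# [1] 1 1 1 2 3 3
```

Rows that share a run_id are the ones that get collapsed into a single row.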
Here is another option using dplyr, where consecutive_id() takes the place of data.table::rleid (thanks to @LeroyTyrone for pointing out the consecutive_id() function):
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0
test %>%
  group_by(chr, id = consecutive_id(abs.sum)) %>%
  summarize(start = min(start), end = max(end), abs.sum = min(abs.sum)) %>%
  select(-id)
Output:
    chr start   end abs.sum
  <dbl> <dbl> <dbl>   <dbl>
1     1     1  3000       2
2     1  3001  4000       1
3     1  4001  6000       3
4     2     1  1000       2
5     2  1001  2000       1
6     2  2001  3000       2
7     2  3001  5000       1
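As for why the original loop fails: on the first iteration, test[-i + 1] evaluates to test[0], which drops every column, so test$abs.sum is NULL on the next pass and the comparison has length zero. Deleting rows while iterating over 1:nrow(test) would also shift the indices. If neither package is an option, a loop-free base R sketch along the following lines should give the same result. It rebuilds the example data so it can run on its own, and the names new_run, run and collapsed are only illustrative.

```r
# Rebuild the example data so the sketch is self-contained
test <- data.frame(
  chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
  end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
  abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
)

n <- nrow(test)

# A row starts a new run if it is the first row, or if chr or abs.sum
# differs from the previous row.
new_run <- c(TRUE,
             test$chr[-1] != test$chr[-n] |
               test$abs.sum[-1] != test$abs.sum[-n])
run <- cumsum(new_run)

# One output row per run: chr and abs.sum come from the run's first row,
# start is the run minimum and end is the run maximum.
collapsed <- data.frame(
  chr     = test$chr[new_run],
  start   = as.numeric(tapply(test$start, run, min)),
  end     = as.numeric(tapply(test$end, run, max)),
  abs.sum = test$abs.sum[new_run]
)
collapsed
```

This assumes the rows are already ordered by chr and start, as they are in the example.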