Write a loop in R to collapse rows based on identical consecutive column values


I am trying to write a loop in R to collapse rows. Let's say I'm using a data frame like this one:

r1 <- c(1, 1,1000,2)
r2 <- c(1, 1001,2000, 2)
r3 <- c(1, 2001,3000, 2)
r4 <- c(1, 3001,4000, 1)
r5 <- c(1, 4001,5000, 3)
r6 <- c(1, 5001,6000, 3)
r7 <- c(2, 1,1000,2 )
r8 <- c(2, 1001,2000, 1)
r9 <- c(2, 2001,3000, 2)
r10 <- c(2, 3001,4000, 1)
r11 <- c(2, 4001,5000, 1)
test <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11)
test <- as.data.frame(test)
colnames(test) <- c("chr", "start","end", "abs.sum")
rownames(test) <- NULL

This gives me a data frame that looks like this:

   chr start  end abs.sum
1    1     1 1000       2
2    1  1001 2000       2
3    1  2001 3000       2
4    1  3001 4000       1
5    1  4001 5000       3
6    1  5001 6000       3
7    2     1 1000       2
8    2  1001 2000       1
9    2  2001 3000       2
10   2  3001 4000       1
11   2  4001 5000       1

For each chr value, I want to collapse based on identical consecutive abs.sum, keeping the lowest value in start and the highest value in end. So, for example, I would like my final data frame to look like this:

  chr start  end abs.sum
1   1     1 3000       2
2   1  3001 4000       1
3   1  4001 6000       3
4   2     1 1000       2
5   2  1001 2000       1
6   2  2001 3000       2
7   2  3001 5000       1

I tried writing a for loop:

for (i in 1:nrow(test)) {
        
        if (test$abs.sum[i] == test$abs.sum[i + 1]) {
                test$end[i] <- test$end[i+1]
                test <- test[-i + 1]
                test <- test[-(i + 1),]
        }
        
}

Which returns the error:

Error in if (test$abs.sum[i] == test$abs.sum[i + 1]) { : argument is of length zero

I know this isn't correct, but it's what I have been able to piece together so far. I think this may require some combination of "while" and "for" loops, but I am stuck. Maybe a package already exists with a function that can do this?


Solution

  • You can do this with data.table, using a run-length id (rleid()) to label each run of identical consecutive abs.sum values:

    library(data.table)
    
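    # Group by chr and by the run id of abs.sum, take the min start and max end
    # of each run, then drop the helper run-id column (column 2)
    setDT(test)[, .(start=min(start), end=max(end), abs.sum=min(abs.sum)), .(chr,rleid(abs.sum))][,-2]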
    
    

    Output:

         chr start   end abs.sum
       <num> <num> <num>   <num>
    1:     1     1  3000       2
    2:     1  3001  4000       1
    3:     1  4001  6000       3
    4:     2     1  1000       2
    5:     2  1001  2000       1
    6:     2  2001  3000       2
    7:     2  3001  5000       1
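
    To see what the run-length id does, here is a small base R sketch that computes an equivalent id for the abs.sum column. It is an illustration only, not how data.table implements rleid(), and the name run_id is made up for this example.

    ```r
    # abs.sum column from the question's data frame
    abs.sum <- c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)

    # Base R stand-in for a run-length id (run_id is a hypothetical name):
    # start at 1 and add 1 wherever a value differs from the one before it
    run_id <- cumsum(c(TRUE, abs.sum[-1] != abs.sum[-length(abs.sum)]))

    # run_id is now 1 1 1 2 3 3 4 5 6 7 7
    print(run_id)
    ```

    The ids do not restart for each chr, so a run that continued across a chr boundary would share one id. That is why the grouping uses chr together with the run id.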
    

    Here is another option using dplyr, where consecutive_id() (available from dplyr 1.1.0) takes the place of data.table::rleid(). Thanks to @LeroyTyrone for pointing out the consecutive_id() function.

    library(dplyr)
    
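    test %>% 
      # id labels each run of identical consecutive abs.sum values
      group_by(chr, id=consecutive_id(abs.sum)) %>% 
      summarize(start=min(start), end=max(end), abs.sum=min(abs.sum)) %>% 
      # drop the helper id column from the result
      select(-id)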
    

    Output:

        chr start   end abs.sum
      <dbl> <dbl> <dbl>   <dbl>
    1     1     1  3000       2
    2     1  3001  4000       1
    3     1  4001  6000       3
    4     2     1  1000       2
    5     2  1001  2000       1
    6     2  2001  3000       2
    7     2  3001  5000       1
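
    If you do want a loop, as in the question, the following base R sketch is one way to do it. It assumes the rows are already sorted by chr and start, as in the example data. The error in the question most likely comes from test[-i + 1]: with i = 1 this is test[0], which drops every column, so test$abs.sum is NULL on the next pass and the comparison has length zero. The helper names same_chr and same_sum are made up for this example.

    ```r
    # Rebuild the example data from the question
    test <- data.frame(
      chr     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
      start   = c(1, 1001, 2001, 3001, 4001, 5001, 1, 1001, 2001, 3001, 4001),
      end     = c(1000, 2000, 3000, 4000, 5000, 6000, 1000, 2000, 3000, 4000, 5000),
      abs.sum = c(2, 2, 2, 1, 3, 3, 2, 1, 2, 1, 1)
    )

    # Use a while loop so the row count is re-checked on every pass and
    # the index only advances when no merge happened
    i <- 1
    while (i < nrow(test)) {
      same_chr <- test$chr[i] == test$chr[i + 1]
      same_sum <- test$abs.sum[i] == test$abs.sum[i + 1]
      if (same_chr && same_sum) {
        # Absorb row i + 1 into row i, then drop row i + 1
        test$start[i] <- min(test$start[i], test$start[i + 1])
        test$end[i] <- max(test$end[i], test$end[i + 1])
        test <- test[-(i + 1), ]
      } else {
        i <- i + 1
      }
    }
    rownames(test) <- NULL
    print(test)
    ```

    On the example data this leaves the same seven rows as the two options above. Dropping rows one at a time copies the data frame on each merge, so for large inputs the grouped approaches are likely to be much faster.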