
compress / summarize string start and length data in R


I have a data.frame of (sub)string positions within a larger string. The data contains the start of each (sub)string and its length. The end position of each (sub)string can be easily calculated.

data1 <- data.frame(start = c(1,3,4,9,10,13),
                   length = c(2,1,3,1,2,1)
                   )

data1$end <- (data1$start + data1$length - 1)

data1
#>   start length end
#> 1     1      2   2
#> 2     3      1   3
#> 3     4      3   6
#> 4     9      1   9
#> 5    10      2  11
#> 6    13      1  13

Created on 2019-12-10 by the reprex package (v0.3.0)

I would like to 'compress' this data.frame by merging contiguous (sub)strings (those where one begins immediately after the previous one ends) so that my new data looks like this:

data2 <- data.frame(start = c(1,9,13),
                   length = c(6,3,1)
                   )

data2$end <- (data2$start + data2$length - 1)

data2
#>   start length end
#> 1     1      6   6
#> 2     9      3  11
#> 3    13      1  13


Is there a solution, preferably in base R, that gets me from data1 to data2?


Solution

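A new group starts wherever a row's start is not exactly one past the previous row's end. Take the cumulative sum of that condition to get a grouping vector, split on it, and collapse each group to a single row:

    # gap between each start and the previous end; the leading 0 makes the first row open a group
    f <- cumsum(with(data1, c(0, start[-1] - head(end, -1))) != 1)
    # for data1, f is 1 1 1 2 2 3
    do.call(rbind, lapply(split(data1, f), function(x){
        with(x, data.frame(start = start[1],
                           length = tail(end, 1) - start[1] + 1,
                           end = tail(end, 1)))}))
    #  start length end
    #1     1      6   6
    #2     9      3  11
    #3    13      1  13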
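
For reuse, the same grouping idea can be wrapped in a function. The following is only a sketch: `compress_ranges()` is a hypothetical name introduced here, not part of the original answer, and it assumes the input has at least one row, is sorted by `start`, and contains no overlapping ranges. It uses `tapply()` instead of `split()` plus `rbind()`, which is a substitution for illustration rather than the answer's own method.

```r
# Hypothetical helper (the name is an assumption, not from the original answer).
# Assumes: at least one row, rows sorted by `start`, no overlapping ranges.
compress_ranges <- function(df) {
  end <- df$start + df$length - 1
  # Gap between each start and the previous end; 1 means "directly adjacent".
  # The leading 0 makes the first row open a group.
  gap <- c(0, df$start[-1] - head(end, -1))
  # The group id increases by one wherever the gap is not exactly 1.
  grp <- cumsum(gap != 1)
  # Collapse each group to its first start and last end.
  new_start <- as.vector(tapply(df$start, grp, min))
  new_end   <- as.vector(tapply(end, grp, max))
  data.frame(start  = new_start,
             length = new_end - new_start + 1,
             end    = new_end)
}

data1 <- data.frame(start  = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
res <- compress_ranges(data1)
res
```

If the input may be unsorted, ordering it first with `df[order(df$start), ]` should satisfy the sorting assumption. Overlapping ranges would need a different grouping condition.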