Search code examples
rsequencelarge-data

How to identify all sequential numbers not covered by 'to' and 'from' positions?


I have a data table that defines the start and end coordinates for a set of sequnces. For example:

df1 <- data.frame(from = c(7, 22, 35, 21, 50),
              to = c(13, 29, 43, 31, 60))

Given start and end coordinates (ie 1 and 100), I am trying to identify all integers not covered by the sequences, with the same output format. For example:

df2 <- data.frame(from = c(1, 14, 32, 44, 61),
              to = c(6, 20, 34, 49, 100))

Here is my current attempt, in which I vectorise the sequences in df1 and then identify all integers that do not match the sequence 1:100.

seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
seq <- c(1:100)
df1_int <- unlist(seq2(from = df1$from, to = df1$to))
df1_int <- unique(df1_int)
df2_int <- seq[!seq %in% df1_int]
all(diff(df2_int) == 1)

However, this method is too slow for the dataset I want to apply it to (~100,000,000 integers), and I do not know how to reformat the vector df2_int into a dataframe in the format of df2.

Any help will be greatly appreciated!

NB: The sequences in df1 do not always start with the lowest integer (eg a sequence could run from 13 to 7, as opposed to from 7 to 13). There could also be sequences with only one integer (eg from 7 to 7).


Solution

  • Since you need a fast solution we could attempt a base R approach using setdiff and split. The vectorization we leave to mapply. To find the factors where to split we use findInterval. To get the elements' start and end points of the resulting list we clear with range.

    d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
    t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
    #   [,1] [,2]
    # 0    1    6
    # 1   14   20
    # 2   32   34
    # 3   44   49
    # 4   61  100
    

    Benchmark

    As we can see from the benchmark, we have achieved a pretty fast solution.

    Unit: microseconds
             expr      min        lq      mean    median       uq      max neval cld
            purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095   100   b
     findInterval  250.801  256.9245  276.8609  273.3815  281.673  498.285   100  a