I have a data table that defines the start and end coordinates for a set of sequnces. For example:
df1 <- data.frame(from = c(7, 22, 35, 21, 50),
to = c(13, 29, 43, 31, 60))
Given start and end coordinates (ie 1 and 100), I am trying to identify all integers not covered by the sequences, with the same output format. For example:
df2 <- data.frame(from = c(1, 14, 32, 44, 61),
to = c(6, 20, 34, 49, 100))
Here is my current attempt, in which I vectorise the sequences in df1 and then identify all integers that do not match the sequence 1:100.
seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
seq <- c(1:100)
df1_int <- unlist(seq2(from = df1$from, to = df1$to))
df1_int <- unique(df1_int)
df2_int <- seq[!seq %in% df1_int]
all(diff(df2_int) == 1)
However, this method is too slow for the dataset I want to apply it to (~100,000,000 integers), and I do not know how to reformat the vector df2_int into a dataframe in the format of df2.
Any help will be greatly appreciated!
NB: The sequences in df1 do not always start with the lowest integer (eg a sequence could run from 13 to 7, as opposed to from 7 to 13). There could also be sequences with only one integer (eg from 7 to 7).
Since you need a fast solution we could attempt a base R approach using setdiff
and split
. The vectorization we leave to mapply
. To find the factors where to split
we use findInterval
. To get the elements' start and end points of the resulting list we clear with range
.
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
# [,1] [,2]
# 0 1 6
# 1 14 20
# 2 32 34
# 3 44 49
# 4 61 100
Benchmark
As we can see from the benchmark, we have achieved a pretty fast solution.
Unit: microseconds
expr min lq mean median uq max neval cld
purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095 100 b
findInterval 250.801 256.9245 276.8609 273.3815 281.673 498.285 100 a