I want to generate missing values in a vector so that the missing values are grouped in sequences, to simulate periods of missing data of different lengths.
Let's say I have a vector of 10 000 values and I want to generate 12 sequences of NA at random locations in the vector, each sequence having a random length L
between 1 and 144 (144 simulates 1 day of missing values at a 10-minute timestep). The sequences must not overlap.
How can I do that? Thanks.
I tried combining lapply and seq without success.
An example expected output with 3 distinct sequences:
# 1 2 3 5 2 NA NA 5 4 6 8 9 10 11 NA NA NA NA NA NA 5 2 NA NA NA...
EDIT
I'm dealing with a seasonal time series so the NA must overwrite values and not be inserted as new elements.
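To illustrate the distinction, here is a minimal sketch with made-up values (the variable names are only for illustration): replacing by index keeps the vector length, whereas append() lengthens the vector and would shift the seasonal pattern.

```r
x <- 1:10

# Overwrite: positions 4 to 6 become NA, the length is unchanged
overwritten <- replace(x, 4:6, NA)

# Insert: three NAs are added after position 3, the vector grows
inserted <- append(x, rep(NA, 3), after = 3)

length(overwritten)  # 10
length(inserted)     # 13
```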
If both the starting position and the run length of each NA sequence are supposed to be random, I think you cannot be sure to find a fitting solution on the first try, since your constraint is that the sequences must not overlap.
Therefore I propose the following solution, which tries up to a limited number of times (max_iter) to find a fitting combination of starting positions and NA run lengths. If one is found, it is returned; if none is found within the defined maximum number of iterations, you'll just get a notice printed.
x <- 1:1000   # example vector
n <- 3        # number of NA runs
m <- 1:144    # candidate run lengths
f <- function(x, n, m, max_iter = 100) {
  i <- 0
  repeat {
    i <- i + 1
    idx <- sort(sample(seq_along(x), n))         # starting positions
    dist <- diff(c(idx, length(x)))              # distance to the next start (or to the end)
    na_len <- sample(m, n, replace = TRUE) - 1L  # offset from start to end of each NA run
    ok <- all(na_len < dist)                     # TRUE if no run reaches the next start
    if (ok || i == max_iter) break
  }
  if (ok) {
    replace(x, unlist(Map(":", idx, idx + na_len)), NA)
  } else {
    cat("no solution found in", max_iter, "iterations\n")
  }
}
f(x, n, m, max_iter = 20)
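To check a result, the NA runs can be summarised with rle(). This is a minimal sketch; the helper name na_runs is hypothetical and the input vector is made up for illustration.

```r
# Hypothetical helper: lengths of the consecutive NA runs in a vector, in order
na_runs <- function(v) {
  r <- rle(is.na(v))      # run-length encoding of the NA mask
  r$lengths[r$values]     # keep only the runs where the mask is TRUE
}

v <- c(1, 2, NA, NA, 5, 4, NA, NA, NA, NA, 5, 2, NA)
na_runs(v)  # 2 4 1
```

The same helper can be applied to the value returned by f(), provided a solution was found (otherwise f() returns NULL). Fewer runs than n would indicate that two runs ended up directly next to each other and merged.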
Of course you can increase the number of iterations easily, and you should note that with larger n it's increasingly difficult (more iterations required) to find a solution.
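If retrying is undesirable, one possible alternative (a sketch only, and a different technique from the retry loop above) is to draw the run lengths first and then distribute the remaining non-NA elements as random gaps around the runs. This always succeeds as long as the runs fit into the vector. The name place_na_runs and its arguments are my own illustrative choices, and the sketch assumes that at least one non-NA value should separate consecutive runs so that they stay distinct. The resulting placement is random, but not necessarily distributed the same way as in the retry approach.

```r
# Illustrative sketch: place n non-overlapping NA runs without retrying.
# place_na_runs, n and max_len are assumed names, not from the code above.
place_na_runs <- function(x, n, max_len) {
  len <- sample.int(max_len, n, replace = TRUE)   # run lengths in 1..max_len
  # Elements left over after the runs and the n - 1 mandatory separators
  free <- length(x) - sum(len) - (n - 1)
  stopifnot(free >= 0)
  # Split the free elements into n + 1 random gaps (some may be zero)
  cuts <- sort(sample.int(free + 1, n, replace = TRUE) - 1)
  gaps <- diff(c(0, cuts, free))
  # Gap in front of each run; runs after the first get one separator element
  lead <- gaps[seq_len(n)] + c(0, rep(1, n - 1))
  start <- cumsum(lead) + cumsum(c(0, len[-n])) + 1
  na_idx <- unlist(Map(function(s, l) seq(s, length.out = l), start, len))
  x[na_idx] <- NA                                  # overwrite, do not insert
  x
}

set.seed(42)
res <- place_na_runs(1:10000, n = 12, max_len = 144)
runs <- rle(is.na(res))
sum(runs$values)            # number of NA runs
runs$lengths[runs$values]   # their lengths
```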