r vector random missing-data seq

Generate random sequences of NA of random lengths in a vector


I want to generate missing values in a vector so that the missing values are grouped in sequences, to simulate periods of missing data of different lengths.

Let's say I have a vector of 10 000 values and I want to generate 12 sequences of NA at random locations in the vector, each sequence having a random length L between 1 and 144 (144 simulates 2 days of missing values at a 10-minute timestep). The sequences must not overlap.

How can I do that? Thanks.

I tried combining lapply and seq without success.

An example expected output with 3 distinct sequences:

# 1 2 3 5 2 NA NA 5 4 6 8 9 10 11 NA NA NA NA NA NA 5 2 NA NA NA...

EDIT

I'm dealing with a seasonal time series so the NA must overwrite values and not be inserted as new elements.


Solution

  • If both the starting position and the run length of each NA sequence are supposed to be random, you cannot be sure of finding a valid arrangement on the first draw, because the sequences must not overlap.

    I therefore propose the following solution, which tries up to a limited number of times (max_iter) to find a valid combination of starting positions and NA run lengths. If one is found, it is returned; if none is found within max_iter iterations, you just get a notice.

    x <- 1:1000   # input vector
    n <- 3        # number of NA runs
    m <- 1:144    # candidate run lengths
    
    f <- function(x, n, m, max_iter = 100) {
      i <- 0
      repeat {
        i <- i + 1
        idx <- sort(sample(seq_along(x), n))         # starting positions
        # room from each start to the next start; + 1L lets the last run reach the end of x
        dist <- diff(c(idx, length(x) + 1L))
        na_len <- sample(m, n, replace = TRUE) - 1L  # lengths of NA runs, minus 1
        ok <- all(na_len < dist)                     # TRUE if no run overlaps the next or overruns x
        if (ok || i == max_iter) break
      }
    
      if (ok) {
        replace(x, unlist(Map(":", idx, idx + na_len)), NA)
      } else {
        message("no solution found in ", max_iter, " iterations")
        invisible(NULL)
      }
    }
    
    f(x, n, m, max_iter = 20)
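
To check that a result really contains the requested number of runs, run-length encoding the NA mask with `rle()` is one option. The sketch below is only an illustration: `na_runs` is a hypothetical helper name, and the small vector `v` is made up for the demonstration.

```r
# Hypothetical helper: summarise the NA runs of a vector via run-length encoding.
na_runs <- function(v) {
  r <- rle(is.na(v))                 # consecutive runs of NA / non-NA
  ends <- cumsum(r$lengths)          # last index of each run
  starts <- ends - r$lengths + 1L    # first index of each run
  data.frame(start = starts[r$values], length = r$lengths[r$values])
}

# Made-up example vector with two NA runs
v <- c(1, 2, 3, 5, 2, NA, NA, 5, 4, 6, NA, NA, NA)
out <- na_runs(v)
out   # NA runs start at positions 6 and 11, with lengths 2 and 3
```

Note that two runs placed directly next to each other show up here as a single longer run, so the number of rows can be smaller than n even though nothing overlaps.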
    

    You can easily increase the number of iterations. Note that with larger n it becomes increasingly difficult (more iterations are required) to find a solution.
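
If the retry loop becomes a bottleneck for larger n, one possible alternative is to draw the run lengths first and then split the remaining non-NA positions at random into the gaps around the runs, so that no retries are needed. This is a different technique from the one above, not a refinement of it, and the start positions it produces are not distributed in the same way. `place_na_runs` is a hypothetical name, and the sketch assumes that `m` holds positive integers and that `x` contains no NA beforehand.

```r
# Hypothetical helper: place n non-overlapping NA runs without retrying.
place_na_runs <- function(x, n, m) {
  lens <- m[sample.int(length(m), n, replace = TRUE)]  # run lengths drawn from m
  # non-NA slots left after reserving one separator between neighbouring runs
  spare <- length(x) - sum(lens) - (n - 1L)
  if (spare < 0) stop("the requested runs do not fit into x")
  # split the spare slots at random into n + 1 gaps (before, between, after the runs)
  cuts <- sort(sample.int(spare + 1L, n, replace = TRUE) - 1L)
  gaps <- diff(c(0L, cuts, spare))
  inner <- seq_len(n - 1L) + 1L
  gaps[inner] <- gaps[inner] + 1L    # add the reserved separators back
  # each run starts after all earlier gaps and all earlier runs
  starts <- cumsum(gaps[seq_len(n)]) + c(0L, cumsum(lens)[-n]) + 1L
  na_idx <- unlist(Map(function(s, l) seq.int(s, length.out = l), starts, lens))
  replace(x, na_idx, NA)
}

set.seed(1)                          # only to make the demo repeatable
res <- place_na_runs(1:10000, n = 12, m = 1:144)
r <- rle(is.na(res))
run_lengths <- r$lengths[r$values]   # lengths of the NA runs that were placed
```

Because every inner gap is at least one element wide, the runs stay distinct, and the values are overwritten in place, so the length of the series does not change.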