r dataframe na missing-data

Randomly insert NAs into a dataframe proportionally


I have a complete dataframe. I want 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.

A <- c(1:10)
B <- c(11:20)
C <- c(21:30)
df <- data.frame(A, B, C)

Can anyone suggest a quick way of doing that?


Solution

  • df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
    head(df)
    ##   A  B  c
    ## 1 1 11 21
    ## 2 2 12 22
    ## 3 3 13 23
    ## 4 4 14 24
    ## 5 5 15 25
    ## 6 6 16 26
    
    # For each column, index by a random TRUE/NA vector: TRUE keeps the value, NA yields NA
    as.data.frame(lapply(df, function(cc) {
      cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]
    }))
    ##     A  B  c
    ## 1   1 11 21
    ## 2   2 12 22
    ## 3   3 13 23
    ## 4   4 14 24
    ## 5   5 NA 25
    ## 6   6 16 26
    ## 7  NA 17 27
    ## 8   8 18 28
    ## 9   9 19 29
    ## 10 10 20 30
    

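    Each value is replaced independently with probability 0.15, so the share of NAs varies from run to run and will rarely be exactly 15%. For the 20% asked for in the question, use prob = c(0.8, 0.2).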
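
If you need exactly 20% of the cells to be NA rather than 20% on average, one option is to sample a fixed number of cell positions without replacement. This is only a sketch and not part of the answer above: `insert_nas` and its `prop` argument are hypothetical names introduced here, and it assumes the NAs should be spread over the whole dataframe rather than evenly per column.

```r
# Hypothetical helper (not from the original answer): set an exact share of cells to NA.
insert_nas <- function(df, prop = 0.2) {
  n_cells <- nrow(df) * ncol(df)
  n_na <- round(prop * n_cells)        # number of cells to blank out
  idx <- sample(n_cells, n_na)         # distinct positions, counted column by column
  rows <- (idx - 1) %% nrow(df) + 1    # row of each sampled position
  cols <- (idx - 1) %/% nrow(df) + 1   # column of each sampled position
  for (k in seq_along(idx)) {
    df[rows[k], cols[k]] <- NA
  }
  df
}

set.seed(42)  # only so the example is reproducible
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
df_na <- insert_nas(df, prop = 0.2)
sum(is.na(df_na))  # 6, i.e. exactly 20% of the 30 cells
```

Because the positions are drawn without replacement, the count of NAs is fixed at round(prop * number of cells); only their locations change between runs.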