Why does R 'sample' some columns more than others?


I am testing the impact of missing data on regression analysis. So, using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this leaves some columns with many more missing values than others. See the example below:

#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]),B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10,1,10))

#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame,ColumnStart, ColumnEnd,ProportionRemove){
  #ci is the opposite of the proportion
  ci = 1-ProportionRemove
  Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)], function(x) x[sample(c(TRUE, NA), prob = c(ci,ProportionRemove), size = length(DataFrame), replace = TRUE)])}

#Randomly sample column 2 - 5 within DF, deleting 80% of the observation per column
Test = RandomSample(DF, 2, 5, 0.8)

I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.

       B         C         D  E
 [1,] NA 24.004402  7.201558 NA
 [2,] NA        NA        NA NA
 [3,] NA  4.029659        NA NA
 [4,] NA        NA        NA NA
 [5,] NA 29.377632        NA NA
 [6,] NA  3.340918 -2.131747 NA
 [7,] NA        NA        NA NA
 [8,] NA 15.967318        NA NA
 [9,] NA        NA        NA NA
[10,] NA -8.078221        NA NA 

In summary, I want to replace a proportion of observations with NAs in each column.

Any help is greatly appreciated!!!


Solution

  • This makes sense to me. As @Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).

    If you want random samples with guaranteed ratios, try this:

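    guaranteedSampling <- function(DataFrame, ProportionRemove) {
      # number of values to blank out in each column (at least one)
      n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
      # for each column, draw n distinct row indices (sampling without replacement)
      inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size=n), simplify=FALSE)
      # set the drawn rows of each column to NA; DataFrame[] keeps the data.frame structure
      DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs=list(NA), SIMPLIFY=FALSE)
      DataFrame
    }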
    
    set.seed(2)
    guaranteedSampling(DF[2:5], 0.8)
    #           B         C         D        E
    # 1        NA        NA        NA       NA
    # 2        NA        NA        NA       NA
    # 3        NA        NA        NA       NA
    # 4  6.792463 10.582938        NA       NA
    # 5        NA        NA -0.612816       NA
    # 6        NA -2.278758        NA       NA
    # 7        NA        NA        NA 2.245884
    # 8        NA        NA        NA 5.993387
    # 9  7.863310        NA  9.042127       NA
    # 10       NA        NA        NA       NA
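
One further detail may explain part of the spread in the question, beyond ordinary randomness. `RandomSample` passes `size = length(DataFrame)` to `sample`, and for a data frame `length()` returns the number of columns (5 here), not the number of rows (10). Each column therefore gets only 5 keep/drop draws, which R recycles over the 10 rows. This would account for the pattern in rows 1-5 of the printed output repeating in rows 6-10. The sketch below is only an illustration of independent per-row masking, not the answerer's method. The function name `maskIndependently` and the seed are assumptions made up for this example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Same shape as the question's data: 10 rows, 5 columns
DF <- data.frame(A = letters[1:10],
                 B = rnorm(10, 1, 10), C = rnorm(10, 1, 10),
                 D = rnorm(10, 1, 10), E = rnorm(10, 1, 10))

length(DF)  # 5: for a data frame, length() counts columns
nrow(DF)    # 10: the number of observations per column

# Hypothetical variant of RandomSample (name not from the thread):
# one independent keep/drop draw per row of each selected column.
maskIndependently <- function(DataFrame, ColumnStart, ColumnEnd, ProportionRemove) {
  cols <- ColumnStart:ColumnEnd
  DataFrame[cols] <- lapply(DataFrame[cols], function(x) {
    # TRUE means "replace this value with NA"
    drop <- sample(c(FALSE, TRUE), size = length(x), replace = TRUE,
                   prob = c(1 - ProportionRemove, ProportionRemove))
    x[drop] <- NA
    x
  })
  DataFrame
}

Test <- maskIndependently(DF, 2, 5, 0.8)
colSums(is.na(Test[2:5]))  # NA count per column; still varies from run to run

# Chance that a single column ends up with no data at all
0.8^5   # about 0.33 when only 5 draws are recycled over 10 rows
0.8^10  # about 0.11 with 10 independent draws
```

With only 5 effective draws per column, roughly a third of columns (about 13 of 40) would be expected to come out empty, which is in the neighbourhood of the 17 reported. Independent draws still do not guarantee an exact proportion per column; for that, sampling a fixed number of row indices without replacement, as in `guaranteedSampling`, is the more direct route.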