
Resample (bootstrap) data from matrix with x draws per row and y draws per column


I have a matrix of zeros and ones; about 30% of the entries are 1s. I want to estimate a confidence interval around this percentage (e.g., "if I sampled the whole population, there would likely be 28-32% 1s"). To do so, one can bootstrap from the sample, i.e., redraw the sample N times from itself with replacement and analyze the distribution of the percentage of 1s over all redrawn samples. However, my data is nested (highly correlated) within rows and within columns. I tested whether this nestedness makes a difference (since I have dichotomous variables I can use rflip(), which simulates a biased coin flip), and it does:
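
For context, a plain bootstrap of the proportion that ignores the nesting might look like this; a minimal sketch in base R, where the vector x stands in for the flattened data:

x <- c(1,1,1,1,1,0,0,0,0,0,  # same values as the example data below:
       1,0,0,0,0,0,0,0,0,0)  # 6 ones out of 20, i.e., 30%
p_hat <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))
quantile(p_hat, c(0.025, 0.975)) # percentile confidence interval for the proportion of 1s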

library("mosaic")

#### data example ####
c1 <- c(1,1,1,1,1,0,0,0,0,0) # high probability of "1"
c2 <- c(1,0,0,0,0,0,0,0,0,0) # low probability of "1"
d  <- cbind.data.frame(c1, c2)

#### a) resample over entire data ####
b <- vector()
for (i in 1:10000){
  b[i] <- rflip(20,    # flip 20 times
                6/20)/ # probability of "1": 6/20, i.e., probability of "0": 14/20
          20           # divide by 20 to get the relative frequency
}

mean(b) # returns 0.3007955    # mean over 10000 replications: close to 6/20
sd(b)   # returns 0.1024339    # standard deviation, needed for the confidence interval

#### b) resample per column ####
b1 <- vector()
b2 <- vector()
bt <- vector()
for (i in 1:10000){
  b1[i] <- rflip(10, (5/10)) # flip 10 times with c1's probability of "1"
  b2[i] <- rflip(10, (1/10)) # flip 10 times with c2's probability of "1"
  bt[i] <- (b1[i]+b2[i])/20  # sum up all 20 flips and divide by 20 to return relative frequency
}
mean(bt)# returns 0.3001475   # mean similar to a)
sd(bt)  # returns 0.09214384  # standard deviation smaller than a)

When I redraw 10 times from column c1 and 10 times from column c2, and replicate this process 10,000 times, the distribution of the observed proportions is narrower than when I sample 20 times from the entire data. If the probability of "1" were identical in both columns, approaches a) and b) would lead to the same standard deviation.
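
This difference can also be checked analytically: under a) the count of 1s is Binomial(20, 6/20), while under b) it is the sum of independent Binomial(10, 5/10) and Binomial(10, 1/10) counts. The closed-form standard deviations match the simulations:

sqrt(20 * 0.3 * 0.7) / 20                   # 0.1024695, close to sd(b) in a)
sqrt(10 * 0.5 * 0.5 + 10 * 0.1 * 0.9) / 20  # 0.0921954, close to sd(bt) in b)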

I now want to consider not only the columns but also the rows: e.g., I want to draw 10 times from column 1 and 10 times from column 2, with the constraint that among these 20 draws there are exactly two draws per row. My first idea would be (see the sketch after this list):

for each bootstrap replicate {

  1. Randomize the order of the columns.
  2. Draw 10 times from column 1, constrained to at most 2 draws per row.
  3. Draw 10 times from column 2, constrained so that the draws from column 1 plus the draws from column 2 amount to at most 2 per row (if row 1 was drawn twice for column 1, it cannot be drawn again for column 2).

}
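
With only two columns, step 3 is actually fully determined once column 1 is drawn: each row's leftover slots must all go to column 2 (the leftovers sum to exactly 10), so randomizing the column order changes nothing in this case. A naive base-R sketch of the idea, with illustrative variable names:

slots <- rep(1:10, each = 2)       # each row offers exactly 2 draw slots
c1_rows <- sample(slots, 10)       # 10 draws for column 1, at most 2 per row
c1_counts <- tabulate(c1_rows, 10) # how often each row was drawn for column 1
c2_counts <- 2 - c1_counts         # column 2 gets each row's remaining slots
(sum(d$c1 * c1_counts) + sum(d$c2 * c2_counts)) / 20 # proportion of 1s in one replicate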

Does anybody have an idea of how to do that, or a better idea altogether? It would probably require a different function than rflip(). Any help would be much appreciated!

Thanks, ajj


Solution

  • Take a look at r2dtable, which generates random two-way tables with given row and column marginal totals.

    nrows <- 10L
    ncols <- 6L
    nr <- rep(ncols, nrows) # row margins: each row is drawn ncols times
    nc <- rep(nrows, ncols) # column margins: each column is drawn nrows times
    
    m <- r2dtable(1, nr, nc)[[1]] # one random table of draw counts with these margins
    m
    #>       [,1] [,2] [,3] [,4] [,5] [,6]
    #>  [1,]    2    0    1    1    1    1
    #>  [2,]    2    1    1    0    1    1
    #>  [3,]    0    1    1    2    2    0
    #>  [4,]    0    3    1    1    0    1
    #>  [5,]    1    2    1    0    1    1
    #>  [6,]    0    0    1    2    0    3
    #>  [7,]    3    0    1    1    0    1
    #>  [8,]    2    0    0    1    3    0
    #>  [9,]    0    0    1    2    1    2
    #> [10,]    0    3    2    0    1    0
    rowSums(m)
    #>  [1] 6 6 6 6 6 6 6 6 6 6
    colSums(m)
    #> [1] 10 10 10 10 10 10
    

    I want to draw 10 times from column 1 and 10 times from column 2, with the constraint that among these 20 draws there are exactly two draws per row.

    That would be:

    nrows <- 10L
    ncols <- 2L
    nr <- rep(ncols, nrows)
    nc <- rep(nrows, ncols)
    
    m <- r2dtable(1, nr, nc)[[1]]
    m
    #>       [,1] [,2]
    #>  [1,]    2    0
    #>  [2,]    0    2
    #>  [3,]    1    1
    #>  [4,]    2    0
    #>  [5,]    0    2
    #>  [6,]    2    0
    #>  [7,]    1    1
    #>  [8,]    1    1
    #>  [9,]    0    2
    #> [10,]    1    1
    

    And you are correct that the SD is smaller when the row/column counts are constrained during resampling:

    # helper that multiplies each slice of a cube (one slice per replicate) elementwise by d
    broadcast <- Rcpp::cppFunction(
      "arma::cube broadcast(arma::cube& m, arma::mat& d) {return(m.each_slice() % d);}",
      depends = "RcppArmadillo",
      plugins = "cpp11"
    )
    
    c1 <- c(1,1,1,1,1,0,0,0,0,0) # high probability for "1"
    c2 <- c(1,0,0,0,0,0,0,0,0,0) # low probability for "1"
    d <- cbind(c1, c2)
    nr <- nrow(d)
    nc <- ncol(d)
    nreps <- 1e4L
    
    bt <- colSums(
      broadcast(
        simplify2array(    # stack the nreps tables into an nr x nc x nreps cube
          r2dtable(
            nreps,
            rep(nc, nr),   # row margins: each row is drawn nc times
            rep(nr, nc)    # column margins: each column is drawn nr times
          )
        ),
        d                  # elementwise product keeps only the draws that hit a 1
      ),
      dims = 2             # sum over rows and columns: one total per replicate
    )/nr/nc                # divide by the nr*nc draws to get relative frequencies
    p <- mean(d)
    mean(d) # proportion of 1s in the full sample ("true" mean)
    #> [1] 0.3
    mean(bt) # estimated mean
    #> [1] 0.300685
    sqrt(p*(1 - p)/nr/nc) # expected SD from uniform samples of size nr*nc
    #> [1] 0.1024695
    sqrt((p*(1 - p) - var(colMeans(d))*(1 - 1/nc))/nr/nc) # expected SD from column-wise resampling
    #> [1] 0.09219544
    sd(bt) # estimated SD from constrained row and column resampling
    #> [1] 0.05604547
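
    To tie this back to the original goal, a percentile confidence interval can be read straight off bt with quantile(bt, c(0.025, 0.975)). For completeness, a sketch of the same statistic without compiled code, at some speed cost, reusing the objects defined above:

    bt2 <- vapply(
      r2dtable(nreps, rep(nc, nr), rep(nr, nc)),
      function(m) sum(d * m), # number of 1s drawn in one replicate
      numeric(1)
    )/nr/nc
    quantile(bt2, c(0.025, 0.975)) # percentile CI under the constrained bootstrap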