
Resample (bootstrap) data from matrix with x draws per row and y draws per column


I have a matrix of zeros and ones; about 30% of the entries are 1s. I want to estimate a confidence interval around this percentage (e.g., "if I sampled the whole population, there would likely be 28-32% 1s"). To do so, one can bootstrap from the sample, i.e., redraw the sample N times from itself with replacement and analyze the distribution of the percentage of 1s over all redrawn samples. However, my data is nested (highly correlated) within rows and within columns. I tested whether this nestedness makes a difference (since I have dichotomous variables I can use rflip(), which simulates a biased coin flip), and it does:
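
For context, a plain bootstrap of the proportion that ignores the nesting might look like this; a minimal sketch in base R, where the vector x stands in for the flattened data:

x <- c(1,1,1,1,1,0,0,0,0,0,  # same values as the example data below:
       1,0,0,0,0,0,0,0,0,0)  # 6 ones out of 20, i.e., 30%
p_hat <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))
quantile(p_hat, c(0.025, 0.975)) # percentile confidence interval for the proportion of 1s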

library("mosaic")

#### data example ####
c1 <- c(1,1,1,1,1,0,0,0,0,0) # high probability of "1"
c2 <- c(1,0,0,0,0,0,0,0,0,0) # low probability of "1"
d  <- cbind.data.frame(c1, c2)

#### a) resample over entire data ####
b <- vector()
for (i in 1:10000){
  b[i] <- rflip(20,    # flip 20 times
                6/20)/ # probability of "1": 6/20, i.e., probability of "0": 14/20
          20           # divide by 20 to get the relative frequency
}

mean(b) # returns 0.3007955    # mean over 10000 replications: close to 6/20
sd(b)   # returns 0.1024339    # standard deviation, needed for the confidence interval

#### b) resample per column ####
b1 <- vector()
b2 <- vector()
bt <- vector()
for (i in 1:10000){
  b1[i] <- rflip(10, (5/10)) # flip 10 times with c1's probability of "1"
  b2[i] <- rflip(10, (1/10)) # flip 10 times with c2's probability of "1"
  bt[i] <- (b1[i]+b2[i])/20  # sum up all 20 flips and divide by 20 to return relative frequency
}
mean(bt)# returns 0.3001475   # mean similar to a)
sd(bt)  # returns 0.09214384  # standard deviation smaller than a)

When I redraw 10 times from column c1 and 10 times from column c2, and replicate this process 10,000 times, the distribution of the observed proportions is narrower than when I sample 20 times from the entire data. If the probability of "1" were identical in both columns, approaches a) and b) would lead to the same standard deviation.
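
This difference can also be checked analytically: under a) the count of 1s is Binomial(20, 6/20), while under b) it is the sum of independent Binomial(10, 5/10) and Binomial(10, 1/10) counts. The closed-form standard deviations match the simulations:

sqrt(20 * 0.3 * 0.7) / 20                   # 0.1024695, close to sd(b) in a)
sqrt(10 * 0.5 * 0.5 + 10 * 0.1 * 0.9) / 20  # 0.0921954, close to sd(bt) in b)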

I now want to consider not only the columns but also the rows: e.g., I want to draw 10 times from column 1 and 10 times from column 2, with the constraint that among these 20 draws there are exactly two draws per row. My first idea would be (see the sketch after this list):

for each bootstrap replicate {

  1. Randomize the order of the columns.
  2. Draw 10 times from column 1, constrained to at most 2 draws per row.
  3. Draw 10 times from column 2, constrained so that the draws from column 1 plus the draws from column 2 amount to at most 2 per row (if row 1 was drawn twice for column 1, it cannot be drawn again for column 2).

}
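
With only two columns, step 3 is actually fully determined once column 1 is drawn: each row's leftover slots must all go to column 2 (the leftovers sum to exactly 10), so randomizing the column order changes nothing in this case. A naive base-R sketch of the idea, with illustrative variable names:

slots <- rep(1:10, each = 2)       # each row offers exactly 2 draw slots
c1_rows <- sample(slots, 10)       # 10 draws for column 1, at most 2 per row
c1_counts <- tabulate(c1_rows, 10) # how often each row was drawn for column 1
c2_counts <- 2 - c1_counts         # column 2 gets each row's remaining slots
(sum(d$c1 * c1_counts) + sum(d$c2 * c2_counts)) / 20 # proportion of 1s in one replicate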

Does anybody have an idea of how to do that, or a better idea altogether? It would probably require a different function than rflip(). Any help would be much appreciated!

Thanks, ajj


Solution

  • Take a look at r2dtable, which generates random two-way tables with given row and column marginal totals.

    nrows <- 10L
    ncols <- 6L
    nr <- rep(ncols, nrows) # row margins: each row is drawn ncols times
    nc <- rep(nrows, ncols) # column margins: each column is drawn nrows times
    
    m <- r2dtable(1, nr, nc)[[1]] # one random table of draw counts with these margins
    m
    #>       [,1] [,2] [,3] [,4] [,5] [,6]
    #>  [1,]    2    0    1    1    1    1
    #>  [2,]    2    1    1    0    1    1
    #>  [3,]    0    1    1    2    2    0
    #>  [4,]    0    3    1    1    0    1
    #>  [5,]    1    2    1    0    1    1
    #>  [6,]    0    0    1    2    0    3
    #>  [7,]    3    0    1    1    0    1
    #>  [8,]    2    0    0    1    3    0
    #>  [9,]    0    0    1    2    1    2
    #> [10,]    0    3    2    0    1    0
    rowSums(m)
    #>  [1] 6 6 6 6 6 6 6 6 6 6
    colSums(m)
    #> [1] 10 10 10 10 10 10
    

    I want to draw 10 times from column 1 and 10 times from column 2, with the constraint that among these 20 draws there are exactly two draws per row.

    That would be:

    nrows <- 10L
    ncols <- 2L
    nr <- rep(ncols, nrows)
    nc <- rep(nrows, ncols)
    
    m <- r2dtable(1, nr, nc)[[1]]
    m
    #>       [,1] [,2]
    #>  [1,]    2    0
    #>  [2,]    0    2
    #>  [3,]    1    1
    #>  [4,]    2    0
    #>  [5,]    0    2
    #>  [6,]    2    0
    #>  [7,]    1    1
    #>  [8,]    1    1
    #>  [9,]    0    2
    #> [10,]    1    1
    

    And you are correct that the SD is smaller when the row/column counts are constrained during resampling:

    # helper that multiplies each slice of a cube (one slice per replicate) elementwise by d
    broadcast <- Rcpp::cppFunction(
      "arma::cube broadcast(arma::cube& m, arma::mat& d) {return(m.each_slice() % d);}",
      depends = "RcppArmadillo",
      plugins = "cpp11"
    )
    
    c1 <- c(1,1,1,1,1,0,0,0,0,0) # high probability for "1"
    c2 <- c(1,0,0,0,0,0,0,0,0,0) # low probability for "1"
    d <- cbind(c1, c2)
    nr <- nrow(d)
    nc <- ncol(d)
    nreps <- 1e4L
    
    bt <- colSums(
      broadcast(
        simplify2array(    # stack the nreps tables into an nr x nc x nreps cube
          r2dtable(
            nreps,
            rep(nc, nr),   # row margins: each row is drawn nc times
            rep(nr, nc)    # column margins: each column is drawn nr times
          )
        ),
        d                  # elementwise product keeps only the draws that hit a 1
      ),
      dims = 2             # sum over rows and columns: one total per replicate
    )/nr/nc                # divide by the nr*nc draws to get relative frequencies
    p <- mean(d)
    mean(d) # proportion of 1s in the full sample ("true" mean)
    #> [1] 0.3
    mean(bt) # estimated mean
    #> [1] 0.300685
    sqrt(p*(1 - p)/nr/nc) # expected SD from uniform samples of size nr*nc
    #> [1] 0.1024695
    sqrt((p*(1 - p) - var(colMeans(d))*(1 - 1/nc))/nr/nc) # expected SD from column-wise resampling
    #> [1] 0.09219544
    sd(bt) # estimated SD from constrained row and column resampling
    #> [1] 0.05604547
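
    To tie this back to the original goal, a percentile confidence interval can be read straight off bt with quantile(bt, c(0.025, 0.975)). For completeness, a sketch of the same statistic without compiled code, at some speed cost, reusing the objects defined above:

    bt2 <- vapply(
      r2dtable(nreps, rep(nc, nr), rep(nr, nc)),
      function(m) sum(d * m), # number of 1s drawn in one replicate
      numeric(1)
    )/nr/nc
    quantile(bt2, c(0.025, 0.975)) # percentile CI under the constrained bootstrap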