Search code examples
rstatistics-bootstrap

R bootstrapping for the two dataframe individual column wise


Want to do Bootstrapping while comparing two dataframe column wise with the different number of rows.

I have two dataframe in which row represent values from experiments and column with the dataset names (data1, data2, data3, data4)

emp.data1 <- data.frame(
    data1 = c(234,0,34,0,46,0,0,0,2.26,0, 5,8,93,56),
    data2 = c(1.40,1.21,0.83,1.379,2.60,9.06,0.88,1.16,0.64,8.28, 5,8,93,56),
    data3 =c(0,34,43,0,0,56,0,0,0,45,5,8,93,56),
    data4 =c(45,0,545,34,0,35,0,35,0,534, 5,8,93,56),
    stringsAsFactors = FALSE
  )
  
emp.data2 <- data.frame(
    data1 = c(45, 0, 0, 45, 45, 53),
    data2 = c(23, 0, 45, 12, 90, 78),
    data3 = c(72, 45, 756, 78, 763, 98),
    data4 = c(1, 3, 65, 78, 9, 45),
    stringsAsFactors = FALSE
  )

I am trying to do bootstrapping(n=1000). Values are selected at random replacement from emp.data1(14 * 4) without change in the emp.data2(6 * 4). For example from emp.data2 first column (data1) select 6 values colSum and from emp.data1(data1) select 6 random non zero values colSum Divide the values and store in temp repeat the same 1000 times and take a median value et the end. like this i want to do it for each column of the dataframe. sample code I am providing which is working fine but i am not able get the non-zero random values for emp.data1

nboot <- 1e3

boot_temp_emp<- c()
n_data1 <- nrow(emp.data1); n_data2 <- nrow(emp.data2)

for (j in seq_len(nboot)) {
  boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
  value <- colSums(emp.data2)/colSums(emp.data1[boot,])
  boot_temp_emp <- rbind(boot_temp_emp, value)
}
boot_data<- apply(boot_temp_emp, 2, median)

From the above script i am able get the output but each column emp.data1[boot,] data has zero values and taken sum. I want indivisual ramdomly selected non-zero values column sum so I tried below script not able remove zero values. Not able get desired output please some one help me to correct my script

nboot <- 1e3
boot_temp_emp<- c()

for (i in colnames(emp.data2)){
  for (j in seq_len(nboot)){
        data1=emp.data1[i]
        data2=emp.data2[i]
        n_data1 <- nrow(data1); n_data2 <- nrow(data2)
        boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
        value <- colSums(data2[i])/colSums(data1[boot, ,drop = FALSE])
        boot_temp_emp <- rbind(boot_temp_emp, value)
  }
}
boot_data<- apply(boot_temp_emp, 2, median)

Thank you


Solution

  • Here is a solution.
    Write a function to make the code clearer. This function takes the following arguments.

    1. x the input data.frame emp.data1;
    2. s2 the columns sums of emp.data2;
    3. n = 6 the number of vector elements to sample from emp.data1's columns with a default value of 6.

    The create a results matrix, pre-compute the column sums of emp.data2 and call the function in a loop.

    boot_fun <- function(x, s2, n = 6){
      # the loop makes sure ther is no divide by zero
      nrx <- nrow(x)
      repeat{
        i <- sample(nrx, n, replace = TRUE)
        s1 <- colSums(x[i, ])
        if(all(s1 != 0)) break
      }
      s2/s1
    }
    
    set.seed(2022)
    
    nboot <- 1e3
    sums2 <- colSums(emp.data2)
    results <- matrix(nrow = nboot, ncol = ncol(emp.data1))
    
    for(i in seq_len(nboot)){
      results[i, ] <- boot_fun(emp.data1, sums2)
    }
    ratios_medians <- apply(results, 2, median)
    
    old_par <- par(mfrow = c(2, 2))
    for(j in 1:4) {
      main <- paste0("data", j)
      hist(results[, j], main = main, xlab = "ratios", freq = FALSE)
      abline(v = ratios_medians[j], col = "blue", lty = "dashed")
    }
    par(old_par)
    

    Created on 2022-02-24 by the reprex package (v2.0.1)


    Edit

    Following the comments here is a revised version of the bootstrap function. It makes sure there are no zeros in the sampled vectors, before computing their sums.

    boot_fun2 <- function(x, s2, n = 6){
      nrx <- nrow(x)
      ncx <- ncol(x)
      s1 <- numeric(ncx)
      for(j in seq.int(ncx)) {
        repeat{
          i <- sample(nrx, n, replace = TRUE)
          if(all(x[i, j] != 0)) {
            s1[j] <- sum(x[i, j])
            break
          }
        }
      }
      s2/s1
    }
    
    set.seed(2022)
    
    nboot <- 1e3
    sums2 <- colSums(emp.data2)
    results2 <- matrix(nrow = nboot, ncol = ncol(emp.data1))
    
    for(i in seq_len(nboot)){
      results2[i, ] <- boot_fun2(emp.data1, sums2)
    }
    ratios_medians2 <- apply(results2, 2, median)
    
    old_par <- par(mfrow = c(2, 2))
    for(j in 1:4) {
      main <- paste0("data", j)
      hist(results2[, j], main = main, xlab = "ratios", freq = FALSE)
      abline(v = ratios_medians2[j], col = "blue", lty = "dashed")
    }
    par(old_par)
    

    Created on 2022-02-27 by the reprex package (v2.0.1)