r performance correlation statistics-bootstrap

How to bootstrap correlation using vectorised function applied to large matrix?

I understand how to boostrap using the "boot" package in R, through the PDF for the package and also from these two examples on Stack, Bootstrapped correlation with more than 2 variables in R and Bootstrapped p-value for a correlation coefficient on R.

However, this is for small datasets ( 2 variables or a matrix with 5 variables). I have a very large matrix (1000+ columns) and the code I use to compute the correlation between every metabolite pair (removing duplicate and correlations with the metabolite itself) is:

  x <- colnames(dat)
 GetCor = function(x,y) cor(dat[,x], dat[,y], method="spearman")  
 GetCor = Vectorize(GetCor)


 out <- data.frame(t(combn(x,2)), stringsAsFactors = F) %>%
  mutate(v = GetCor(X1,X2))

I'm not sure how I can then alter this for it to be the function I pass to statistic in boot so

 boot_res<- boot(dat, ?, R=1000)

Or would I just need to obtain a matrix of the bootstrapped p value or estimate depending on function code (colMeans(boot_res$t)) and get rid of the upper or lower triangle?

Was curious to know the most efficient way of going about the problem..

Solution

You may want to use cor.test to get theoretical t-values. We will use them for comparison with the B bootstrap t-values. (Recall: The p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.)

Here is a similar function to yours, but applying cor.test and extracting statistics.

corr_cmb <- \(X, boot=FALSE) {
  stts <- c('estimate', 'statistic', 'p.value')
  cmbn <- combn(colnames(X), 2, simplify=FALSE)
  a <- lapply(cmbn, \(x) as.data.frame(cor.test(X[, x[1]], X[, x[2]])[stts])) |> 
    do.call(what=rbind) |>
    `rownames<-`(sapply(cmbn, paste, collapse=':'))
  if (boot) {
    a <- a[, 'statistic']
  }
  a
}

We run it one times on the data to get a theoretical solution.

rhat <- corr_cmb(dat)

head(rhat, 3)
#          estimate  statistic    p.value
# V1:V2  0.06780426  2.1469547 0.03203729
# V1:V3  0.03471587  1.0973752 0.27274212
# V1:V4  0.05301563  1.6771828 0.09381987

Bootstrap

We can assume from the start that the bootstrap with 1000 columns will run for a while (choose(1000, 2) returns 499500 combinations). That's why we think about a multithreaded solution right away.

To bootstrap we simple repeatedly apply corr_cmb repeatedly on a sample of the data with replications.

We will measure the time to estimate the time needed for 1000 variables.

## setup clusters
library(parallel)
CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('corr_cmb', 'dat'))

t0 <- Sys.time()  ## timestamp before run

B <- 1099L
clusterSetRNGStream(CL, 42)
boot_res <- parSapply(CL, 1:B, \(i) corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))

t1 <- Sys.time()  ## timestamp after run

stopCluster(CL)

After the bootstrap, we calculate the ratios of how many times the absolute bootstrap test statistics exceeded the theoretical ones (Ref.),

boot_p <- rowMeans(abs(boot_res - rowMeans(boot_res)) > abs(rhat$statistic))

and cbind the bootstrap p-values to the theoretical result.

cbind(rhat, boot_p)
#          estimate  statistic    p.value     boot_p
# V1:V2  0.06780426  2.1469547 0.03203729 0.03003003
# V1:V3  0.03471587  1.0973752 0.27274212 0.28028028
# V1:V4  0.05301563  1.6771828 0.09381987 0.08208208
# V1:V5 -0.01018682 -0.3218300 0.74764890 0.73473473
# V2:V3  0.03730133  1.1792122 0.23859474 0.23323323
# V2:V4  0.07203911  2.2817257 0.02271539 0.01201201
# V2:V5  0.03098230  0.9792363 0.32770055 0.30530531
# V3:V4  0.02364486  0.7471768 0.45513283 0.47547548
# V3:V5 -0.02864165 -0.9051937 0.36558126 0.38938939
# V4:V5  0.03415689  1.0796851 0.28054328 0.29329329

Note that the data used is fairly normally distributed. If the data is not normally distributed, the bootstrap p-values will be more different.

To conclude, an estimate of the time needed for your 1000 variables.

d <- as.numeric(difftime(t1, t0, units='mins'))
n_est <- 1000
t_est <- d/(choose(m, 2))*choose(n_est, 2)
cat(sprintf('est. runtime for %s variables: %s mins\n', n_est, round(t_est, 1)))
# est. runtime for 1000 variables: 1485.8 mins

(Perhaps for sake of completeness, a single-threaded version for smaller problems:)

## singlethreaded version
# set.seed(42)
# B <- 1099L
# boot_res <- replicate(B, corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))

Data:

library(MASS)
n <- 1e3; m <- 5
Sigma <- matrix(.5, m, m)
diag(Sigma) <- 1
set.seed(42)
M <- mvrnorm(n, runif(m), Sigma)
M <- M + rnorm(length(M), sd=6)
dat <- as.data.frame(M)