Search code examples
rperformancecorrelationstatistics-bootstrap

How to bootstrap correlation using vectorised function applied to large matrix?


I understand how to boostrap using the "boot" package in R, through the PDF for the package and also from these two examples on Stack, Bootstrapped correlation with more than 2 variables in R and Bootstrapped p-value for a correlation coefficient on R.

However, this is for small datasets ( 2 variables or a matrix with 5 variables). I have a very large matrix (1000+ columns) and the code I use to compute the correlation between every metabolite pair (removing duplicate and correlations with the metabolite itself) is:

  x <- colnames(dat)
 GetCor = function(x,y) cor(dat[,x], dat[,y], method="spearman")  
 GetCor = Vectorize(GetCor)


 out <- data.frame(t(combn(x,2)), stringsAsFactors = F) %>%
  mutate(v = GetCor(X1,X2))

I'm not sure how I can then alter this for it to be the function I pass to statistic in boot so

 boot_res<- boot(dat, ?, R=1000)

Or would I just need to obtain a matrix of the bootstrapped p value or estimate depending on function code (colMeans(boot_res$t)) and get rid of the upper or lower triangle?

Was curious to know the most efficient way of going about the problem..


Solution

  • You may want to use cor.test to get theoretical t-values. We will use them for comparison with the B bootstrap t-values. (Recall: The p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.)

    Here is a similar function to yours, but applying cor.test and extracting statistics.

    corr_cmb <- \(X, boot=FALSE) {
      stts <- c('estimate', 'statistic', 'p.value')
      cmbn <- combn(colnames(X), 2, simplify=FALSE)
      a <- lapply(cmbn, \(x) as.data.frame(cor.test(X[, x[1]], X[, x[2]])[stts])) |> 
        do.call(what=rbind) |>
        `rownames<-`(sapply(cmbn, paste, collapse=':'))
      if (boot) {
        a <- a[, 'statistic']
      }
      a
    }
    

    We run it one times on the data to get a theoretical solution.

    rhat <- corr_cmb(dat)
    
    head(rhat, 3)
    #          estimate  statistic    p.value
    # V1:V2  0.06780426  2.1469547 0.03203729
    # V1:V3  0.03471587  1.0973752 0.27274212
    # V1:V4  0.05301563  1.6771828 0.09381987
    

    Bootstrap

    We can assume from the start that the bootstrap with 1000 columns will run for a while (choose(1000, 2) returns 499500 combinations). That's why we think about a multithreaded solution right away.

    To bootstrap we simple repeatedly apply corr_cmb repeatedly on a sample of the data with replications.

    We will measure the time to estimate the time needed for 1000 variables.

    ## setup clusters
    library(parallel)
    CL <- makeCluster(detectCores() - 1)
    clusterExport(CL, c('corr_cmb', 'dat'))
    
    t0 <- Sys.time()  ## timestamp before run
    
    B <- 1099L
    clusterSetRNGStream(CL, 42)
    boot_res <- parSapply(CL, 1:B, \(i) corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
    
    t1 <- Sys.time()  ## timestamp after run
    
    stopCluster(CL)
    

    After the bootstrap, we calculate the ratios of how many times the absolute bootstrap test statistics exceeded the theoretical ones (Ref.),

    boot_p <- rowMeans(abs(boot_res - rowMeans(boot_res)) > abs(rhat$statistic))
    

    and cbind the bootstrap p-values to the theoretical result.

    cbind(rhat, boot_p)
    #          estimate  statistic    p.value     boot_p
    # V1:V2  0.06780426  2.1469547 0.03203729 0.03003003
    # V1:V3  0.03471587  1.0973752 0.27274212 0.28028028
    # V1:V4  0.05301563  1.6771828 0.09381987 0.08208208
    # V1:V5 -0.01018682 -0.3218300 0.74764890 0.73473473
    # V2:V3  0.03730133  1.1792122 0.23859474 0.23323323
    # V2:V4  0.07203911  2.2817257 0.02271539 0.01201201
    # V2:V5  0.03098230  0.9792363 0.32770055 0.30530531
    # V3:V4  0.02364486  0.7471768 0.45513283 0.47547548
    # V3:V5 -0.02864165 -0.9051937 0.36558126 0.38938939
    # V4:V5  0.03415689  1.0796851 0.28054328 0.29329329
    

    Note that the data used is fairly normally distributed. If the data is not normally distributed, the bootstrap p-values will be more different.

    To conclude, an estimate of the time needed for your 1000 variables.

    d <- as.numeric(difftime(t1, t0, units='mins'))
    n_est <- 1000
    t_est <- d/(choose(m, 2))*choose(n_est, 2)
    cat(sprintf('est. runtime for %s variables: %s mins\n', n_est, round(t_est, 1)))
    # est. runtime for 1000 variables: 1485.8 mins
    

    (Perhaps for sake of completeness, a single-threaded version for smaller problems:)

    ## singlethreaded version
    # set.seed(42)
    # B <- 1099L
    # boot_res <- replicate(B, corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
    

    Data:

    library(MASS)
    n <- 1e3; m <- 5
    Sigma <- matrix(.5, m, m)
    diag(Sigma) <- 1
    set.seed(42)
    M <- mvrnorm(n, runif(m), Sigma)
    M <- M + rnorm(length(M), sd=6)
    dat <- as.data.frame(M)