Search code examples
rcorrelationstatistics-bootstrap

Bootstrap variables correlation in R


My intention was to write several functions aimed at finding the overall similarity between two covariance matrices, either by multiplying them with random vectors and correlating the response vectors or by bootstrapping one of the matrices to obtain the correlation coefficient distribution that can serve for comparison. But in both cases I got erroneous results. The observed between-matrix correlation was high up to 0.93, but the distribution only ranged up to 0.2 the most. This is the function`s code:

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR)
}  

In fact it is hard for me to belive that correlation coefficient between two original matrices is high, and the one between the first original matrix and the second re-sampled one is maximal 0.2 after 10000 bootstrap repetitions.

Any comments on the validity of the code?


Solution

  • Sorry, I am not enough educated to catch up your goals about checking the correlation efficient between two covariance matrices, but I tried to apprehend your code per se.

    If I am right, you are making up 10.000 different matrices from the same matrix (mat2) by reordering the cells all round, and recomputing the correlation between the covariance matrix of mat1 and the covariance matrix of the resampled array. Those are stored in the statSim variable.

    You said the original correaltion efficient was high (statObs), but the maximum of statSim was low, which is strange. I think the problem is with your result list:

    list( obs = statObs , sumFit = sum(statSim > statObs)/numR)
    

    Where you return the original correaltion coefficient (obs), but not the written maximum with sumFit. There you might use eg. max(statSim). I see the point in returning sumFit for checking if the resampling did any improvement to the correlation efficient, but based on your code, I see no problem about the theory.

    Updated function with max of simulated correlation coefficients:

    resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
    {
      statSim <- numeric(numR)
      mat1vcv <- cov(mat1)
      mat2vcvT <- cov(mat2)
      ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
      ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
      statObs <- cor(ltM1, ltM2T)                           
      indice <- c(1:length(mat2))
      resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
      for (i in 1:numR)
      {
        ss <- mat2[sample(resamplesIndices[[i]])]
        ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
        mat2ss <- cov(ss)
        ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
        statSim[i] <- cor(ltM1, ltM2ss)
      }
      if (graph == TRUE)
      {
        plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
        points(density(statSim), type="l", lwd=2)
        abline(v = statObs)
        text(10, 10, "observed corelation = ")
      }
      list( obs = statObs , sumFit = sum(statSim > statObs)/numR, max=max(statSim))
    }
    

    What I had run:

    > mat1 <- matrix(runif(25),5,5)
    > mat2 <- mat1+0.2
    > resamplerSimAlt(mat1, mat2, 10000)
    $obs
    [1] 1
    
    $sumFit
    [1] 0
    
    $max
    [1] 0.94463
    

    And with random mat2:

    > mat2 <- matrix(runif(25),5,5)
    > resamplerSimAlt(mat1, mat2, 10000)
    $obs
    [1] 0.31144
    
    $sumFit
    [1] 0.9124
    
    $max
    [1] 0.9231
    

    My answer might not be a real answer. If that would be the case, please give more details about the problem :)