Search code examples
rstatistics-bootstrap

Bootstrap two-way large dataset


Possible Duplicate:
Bootstrap a large data set

I would like to bootstrap a large two-way data set which contains multiple column and row variables. I have to preserve both row and column variables. The result should be a list containing a bootstrap of all column variables for each row variable. I am providing the required code to answer my question but i think it is not elegant. I would appreciate a better and faster code. The following is a simplified re-creation of the two-way data set:

rm(list=ls())

data <- 1:72

Create a two way matrix data:

charDataDiff <- matrix(data, nrow=9,ncol=8)
varNames <- c("A", "B", "C","A", "B", "C","A", "B", "C")

Add a character column to the charDataDiff matrix:

charDataDiff <- cbind(varNames ,data.frame(charDataDiff))

Add column names:

colnames(charDataDiff) <- c("patchId","s380","s390","s400","s410","s420","s430","s440","s450")

Separate the data using the row-variable "patchId" as the criteria. This creates three lists: one for each Variable

idColor <-  c("A", "B", "C")

(patchSpectrum <- lapply(idColor, function(idColor) charDataDiff[charDataDiff$patchId==idColor,]))

Created the function sampleBoot to sample the patchSpectrum

sampleBoot <-  function(nbootstrap=2, patch=3){
    return(lapply(1:nbootstrap, function(i)
             {patchSpectrum[[patch]][sample(1:nrow(patchSpectrum[[patch]]),replace=TRUE),]}))}

The list "k" answers my question. However, I think my code is slow for a large data set and large bootstrap. I am only bootstrapping 10 iteration for three row variables. A faster more elegant code is appreciated.

numBoots <- 10
for (i in 1: numBoots)
        k <- lapply(1:3, function(n)
                    do.call(rbind, lapply(sampleBoot(i, n), function(x) apply(x[-1], 2, median))))
k

Solution

  • What I can provide is a clean rewrite of your code:

    1. I broke up long pieces into appropriately named functions, so it reads better,
    2. I got rid of the varNames column, instead used by to split your data as @KenWilliams had suggested in your previous question,
    3. I made use of replicate instead of lapply,
    4. I got rid of the unnecessary for loop I pointed out in my comment.

    It should run faster that way; if not I'd suggest you try profiling it.


    charDataDiff <- matrix(1:72, nrow = 9, ncol = 8)
    colnames(charDataDiff) <- c("s380", "s390", "s400", "s410",
                                "s420", "s430", "s440", "s450")
    
    varNames      <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
    patchSpectrum <- by(charDataDiff, varNames, data.frame)
    
    sampleOne   <- function(x) x[sample(seq_len(nrow(x)), replace = TRUE), ]
    sampleBoot  <- function(x, n) replicate(n, sampleOne(x), simplify = FALSE)
    applyMedian <- function(l) do.call(rbind, lapply(l, apply, 2, median))
    
    k <- lapply(lapply(patchSpectrum, sampleBoot, n = 10), applyMedian)