
Help me improve my bootstrap


Consider the following code:

require(Hmisc)
num.boots <- 10
data <- rchisq(500, df = 5) #generate fake data

#create bins
binx <- cut(data, breaks = 10)
binx <- levels(binx)
binx <- sub("^.*\\,", "", binx)
binx <- as.numeric(substr(binx, 1, nchar(binx) - 1))

#pre-allocate a matrix to be filled with samples
output <- matrix(NA, nrow = num.boots, ncol = length(binx)) 

#resample the vector with replacement and calculate the proportion
# of values that fall into each bin
for (i in 1:num.boots) {
    walk.pair.sample <- sample(data, size = length(data), replace = TRUE)
    data.cut <- cut2(x = walk.pair.sample, cuts = binx)
    data.cut <- table(data.cut)/sum(table(data.cut))
    output[i, ] <- data.cut
}

#do some plotting
plot(1:ncol(output), seq(0, max(output), length.out = ncol(output)), type = "n", xlab = "", ylab = "")

#one line per bootstrap sample (rows), drawn across all bins (columns)
for (i in 1:nrow(output)) {
    lines(1:ncol(output), output[i, ])
}

#mean values by columns
output.mean <- apply(output, 2, mean)
lines(output.mean, col="red", lwd = 3)
legend(x = 8, y = 0.25, legend = "mean", col = "red", lty = "solid", lwd = 3)

I was wondering whether I can pass boot::boot() a statistic function that returns a vector of length n > 1. Is that possible at all?

Here are my feeble attempts, but I must be doing something wrong.

require(boot)
bootstrapDistances <- function(data, binx) {
    data.cut <- cut2(x = data, cuts = binx)
    data.cut <- table(data.cut)/sum(table(data.cut))
    return(data.cut)
}

> x <- boot(data = data, statistic = bootstrapDistances, R = 100)
Error in cut.default(x, k2) : 'breaks' are not unique

I don't really understand why Hmisc::cut2() isn't working properly in the boot() call, but works when I call it in a for() loop (see code above). Is the logic of my bootstrapDistances() function feasible with boot()? Any pointers much appreciated.

.:EDIT:.

Aniko suggested that I modify my function to take an index argument. It wasn't clear to me from the documentation for boot() how this works, which explains why my function wasn't working. Here's the new function Aniko suggested:

bootstrapDistances2 <- function(data, idx, binx) { 
  data.cut <- cut2(x = data[idx], cuts = binx) 
  data.cut <- table(data.cut)/sum(table(data.cut)) 
  return(data.cut) 
} 

However, I managed to produce an error and I'm still working out how to remove it.

> x <- boot(data = data, statistic = bootstrapDistances2, R = 100, binx = binx)
Error in t.star[r, ] <- statistic(data, i[r, ], ...) : 
  number of items to replace is not a multiple of replacement length

After I restarted my R session (I also tried another version, 2.10.1), it seems to work fine.


Solution

  • From the help-file for the boot function:

    In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample.

    So you need to add a second parameter to your bootstrapDistances function that will tell it which elements of the data are selected:

    bootstrapDistances2 <- function(data, idx, binx) { 
      data.cut <- cut2(x = data[idx], cuts = binx) 
      data.cut <- table(data.cut)/sum(table(data.cut)) 
      return(data.cut) 
    } 
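
To see why the first attempt failed, it can help to look at what boot() actually hands to the statistic. The sketch below uses a hypothetical spy() statistic and a small made-up sample (neither comes from the question) to record the second argument on every call. With the default stype = "i" that argument is an index vector, so in bootstrapDistances(data, binx) the indices land in the binx slot and get used as cut points. Resampled indices contain repeats, which is presumably what triggers the "'breaks' are not unique" error.

```r
library(boot)

set.seed(42)                 # arbitrary seed so the run is repeatable
toy <- rnorm(20)             # small made-up sample, for illustration only

seen <- list()               # collects the second argument of every call

# Hypothetical statistic: records its second argument, then returns the
# mean of the resampled values so that boot() has a number to store.
spy <- function(data, second) {
  seen[[length(seen) + 1]] <<- second
  mean(data[second])
}

b <- boot(data = toy, statistic = spy, R = 3)

seen[[1]]                     # first call: indices of the original data, in order
seen[[2]]                     # later calls: resampled indices
anyDuplicated(seen[[2]]) > 0  # TRUE when the resample contains repeated indices
```

Any further arguments, such as binx, have to come after the index argument and be passed to boot() by name.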
    

    And the results:

    x <- boot(data = data, statistic = bootstrapDistances2, R = 100, binx=binx)
    x
    
    ORDINARY NONPARAMETRIC BOOTSTRAP
    
    
    Call:
    boot(data = data, statistic = bootstrapDistances2, R = 100, binx = binx)
    
    
    Bootstrap Statistics :
         original   bias    std. error
    t1*     0.208  0.00134 0.017342783
    t2*     0.322  0.00062 0.021700803
    t3*     0.190 -0.00034 0.018873433
    t4*     0.136 -0.00116 0.016206197
    t5*     0.078 -0.00120 0.011413265
    t6*     0.036  0.00070 0.008510837
    t7*     0.016  0.00074 0.005816417
    t8*     0.006  0.00024 0.003654581
    t9*     0.000  0.00000 0.000000000
    t10*    0.008 -0.00094 0.003368961
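
Because the statistic returns one value per bin, boot() stores the replicates as a matrix in x$t, with one row per resample and one column per bin, and keeps the estimate from the original data in x$t0. The sketch below shows how those pieces could be pulled out and summarised. It swaps Hmisc::cut2() for base R's findInterval() and tabulate() so that it runs without Hmisc and always returns the same number of bins. The names binProportions and breaks, the equal-width bins, and the seed are illustrative assumptions, not part of the original answer.

```r
library(boot)

set.seed(1)                      # arbitrary seed, for reproducibility only
data <- rchisq(500, df = 5)      # same kind of fake data as in the question

# Assumption: 10 equal-width bins spanning the observed range
breaks <- seq(min(data), max(data), length.out = 11)

# Hypothetical helper: proportion of the resampled values in each fixed bin.
# tabulate() with nbins returns a vector of constant length, even when a
# bin happens to be empty in a particular resample.
binProportions <- function(data, idx, breaks) {
  bin <- findInterval(data[idx], breaks, rightmost.closed = TRUE)
  tabulate(bin, nbins = length(breaks) - 1) / length(idx)
}

x <- boot(data = data, statistic = binProportions, R = 100, breaks = breaks)

dim(x$t)                         # R rows (resamples) by 10 columns (bins)
x$t0                             # proportions computed on the original data

bin.mean <- colMeans(x$t)        # bootstrap mean per bin
bin.sd   <- apply(x$t, 2, sd)    # bootstrap standard deviation per bin
```

The rows of x$t can then be drawn with lines() in the same way as the rows of output in the question, with bin.mean playing the role of output.mean.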