Does the boot package in R use the first return(result) as the observed data to calculate confidence intervals?


I am using the boot function in R to run a bootstrap. Instead of passing my dataset directly as the data argument, I pass an index vector, which the statistic function uses to merge two data tables and produce the result. It seems that boot treats the result of the first call to the statistic as the value for the real sample (the empirical value). Is this correct? I ask because I get similar results when I do the bootstrap manually, although I would have expected boot to use 'data' itself as the original data. I am confused: the CIs make sense, but I would not have expected this to work, except for the reason I have mentioned.

In short, I have an index vector

x <- 1:100

and my function

library(boot)
library(data.table)

myboot <- function(data, indeces) {
  toselect <- data[indeces]  # boot supplies the resampled indices here
  toselect <- as.data.table(toselect)
  # this is where I use the index for the merge (mydataset is defined elsewhere)
  merged <- merge(toselect, mydataset, allow.cartesian = TRUE)
  return(nrow(merged))
}
b <- boot(data = x, statistic = myboot, R = 1000)

The results I get:

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = x, statistic = myboot, R = 1000)

Bootstrap Statistics :
    original      bias    std. error
t1* 397.2477 -0.03669725    11.70803
> boot.ci(b, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = b, type = "bca")

Intervals : 
Level       BCa          
95%   (375.2, 421.1 )  

Solution

  • Yes, you are correct.

    The function used to compute the statistic has the following requirement (according to the help page):

    ... In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample. Further, if predictions are required, then a third argument is required which would be a vector of the random indices used to generate the bootstrap predictions.

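    Since your data are the numbers 1:100, the indices that boot passes as the second argument are drawn from 1:100, so your data[indeces] line returns exactly the same values as indeces. To get the "original" value, boot calls your statistic once with the unresampled indices 1:100, which selects every row of your dataset. That call yields the observed value, and the confidence intervals are built around it.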
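
The behaviour described above can be checked with a small sketch. It is not the asker's code: `lookup` is a hypothetical stand-in for `mydataset`, and `stat`, `first_indices` and `calls` are names made up for the illustration. It assumes a default, non-parallel call to `boot`, where the statistic is evaluated in the current R session.

```r
library(boot)

set.seed(1)
x <- 1:100                       # index vector passed as boot's `data`
lookup <- rnorm(100, mean = 4)   # hypothetical stand-in for the real dataset

calls <- 0
first_indices <- NULL
stat <- function(data, indices) {
  calls <<- calls + 1
  # remember the indices boot passes on its very first call
  if (calls == 1) first_indices <<- indices
  mean(lookup[data[indices]])
}

b <- boot(data = x, statistic = stat, R = 200)

# The first call receives the unresampled indices 1:n ...
identical(first_indices, seq_along(x))
# ... so the "original" column is the statistic on the full data.
all.equal(b$t0, mean(lookup))

# Because the data are 1:n, subsetting by a resample returns the resample itself.
idx <- sample(x, replace = TRUE)
identical(x[idx], idx)
```

If all three checks come back TRUE, the "original" value reported by `boot` is the statistic computed on the whole dataset, not on one of the bootstrap resamples.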