Search code examples
rrandomgroup-bystatistics-bootstrapsample-size

Stratified Bootstrapping in R with >25 strata


I have data with about 25 different groups. In an effort to see how the variance of each group would change if I had different sample sizes I am trying to do stratified bootstraping. For example at sample size 5, it should produce 1000 collections of 5 resampled points for each group. I like to collect smallest sample size as necessary in possible range of 5 to 30 per group.

The problem I am running into is that I have to subset each group and run the bootstrapping on individual groups then copy and past the R output into excel. (I am fairly green at R and how to code). It takes too long. I need to automate the bootstrapping to recognize groups and somehow save a statistic of the collection of 1000 groups, to a dataframe. Does this make sense?

Here is what code I have so far:....

#sample data
set.seed(1234)
df <- data.frame(g.name = as.factor(sample(c(LETTERS),100, replace = T)),
            C.H = as.numeric(sample(c(1:9),100, replace=T)))

#subset data by group... here only a three examples
Agroup=subset(df,C.H=='A')
Bgroup=subset(df,C.H=='B')
Cgroup=subset(df,C.H=='C')

#Bootstrap selecting a sample size of "i", "B" number of times. i.e. I am 
selecting sample sizes from 5 to 30, 1000 times each. I then apply var() to 
the sample, and take the multiple variances(or the variance of the 
variances). C.H is the measurement ranging from 1 to 9.  

B=1000
cult.var=(NULL)
for (i in 5:30){
 boot.samples=matrix(sample(Agroup$C.H,size=B*i, 
replace=TRUE),B,i)
  cult.var[i]=var(apply(boot.samples,1,var))
}
print(cult.var)

This works but it is a lot of copy and paste. I think I need to use either a for loop to do the bootstrapping by group or figure something else out. I did find a way to do a stratified sampling all by itself without bootstrapping. So maybe I could figure out how to repeat that 1000 times somehow...

The example here using the function boot() does not fit my situation. I have fiddled with it a little to no avail. I am not sure how to write functions which may also be why I can not figure it out.


Solution

  • Here's a stab at it...

    # generating data
    set.seed(1234)
    df <- data.frame(g.name = as.factor(sample(c(LETTERS),100, replace = T)),
                     C.H = as.numeric(sample(c(1:9),100, replace=T)))
    
    boot.samples <- with(df, tapply(C.H, g.name, function(x) lapply(5:30, function(i) replicate(1000, sample(x,size=i,replace=T)))))
    
    str(boot.samples$A)
    ## List of 26
    ##  $ : num [1:5, 1:1000] 7 7 3 7 7 7 3 3 2 7 ...
    ##  $ : num [1:6, 1:1000] 7 2 2 2 3 7 7 2 2 7 ...
    ##  $ : num [1:7, 1:1000] 2 3 2 7 2 3 7 2 3 3 ...
    ##  $ : num [1:8, 1:1000] 7 7 3 3 3 7 2 7 7 3 ...
    ##  $ : num [1:9, 1:1000] 2 2 2 7 2 7 3 3 3 7 ...
    ## ...and so on
    
    variances <- lapply(boot.samples, function(y) sapply(y, function(x) apply(x, 2, var)))
        str(variances)
    ## List of 26
    ##  $ A: num [1:1000, 1:26] 3.2 5.8 6.2 3.2 0.3 4.8 5 5.8 6.7 3.2 ...
    ##  $ B: num [1:1000, 1:26] 3.2 0.8 4.7 5.3 5.3 5.3 1.2 4.7 4.2 3.8 ...
    ##  $ C: num [1:1000, 1:26] 9 4.8 2.7 9.8 8.3 9.8 10.2 10.2 9 12.3 ...
    ##  $ D: num [1:1000, 1:26] 8.3 7.5 9.8 3.8 3.5 3.5 5.7 3.7 6.7 3.2 ...
    ## ...and so on
    
    variancesvariances <- lapply(variances, function(x) apply(x, 2, var))
    str(variancesvariances)
    ## List of 26
    ##  $ A: num [1:26] 3.15 2.27 1.53 1.3 1.03 ...
    ##  $ B: num [1:26] 4.32 3.54 2.83 2.46 2.09 ...
    ##  $ C: num [1:26] 13.06 10.08 8.46 6.98 5.59 ...
    ##  $ D: num [1:26] 4.9 3.7 3.02 2.39 2.07 ...
    ## ...and so on
    

    Seems to go down with increasing sample size as advertised ... let's make a pretty picture

    cols <- rainbow(26)
    plot(NA, xlim=c(1,26), ylim=c(0,max(unlist(variancesvariances))))
    for(i in 1:26) {
      lines(variancesvariances[[i]], col=cols[i])
      text(1,variancesvariances[[i]][1],names(variancesvariances)[i],col=cols[i])
    }
    

    Note that this can be converted to a data.frame with as.data.frame(variancesvariances).

    Did I get it this time?