Uneven observation length during bootstrap


As a relative beginner to R, I am having difficulties. My goal is to bootstrap the individual coefficient of variation (CV) and write the results to a new data frame for further calculations and analysis, e.g. 1000 bootstrapped CVs for each individual, based on their own variation in the data. Here is how far I got before I ran into a problem I cannot solve. I have searched online, including here, but I either could not find a solution or did not recognize it when I saw it, even though one most probably exists somewhere. If so, please point me in that direction.

I have a dataset with repeated observations on several individuals, but the individuals do not all have the same number of observations, as seen in the data below:

Subject.id  Moderate
    1   943
    1   1132
    1   347
    1   1100
    1   1265
    2   1297
    2   888
    2   1005
    2   1211
    2   1338
    2   1238
    2   916
    2   541
    2   613
    2   692
    2   1538
    2   1071
    3   670
    3   864
    3   1189
    3   320

I'm trying to bootstrap the within-individual coefficient of variation using the boot package. My boot function looks like this:

    boot.f <- function(d, i) {
      d2 <- d[i, ]
      return(sqrt(var(d2$moderate)) / mean(d2$moderate))
    }

And it runs perfectly fine like this:

    boot1<-boot(df, boot.f, 1000)

However, when I try and use the strata argument like this:

    boot1<-boot(df, boot.f, 1000, strata=subject.id)

I get the following error message:

    Error in tapply(seq_len(n), as.numeric(strata)) :
      arguments must have same length
    In addition: Warning message:
    In tapply(seq_len(n), as.numeric(strata)) : NAs introduced by coercion

So my question is: how can I tweak my function so that I preserve the within-subject information and end up with output like what I got from the summaryBy function, except a thousand times over?

    summaryBy(moderate~subject_id, data=df, FUN=CV)

   subject.id             moderate.CV
1        2001             0.3831299
2        2002             0.4972260
3        2003             0.5095434
4        2004             0.2730478
5        2005             0.3645640
6        2006             0.3727822
7        2007             0.3858968
8        2008             0.5833114
9        2009             0.5896946
10       2013             0.4247119
11       2014             0.3016552
12       2015             0.4670444
13       2016             0.3995908
14       2018             0.3908963
15       2019             0.3660683
16       2020             0.3373719
17       2022             0.5020418
18       2023             0.3848056
19       2024             0.6410266
20       2025             0.7070671
21       2026             0.3925212
22       2028             0.1879174
23       2029             0.2912984
24       2030             0.3534441
25       2031             0.2238960
26       2032             0.7491192
27       2033             0.5775261

Solution

  • I have no problem running the following:

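    library(boot)
    # path.to.your.data is a placeholder; header = TRUE keeps the column names
    df <- read.table(path.to.your.data, header = TRUE)
    # Statistic: coefficient of variation of the resampled rows
    boot.f <- function(d, i) {
      d2 <- d[i, ]
      return(sqrt(var(d2$moderate)) / mean(d2$moderate))
    }
    boot(df, boot.f, 1000)
    # strata must be a vector with one entry per row, so refer to the column via df$
    boot(df, boot.f, 1000, strata = df$subject.id)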
    

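    Note that strata is given as df$subject.id: boot() does not look up column names inside the data frame, so a bare subject.id refers to whatever object of that name exists in your workspace. Also check your variable names (you switch between upper- and lowercase letters); mine are:

        head(df, 3)
          subject.id moderate
        1          1      943
        2          1     1132
        3          1      347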
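
Note that the stratified call above resamples within each subject but still returns a single pooled CV per replicate, not one CV per subject. A minimal sketch of one way to get 1000 bootstrapped CVs for each individual is to run a separate bootstrap per subject. It assumes lowercase column names subject.id and moderate and uses the sample rows from the question; the names cv.f, boot.cv and boot.cv.df are illustrative only, and the seed is arbitrary:

```r
library(boot)

# Sample rows copied from the question (lowercase column names are an assumption)
df <- data.frame(
  subject.id = rep(1:3, times = c(5, 12, 4)),
  moderate = c(943, 1132, 347, 1100, 1265,
               1297, 888, 1005, 1211, 1338, 1238, 916, 541, 613, 692, 1538, 1071,
               670, 864, 1189, 320)
)

# Statistic: coefficient of variation of a resampled numeric vector
cv.f <- function(x, i) sd(x[i]) / mean(x[i])

set.seed(1)  # arbitrary seed, only so the run is reproducible
R <- 1000

# Split the measurements by subject and bootstrap each subject on its own;
# every column of the result holds the R replicates for one subject
boot.cv <- sapply(split(df$moderate, df$subject.id),
                  function(x) boot(x, cv.f, R = R)$t[, 1])

boot.cv.df <- as.data.frame(boot.cv)
dim(boot.cv.df)  # R rows, one column per subject
```

Each column of boot.cv.df can then be summarised separately, for example with apply(boot.cv.df, 2, quantile, probs = c(0.025, 0.975)) for percentile intervals. With only four or five observations for some subjects, the bootstrap distribution will be coarse, so the intervals should be read with caution.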