
Bootstrapping with replacement by group, but creating a new identifier for resampled units


I am trying to bootstrap groups from a data table with replacement in R.

This is the data table for example:

library(data.table)
dat = data.table(n = c(1,1,1,2,2,2,2,3,4,4,4,4,4), y = round(rnorm(13, 0, 1), 1))


   n    y
 1: 1 -0.8
 2: 1  0.5
 3: 1 -0.1
 4: 2  0.2
 5: 2 -0.1
 6: 2 -2.7
 7: 2  0.1
 8: 3  0.3
 9: 4 -0.7
10: 4 -0.2
11: 4  1.2
12: 4  1.2
13: 4 -0.1

A bootstrapped sample randomly draws 4 groups of 'n' with replacement, so the result might look something like this (in this realization, groups 1 and 4 were each drawn once, and group 3 was drawn twice):

   n    y
 1: 4 -0.7
 2: 4 -0.2
 3: 4  1.2
 4: 4  1.2
 5: 4 -0.1
 6: 3  0.3
 7: 3  0.3
 8: 1 -0.8
 9: 1  0.5
10: 1 -0.1

My problem is that if I now group by 'n', rows 6 and 7 are treated as the same group, when in reality they are two separate resampled copies of group 3. I want to treat them as different groups, for example by adding a third column that says "this is the SECOND copy drawn from group 3" (e.g. 3.1 and 3.2), or anything else that accomplishes that.


Solution

  • You can do this with a join (there are quite likely other ways as well).

    First we generate a bootstrap sample. It contains two variables: the new group id, bid, and the sampled group, n.

    set.seed(84)
    bootsample = data.table(n=sample(1:4, 4, replace=TRUE), bid=1:4)
    bootsample
    
       n bid
    1: 4   1
    2: 2   2
    3: 4   3
    4: 4   4
    

    Then we merge it back onto the original data table using the shared column n. A sampled group can match many rows in dat, and the same group can appear several times in bootsample, so we pass allow.cartesian=TRUE. In subsequent analyses, group by bid instead of n.

    merge(bootsample, dat, by="n", allow.cartesian=TRUE)
    
        n bid    y
     1: 2   2  1.1
     2: 2   2  2.2
     3: 2   2 -0.8
     4: 2   2 -1.4
     5: 4   1 -1.3
     6: 4   1 -0.4
     7: 4   1 -1.0
     8: 4   1  0.9
     9: 4   1 -0.3
    10: 4   3 -1.3
    11: 4   3 -0.4
    12: 4   3 -1.0
    13: 4   3  0.9
    14: 4   3 -0.3
    15: 4   4 -1.3
    16: 4   4 -0.4
    17: 4   4 -1.0
    18: 4   4  0.9
    19: 4   4 -0.3
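
As a rough sketch of how the merged result could be used, the snippet below draws the group ids from the data instead of hard-coding 1:4, adds a per-group draw counter so that repeated draws get labels such as 3.1 and 3.2, and then summarises by bid. The seed, the columns draw and label, and the summary columns are illustrative assumptions, not part of the original answer. It uses data.table's X[i, on=] join in place of merge().

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dat <- data.table(n = c(1,1,1,2,2,2,2,3,4,4,4,4,4),
                  y = round(rnorm(13, 0, 1), 1))

# Draw as many groups as exist, with replacement.
# Indexing via sample.int() avoids sample()'s special case for a single number.
groups <- unique(dat$n)
k <- length(groups)
bootsample <- data.table(n   = groups[sample.int(k, k, replace = TRUE)],
                         bid = seq_len(k))

# Number the repeated draws of each group and build a label like "3.1", "3.2"
bootsample[, draw := seq_len(.N), by = n]
bootsample[, label := paste(n, draw, sep = ".")]

# For every row of bootsample, pull all matching rows of dat
boot <- dat[bootsample, on = "n", allow.cartesian = TRUE]

# Each resampled copy is now its own group
summary_by_bid <- boot[, .(group = n[1], label = label[1],
                           size = .N, mean_y = mean(y)), by = bid]
print(summary_by_bid)
```

Grouping by bid (or by label) keeps two copies of the same original group apart, whereas grouping by n would pool them.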
    

    A more compact solution might be possible. Note that bootstrapping whole groups can cause problems when the groups are not the same size: the number of rows in the bootstrapped data then changes from one draw to the next, which may or may not matter depending on how you use it.
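
To illustrate the caveat about unequal group sizes, here is a minimal sketch that wraps the draw-and-join step in a helper and records the number of rows across many bootstrap replicates. The helper name boot_groups, the seed, and the number of replicates are assumptions made for this example only; the group sizes (3, 4, 1 and 5) are those of the example data.

```r
library(data.table)

set.seed(123)  # arbitrary seed, only so the sketch is reproducible
dat <- data.table(n = c(1,1,1,2,2,2,2,3,4,4,4,4,4),
                  y = round(rnorm(13, 0, 1), 1))

# Hypothetical helper (not part of data.table): resample whole groups of `n`
# with replacement and give each drawn copy its own id, `bid`.
boot_groups <- function(dt) {
  groups <- unique(dt$n)
  k <- length(groups)
  draw <- data.table(n   = groups[sample.int(k, k, replace = TRUE)],
                     bid = seq_len(k))
  dt[draw, on = "n", allow.cartesian = TRUE]
}

# Row count of 1000 bootstrap replicates
sizes <- replicate(1000, nrow(boot_groups(dat)))

range(sizes)  # smallest and largest replicate
mean(sizes)   # average replicate size, to compare with nrow(dat)
```

With these group sizes a replicate can have anywhere from 4 rows (the single-row group drawn four times) to 20 rows (the five-row group drawn four times), so any statistic that depends on the total number of rows should be interpreted with that in mind.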