Search code examples
rdata.tablelapplysapplysplitstackshape

Incorporating splitstackshape into loop


I have the following code that selects (4 rows of iris x 1000) *100 and calculates the bias of each column.

library(SimDesign)
library(data.table)

do.call(rbind,lapply(1:100, function(x) {
  bias(
    setDT(copy(iris))[as.vector(sapply(1:1000, function(X) sample(1:nrow(iris),4)))][
      , lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
    parameter=c(5,3,2,1), #parameter is the true population value used to calculate bias
    type='relative' #denotes the type of bias being calculated 
  )
}))

This takes 1000 samples of 4 rows, calculates the mean by sample #, giving me 1000 means. The bias for the 1000 means is found for each column, and then is done 99 more times giving me a distribution of bias estimates for each column. This is mimicking a random sampling design. However, I also want to do this for a stratified design. So I use splitstackshape's stratified function.

do.call(rbind,lapply(1:100, function(x) {
  bias(
    setDT(copy(iris))[as.vector(sapply(1:1000, function(X) stratified(iris,group="Species", size=1)))][
      , lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
    parameter=c(5,3,2,1), 
    type='relative'
  )
}))

I would've thought that it is just a matter of swapping out the functions, but I keep on getting errors (i is invalid type (matrix)). Perhaps in future a 2 column matrix could return a list of elements of DT . I think it might be something related to setDT, but I'm not really sure how to fix it. Anybody know where I'm going wrong?


Solution

  • I've split into a couple of functions for you

    Load data.table, SimDesign, and splitstackshape

    library(SimDesign)
    library(data.table)
    library(splitstackshape)
    

    Function to get n stratified samples of size sampsize and return column means of those samples

    get_samples <- function(n, sampsize=4) {
      rbindlist(lapply(1:n, function(x) {
        splitstackshape::stratified(iris, group="Species",sampsize)[, id:=x]
      }))[, lapply(.SD, mean), by=.(Species, id)]
    }
    

    Now, lets get the distribution of bias across y such iterations of these samples

    get_bias_distribution <- function(y=100, samples_per_iter=50, size_per_iter=4) {
      rbindlist(lapply(1:y, function(y) {
        samples = get_samples(samples_per_iter, sampsize=size_per_iter)[, id:=NULL]
        samples[, as.list(bias(
          estimate=.SD,parameter=c(5,3,2,1),type="relative")*100),
          by=.(Species)][, iter:=y]  
      }))
    }
    

    Usage (using defaults)

    get_bias_distribution()    
    

    Output:

            Species Sepal.Length Sepal.Width Petal.Length Petal.Width iter
      1:     setosa    -1.236667    22.61833    -26.70000   -39.69667    1
      2: versicolor    46.476667   -11.99500    115.12833    16.82167    1
      3:  virginica    80.596667    -0.20000    180.21833    53.89000    1
      4:     setosa    -1.513333    20.87000    -27.46167   -38.83667    2
      5: versicolor    45.333333   -11.34833    112.84833    17.84500    2
     ---                                                                  
    296: versicolor    48.250000   -12.26833    113.37000    17.71167   99
    297:  virginica    77.366667    -2.87000    175.60000    53.07167   99
    298:     setosa    -1.005000    22.67500    -27.02833   -39.69500  100
    299: versicolor    47.921667   -10.28333    110.97833    16.86833  100
    300:  virginica    76.153333    -2.44000    174.46167    52.62167  100
    

    Some comments on what was going wrong above

    1. When you call stratified(iris,group="Species", size=1), you will get a 3 row data.table, because you are effectively selecting one row at random from each of the three Species
       Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1:          4.9         3.6          1.4         0.1     setosa
    2:          6.3         2.5          4.9         1.5 versicolor
    3:          7.7         2.8          6.7         2.0  virginica
    
    1. When you wrap this in sapply(1:1000, function(x)...), you get 5 x 1000 column matrix, where each column is contains 5 lists of length 3 .. Below, I'm showing you what this looks like if you did sapply(1:6, function(x)...)
                 [,1]      [,2]      [,3]      [,4]      [,5]      [,6]     
    Sepal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
    Sepal.Width  numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
    Petal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
    Petal.Width  numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
    Species      factor,3  factor,3  factor,3  factor,3  factor,3  factor,3
    

    This is not really what you want, because you cannot then lapply over these the way you then intended. What you want to do instead is use lapply(1:1000, function(x) ...) to create a list of such 3-row datatables, and then bind them together (after adding an id column to each one).