
Stratified sampling in R when the sample size varies by group


I'm fairly new to R, and I'm stuck on stratified sampling when the sample size differs from group to group.

The data looks like this:

[image not shown]

The sample size differs for each group (stratum):

[image not shown]

I tried the stratified function, but I can't figure out how to specify the sample size.

Result <- stratified(Population, c("Loc", "Format"),
                     Population$SampleSize, replace = FALSE,
                     keep.rownames = TRUE)

I get an error message saying "size should be entered as a named vector". Could anyone help? Thank you.


Solution

  • I assume you're using stratified from my "splitstackshape" package.

    The error explains what's required: a named vector (something like c(a = 5, b = 10), for example).

    However, that feature of the function assumes that only one variable is used for stratification. To work around this, you can create a new grouping variable by pasting together your "Loc" and "Format" columns.
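
    As a minimal base R sketch of the pasted-key idea: the "sizes" table below and its values are hypothetical (they are not taken from the question), and only the column names "Loc", "Format" and "SampleSize" come from the asker's code.

    ```r
    # Hypothetical lookup table: one row per Loc/Format combination
    sizes <- data.frame(Loc        = c("N", "N", "S"),
                        Format     = c("X", "Y", "X"),
                        SampleSize = c(2, 3, 1))

    # Combine the two stratifying columns into a single key per group
    key <- paste(sizes$Loc, sizes$Format, sep = "_")

    # Named vector: the names are the group keys, the values are the sizes
    size_vec <- setNames(sizes$SampleSize, key)
    size_vec
    ```

    The same paste() call then has to be applied to the data being sampled, so that the values of its key column match the names of this vector.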

    Here's a simple example.

    Start with some sample data standing in for your original dataset, plus a second dataset that lists the sample size you want for each group.

    library(splitstackshape)
    library(data.table)  # attach explicitly so that data.table() is available
    set.seed(1)
    mydf <- data.table(strata1 = sample(letters[1:2], 25, TRUE), 
                       strata2 = sample(c("A", "B"), 25, TRUE), 
                       values = sample(25, replace = TRUE))
    head(mydf)
    #    strata1 strata2 values
    # 1:       a       A     12
    # 2:       a       A     22
    # 3:       b       A     11
    # 4:       b       B      7
    # 5:       a       A      2
    # 6:       b       A      3
    
    wanted <- data.table(strata1 = c("a", "a", "b", "b"),
                         strata2 = c("A", "B", "A", "B"),
                         count = c(2, 3, 5, 2))
    wanted
    #    strata1 strata2 count
    # 1:       a       A     2
    # 2:       a       B     3
    # 3:       b       A     5
    # 4:       b       B     2
    

    To get the output, we'll add a column called "KEY" that combines the two stratifying columns. You could add it to both datasets, but here I add it to "mydf" and build the matching names from the "wanted" dataset on the fly.

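    # Add KEY to mydf by reference and stratify on it; the sizes are passed
    # as a named vector whose names match the KEY values.
    out <- stratified(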
      mydf[, KEY := paste(strata1, strata2, sep = "_")], "KEY",
      with(wanted, setNames(count, paste(strata1, strata2, sep = "_"))))
    out
    #     strata1 strata2 values KEY
    #  1:       a       A     21 a_A
    #  2:       a       A      2 a_A
    #  3:       a       B      9 a_B
    #  4:       a       B      3 a_B
    #  5:       a       B      9 a_B
    #  6:       b       A     17 b_A
    #  7:       b       A     12 b_A
    #  8:       b       A      3 b_A
    #  9:       b       A     17 b_A
    # 10:       b       A     13 b_A
    # 11:       b       B      8 b_B
    # 12:       b       B     20 b_B
    

    Check that the resulting sample sizes, grouped by the original stratification variables, match "wanted":

    out[, .N, .(strata1, strata2)]
    #    strata1 strata2 N
    # 1:       a       A 2
    # 2:       a       B 3
    # 3:       b       A 5
    # 4:       b       B 2
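
    If you would rather not depend on the package, the same idea can be sketched in base R. This is an alternative technique, not what stratified() does internally. The "pop" data below is hypothetical, and the sketch assumes that every group named in "wanted" exists in the data and has at least as many rows as requested.

    ```r
    set.seed(42)

    # Hypothetical population: 5 rows in each of the 4 strata1/strata2 groups
    pop <- data.frame(strata1 = rep(c("a", "b"), each = 10),
                      strata2 = rep(c("A", "B"), times = 10),
                      values  = 1:20)

    # Desired sample size per group (same shape as "wanted" above)
    wanted <- data.frame(strata1 = c("a", "a", "b", "b"),
                         strata2 = c("A", "B", "A", "B"),
                         count   = c(2, 3, 5, 2))

    # Named vector of sizes, keyed by the pasted group label
    sizes <- setNames(wanted$count,
                      paste(wanted$strata1, wanted$strata2, sep = "_"))

    # Row indices of pop, split by the same pasted key
    key <- paste(pop$strata1, pop$strata2, sep = "_")
    rows_by_group <- split(seq_len(nrow(pop)), key)

    # Draw the requested number of rows from each group, without replacement.
    # sample.int() on the group length avoids the sample() pitfall where a
    # single number x is treated as the range 1:x.
    picked <- unlist(lapply(names(sizes), function(k) {
      idx <- rows_by_group[[k]]
      idx[sample.int(length(idx), sizes[[k]])]
    }))

    out <- pop[sort(picked), ]

    # Per-group counts in the sample
    table(paste(out$strata1, out$strata2, sep = "_"))
    ```

    The per-group counts in the final table should equal the "count" column of "wanted", whichever rows happen to be drawn.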