Stratified sampling in R when the sample size varies by group

I'm fairly new to R, and I'm stuck on stratified sampling when the sample size differs from group to group.

The data looks like this:

And the sample size varies based on different group or strata:

I tried stratified sampling, but I can't figure out how to pass the sample sizes.

Result <- stratified(Population, c("Loc", "Format"), 
                 Population$SampleSize, replace = FALSE, 
                 keep.rownames = TRUE)

I get an error message saying "size should be entered as a named vector". Could anyone help? Thank you.

Solution

I assume you're using stratified from my "splitstackshape" package.

The error explains what's required: a named vector (something like c(a = 5, b = 10), for example).
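As a minimal sketch of what a named-vector size looks like with a single grouping column, the snippet below draws 5 rows from group "a" and 10 from group "b". The data frame and its column names ("grp", "val") are made up for illustration.

```r
library(splitstackshape)

set.seed(42)

# Hypothetical data: one grouping column with 20 rows per group
df <- data.frame(grp = rep(c("a", "b"), each = 20),
                 val = seq_len(40),
                 stringsAsFactors = FALSE)

# The names must match the values found in the grouping column
sizes <- c(a = 5, b = 10)

samp <- stratified(df, "grp", size = sizes)

# Count the sampled rows per group
table(samp$grp)
```

The names of the vector are matched against the group values, so the order in which the sizes are listed should not matter.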

However, that feature of the function assumes only one variable being used for stratification. To fix this, you can just create a new grouping variable by pasting together your "Loc" and "Format" columns.

Here's a simple example.

Start with some sample data standing in for your original dataset, plus a second dataset that lists the sample size you want for each combination of strata.

library(data.table)       # provides data.table(), := and .N
library(splitstackshape)  # provides stratified()
set.seed(1)
mydf <- data.table(strata1 = sample(letters[1:2], 25, TRUE), 
                   strata2 = sample(c("A", "B"), 25, TRUE), 
                   values = sample(25, replace = TRUE))
head(mydf)
#    strata1 strata2 values
# 1:       a       A     12
# 2:       a       A     22
# 3:       b       A     11
# 4:       b       B      7
# 5:       a       A      2
# 6:       b       A      3

wanted <- data.table(strata1 = c("a", "a", "b", "b"),
                     strata2 = c("A", "B", "A", "B"),
                     count = c(2, 3, 5, 2))
wanted
#    strata1 strata2 count
# 1:       a       A     2
# 2:       a       B     3
# 3:       b       A     5
# 4:       b       B     2

To get the output, add a column called "KEY" that combines the two stratifying columns. You could add it to both datasets, but here I add it to "mydf" by reference and build the matching names for the "wanted" dataset on the fly.

out <- stratified(
  mydf[, KEY := paste(strata1, strata2, sep = "_")], "KEY",
  with(wanted, setNames(count, paste(strata1, strata2, sep = "_"))))
out
#     strata1 strata2 values KEY
#  1:       a       A     21 a_A
#  2:       a       A      2 a_A
#  3:       a       B      9 a_B
#  4:       a       B      3 a_B
#  5:       a       B      9 a_B
#  6:       b       A     17 b_A
#  7:       b       A     12 b_A
#  8:       b       A      3 b_A
#  9:       b       A     17 b_A
# 10:       b       A     13 b_A
# 11:       b       B      8 b_B
# 12:       b       B     20 b_B

Compare the resulting sample sizes by the original stratification variables:

out[, .N, .(strata1, strata2)]
#    strata1 strata2 N
# 1:       a       A 2
# 2:       a       B 3
# 3:       b       A 5
# 4:       b       B 2
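If you would rather not build a combined key, a different route (not what stratified does, just a sketch using plain data.table) is to merge the wanted counts onto the data and sample within each group. The "pop" data below is hypothetical and built so that every combination has 10 rows.

```r
library(data.table)

set.seed(1)

# Hypothetical population: 10 rows for each strata1/strata2 combination
pop <- data.table(strata1 = rep(c("a", "b"), each = 20),
                  strata2 = rep(c("A", "B"), times = 20),
                  values = 1:40)

wanted <- data.table(strata1 = c("a", "a", "b", "b"),
                     strata2 = c("A", "B", "A", "B"),
                     count = c(2, 3, 5, 2))

# Attach the wanted count to every row of its stratum
merged <- merge(pop, wanted, by = c("strata1", "strata2"))

# Within each stratum, pick count[1] rows at random without replacement
out2 <- merged[, .SD[sample(.N, count[1])], by = .(strata1, strata2)]

# Sample sizes per stratum
out2[, .N, by = .(strata1, strata2)]
```

This version fails if a stratum has fewer rows than its requested count, so real data may need a guard such as min(.N, count[1]).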