I'm fairly new to R, and I'm stuck on stratified sampling when the sample size differs by group.
The data looks like this:
The sample size varies by group (stratum):
I used stratified sampling, but can't figure out how to specify the sample sizes.
Result <- stratified(Population, c("Loc", "Format"),
                     Population$SampleSize, replace = FALSE,
                     keep.rownames = TRUE)
I get an error message saying "size should be entered as a named vector". Could anyone help? Thank you.
I assume you're using stratified from my "splitstackshape" package.
The error explains what's required: a named vector (for example, c(a = 5, b = 10)).
However, that feature of the function assumes only one variable being used for stratification. To fix this, you can just create a new grouping variable by pasting together your "Loc" and "Format" columns.
Here's a simple example.
Start with some sample data standing in for your original dataset, plus a second dataset that lists the sample size you want for each combination of strata.
library(splitstackshape)
library(data.table)  # attach explicitly so data.table() and := are available
set.seed(1)
mydf <- data.table(strata1 = sample(letters[1:2], 25, TRUE),
                   strata2 = sample(c("A", "B"), 25, TRUE),
                   values = sample(25, replace = TRUE))
head(mydf)
#    strata1 strata2 values
# 1:       a       A     12
# 2:       a       A     22
# 3:       b       A     11
# 4:       b       B      7
# 5:       a       A      2
# 6:       b       A      3
wanted <- data.table(strata1 = c("a", "a", "b", "b"),
                     strata2 = c("A", "B", "A", "B"),
                     count = c(2, 3, 5, 2))
wanted
#    strata1 strata2 count
# 1:       a       A     2
# 2:       a       B     3
# 3:       b       A     5
# 4:       b       B     2
To get the output, we'll add a column called "KEY" combining the two stratifying columns. You can do that to both of the datasets, but I simply did it on the fly with the "wanted" dataset.
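out <- stratified(
  # add the combined key to mydf (by reference) and stratify on it
  mydf[, KEY := paste(strata1, strata2, sep = "_")], "KEY",
  # named vector of sizes; the names use the same "strata1_strata2" pattern
  with(wanted, setNames(count, paste(strata1, strata2, sep = "_"))))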
out
#     strata1 strata2 values KEY
#  1:       a       A     21 a_A
#  2:       a       A      2 a_A
#  3:       a       B      9 a_B
#  4:       a       B      3 a_B
#  5:       a       B      9 a_B
#  6:       b       A     17 b_A
#  7:       b       A     12 b_A
#  8:       b       A      3 b_A
#  9:       b       A     17 b_A
# 10:       b       A     13 b_A
# 11:       b       B      8 b_B
# 12:       b       B     20 b_B
Compare the resulting sample sizes by the original stratification variables:
out[, .N, .(strata1, strata2)]
#    strata1 strata2 N
# 1:       a       A 2
# 2:       a       B 3
# 3:       b       A 5
# 4:       b       B 2
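The same idea should carry over to the columns named in the question. The following is only a sketch: the toy "Population" data is invented for illustration, and it assumes that "SampleSize" holds the same value on every row of a given "Loc"/"Format" combination, which the question does not confirm.

```r
library(data.table)
library(splitstackshape)

set.seed(42)

# Hypothetical stand-in for the asker's data (values are made up).
Population <- data.table(
  Loc    = rep(c("North", "South"), each = 20),
  Format = rep(c("Large", "Small"), times = 20),
  Value  = seq_len(40)
)

# Assumed layout: the wanted sample size is repeated on every row of a stratum.
Population[, SampleSize := c(3, 2, 4, 1)[.GRP], by = .(Loc, Format)]

# Combine the two stratifying columns into a single key.
Population[, KEY := paste(Loc, Format, sep = "_")]

# Reduce to one row per stratum, then build the named vector of sizes.
per_stratum <- unique(Population[, .(KEY, SampleSize)])
size_vec <- setNames(per_stratum$SampleSize, per_stratum$KEY)

Result <- stratified(Population, "KEY", size_vec, replace = FALSE)

# Number of sampled rows per original stratum.
Result[, .N, by = .(Loc, Format)]
```

If the assumption holds, the counts in the last table should match the "SampleSize" value of each stratum. If "SampleSize" is not constant within a stratum, per_stratum will contain more than one row per key and the sizes would need to be defined some other way first.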