My dataset contains several groups, and each group can have a different number of unique observations. I carry out some calculations by group (simplified in the code below), resulting in a summary value for each group. Next, for the purpose of a bootstrap, I want to sample groups with replacement and then, within each sampled group, sample observations with replacement.
A simplified version of my data (data1):
id group y
1001 1 10
1002 1 15
1003 1 3
3002 2 24
3003 2 15
3005 2 37
3006 2 32
3007 2 11
4001 3 12
4002 3 15
5006 4 7
5007 4 9
5009 4 22
5010 4 19
E.g. based on the dataset example above: there are 4 groups in the original dataset, so I want to sample 4 groups with replacement (e.g. groups sampled = groups 4,3,3,1), and then sample observations/rows from those 4 groups (4 ids from group 4 (e.g. 5007, 5007, 5006, 5009); 2 ids from group 3 (twice, as group 3 was sampled twice), and 3 ids from group 1, all with replacement), and return the sampled rows together in a dataframe (4+2+2+3 = 11 rows).
For the above, I have some code working for these steps separately, but I cannot seem to combine them:
library(dplyr) # needed for %>%, group_by, sample_frac
# Calculate group value
y.group <- tapply(data1$y, data1$group, mean)
# Step 1. Sample groups, with replacement:
sampled.group <- sample(1:length(unique(data1$group)), replace = TRUE)
# Step 2. Sample within groups, with replacement
data2 <- data.frame(data1 %>%
  group_by(group) %>% # for each group
  sample_frac(1, replace = TRUE) %>%
  ungroup())
Obviously, the code above in full does not do what I want, as in step 2 the sampled groups from step 1 are ignored since it just uses the original group var (I am aware of this). I have tried to solve this using step 1 and trying to generate a new dataframe containing only the sampled groups' observations (with duplicates if a group was sampled more than once, which is likely to happen), and then apply step 2 to that new dataframe, but I cannot get this to work.
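The idea described above (build an expanded data frame from the sampled groups, then resample within it) could be sketched in base R roughly as follows. The column name draw and the objects draws, expanded and rows are made up for this illustration. The key point is that every draw gets its own id, so a group that is sampled twice is resampled independently each time.

```r
# Toy data matching the example in the question
data1 <- data.frame(
  id    = c(1001, 1002, 1003, 3002, 3003, 3005, 3006, 3007,
            4001, 4002, 5006, 5007, 5009, 5010),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  y     = c(10, 15, 3, 24, 15, 37, 32, 11, 12, 15, 7, 9, 22, 19)
)

set.seed(1)  # only for reproducibility

# Step 1: sample group labels with replacement. Sampling positions rather
# than labels avoids sample()'s 1:n shortcut for a single numeric value.
groups  <- unique(data1$group)
sampled <- groups[sample(length(groups), replace = TRUE)]

# Give every draw its own id (hypothetical column name `draw`)
draws <- data.frame(draw = seq_along(sampled), group = sampled)

# Expanded data: all rows of each sampled group, once per draw
expanded <- merge(draws, data1, by = "group")

# Step 2: resample row indices within each draw, keeping the draw's size
rows <- unlist(lapply(
  split(seq_len(nrow(expanded)), expanded$draw),
  function(i) i[sample(length(i), replace = TRUE)]
))
data2 <- expanded[rows, ]
rownames(data2) <- NULL
```

Grouping by draw instead of group in step 2 is what keeps repeated groups separate; grouping by the original group column would pool them.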
I think I am just on the wrong path or overthinking things. Hopefully you can give me some advice on how to proceed.
Edit: While awaiting any potential solutions, I kept working on the problem myself and ended up with:
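total.result <- c()
for (j in 1:length(unique(data1$group))) {
  # Sample one group (this assumes the groups are labelled 1..n)
  sampled.group <- sample(1:length(unique(data1$group)), size = 1, replace = TRUE)
  # Sample as many rows as the group has unique ids (sample_n is from dplyr)
  group.result <- sample_n(data1[data1$group == sampled.group, ],
    size = length(unique(data1$id[data1$group == sampled.group])), replace = TRUE)
  total.result <- rbind(total.result, group.result)
}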
(So basically using a loop to sample the groups one at a time, creating datasets for each, and then sampling individual rows from those, and finally combining the results with rbind)
However, I think Allan Cameron's solution (see below) is more straightforward, so I have accepted that one as the answer to my question.
I think this is what you're looking for. Let's start with your data in a reproducible format:
data1 <- structure(list(id = structure(1:14, .Label = c("1001", "1002",
"1003", "3002", "3003", "3005", "3006", "3007", "4001", "4002",
"5006", "5007", "5009", "5010"), class = "factor"), group = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), y = structure(c(1L, 4L, 8L,
7L, 4L, 10L, 9L, 2L, 3L, 4L, 11L, 12L, 6L, 5L), .Label = c("10",
"11", "12", "15", "19", "22", "24", "3", "32", "37", "7", "9"
), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
And just to make sure:
data1
#> id group y
#> 1 1001 1 10
#> 2 1002 1 15
#> 3 1003 1 3
#> 4 3002 2 24
#> 5 3003 2 15
#> 6 3005 2 37
#> 7 3006 2 32
#> 8 3007 2 11
#> 9 4001 3 12
#> 10 4002 3 15
#> 11 5006 4 7
#> 12 5007 4 9
#> 13 5009 4 22
#> 14 5010 4 19
We start by splitting the data frame by group into smaller data frames, using the split function. This gives us a list with four data frames, each one containing all the members of its respective group. (The set.seed call is there purely to make this example reproducible.)
set.seed(69)
split_dfs <- split(data1, data1$group)
Now we can sample this list, giving us a new list of four data frames drawn with replacement from split_dfs. Each one will again contain all the members of its respective group, though of course some whole groups might be sampled more than once, and other whole groups not sampled at all.
sampled_group_dfs <- split_dfs[sample(length(split_dfs), replace = TRUE)]
Now we can sample within each group by sampling with replacement from the rows of each data frame in our new list. We do this for every data frame in the list by using lapply:
all_sampled <- lapply(sampled_group_dfs, function(x) x[sample(nrow(x), replace = TRUE), ])
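As a quick sanity check (not part of the original answer), something like the following shows which groups were drawn and confirms that each resampled data frame keeps the size of the group it came from. The data frame is rebuilt here with plain numeric columns only so that the snippet runs on its own.

```r
# Toy data matching the example in the question
data1 <- data.frame(
  id    = c(1001, 1002, 1003, 3002, 3003, 3005, 3006, 3007,
            4001, 4002, 5006, 5007, 5009, 5010),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  y     = c(10, 15, 3, 24, 15, 37, 32, 11, 12, 15, 7, 9, 22, 19)
)

set.seed(69)
split_dfs <- split(data1, data1$group)
sampled_group_dfs <- split_dfs[sample(length(split_dfs), replace = TRUE)]
all_sampled <- lapply(sampled_group_dfs,
                      function(x) x[sample(nrow(x), replace = TRUE), ])

# Groups that were drawn, in order (a label may appear more than once)
drawn_groups <- names(sampled_group_dfs)
print(drawn_groups)

# Row counts before and after the within-group resampling should match
sizes_before <- sapply(sampled_group_dfs, nrow)
sizes_after  <- sapply(all_sampled, nrow)
stopifnot(identical(sizes_before, sizes_after))
```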
All that remains is to stick all the resultant dataframes in this list back together to get our result:
result <- do.call(rbind, all_sampled)
As you can see from the final result, it just so happens that each of the four groups was sampled once (this is just by chance; change the seed in set.seed to get different results). However, within the groups there have clearly been some duplicates drawn. In fact, since R mandates unique row names in a data frame, these are easy to pick out by the .1 that has been appended to the duplicate row names. If you don't like this, you can reset the row names with rownames(result) <- seq(nrow(result)):
result
#> id group y
#> 4.14 5010 4 19
#> 4.14.1 5010 4 19
#> 4.11 5006 4 7
#> 4.13 5009 4 22
#> 1.3 1003 1 3
#> 1.3.1 1003 1 3
#> 1.2 1002 1 15
#> 3.9 4001 3 12
#> 3.9.1 4001 3 12
#> 2.5 3003 2 15
#> 2.5.1 3003 2 15
#> 2.6 3005 2 37
#> 2.7 3006 2 32
#> 2.5.2 3003 2 15
Created on 2020-02-15 by the reprex package (v0.3.0)
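Since the stated goal is a bootstrap, the steps above could be wrapped in a function and repeated. This is only a sketch: the function name resample_groups, its group_col argument, the choice of B = 200 and the overall mean of y as the statistic are all placeholders, not something taken from the question. It also assumes y is numeric; the dput output above stores y as a factor, so it would need converting first, e.g. with as.numeric(as.character(data1$y)).

```r
# Toy data matching the example in the question, with numeric y
data1 <- data.frame(
  id    = c(1001, 1002, 1003, 3002, 3003, 3005, 3006, 3007,
            4001, 4002, 5006, 5007, 5009, 5010),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  y     = c(10, 15, 3, 24, 15, 37, 32, 11, 12, 15, 7, 9, 22, 19)
)

# Hypothetical helper: one two-stage resample (groups, then rows in groups)
resample_groups <- function(data, group_col = "group") {
  split_dfs <- split(data, data[[group_col]], drop = TRUE)
  # Stage 1: draw whole groups with replacement
  drawn <- split_dfs[sample(length(split_dfs), replace = TRUE)]
  # Stage 2: draw rows with replacement inside each drawn group
  resampled <- lapply(drawn, function(x) {
    x[sample(nrow(x), replace = TRUE), , drop = FALSE]
  })
  out <- do.call(rbind, resampled)
  rownames(out) <- NULL
  out
}

set.seed(123)  # only for reproducibility
B <- 200       # number of bootstrap replicates (arbitrary)

# Placeholder statistic: the mean of y over all resampled rows
boot_means <- replicate(B, mean(resample_groups(data1)$y))

# Percentile interval for the placeholder statistic
quantile(boot_means, c(0.025, 0.975))
```

Replace the mean(...) call with whatever per-replicate summary the analysis needs. Note that if the summary is computed by group, a group drawn twice appears twice in the output under the same label, so the draws may need their own id to be kept apart.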