I have a dataframe of the form shown below. The cases have been pre-clustered into subgroups of varying populations, including singletons. I am trying to write some code that will sample (without replacement) any specified number of rows from the dataframe, but spread as evenly as possible across clusters.
> testdata
   Cluster Name
1        1    A
2        1    B
3        1    C
4        2    D
5        3    E
6        3    F
7        3    G
8        3    H
9        4    I
10       5    J
11       5    K
12       5    L
13       5    M
14       5    N
15       6    O
16       7    P
17       7    Q
For example, if I ask for a sample of 3 rows, I would like to pull one random row from each of 3 randomly chosen clusters (i.e. not the first rows of clusters 1-3 every time, though that is one valid outcome).
Acceptable examples:
> testdata_subset
   Cluster Name
1        1    A
5        3    E
12       5    L
> testdata_subset
   Cluster Name
6        3    F
14       5    N
15       6    O
Incorrect example:
> testdata_subset
   Cluster Name
6        3    F
8        3    H
13       5    M
The same idea applies up to a sample size of 7 in the example data shown (1 per cluster). For higher sample sizes, I would like to draw from each cluster evenly as far as possible, then evenly across the remaining clusters with unsampled rows, and so on, until the specified number of rows has been sampled.
I know how to sample N rows indiscriminately:
testdata[sample(nrow(testdata), N),]
But this pays no regard to the clusters. I also used plyr to randomly sample N rows per cluster:
ddply(testdata,"Cluster", function(z) z[sample(nrow(z), N),])
But this fails as soon as you ask for more rows than there are in a cluster (i.e. if N > 1). I then added an if/else statement to begin to handle that:
numsamp_per_cluster <- 2
ddply(testdata,"Cluster", function(z) if (numsamp_per_cluster > nrow(z)){z[sample(nrow(z), nrow(z)),]} else {z[sample(nrow(z), numsamp_per_cluster),]})
This effectively caps the sample size asked for at the size of each cluster. But in doing so, it loses control of the overall sample size. I am hoping (but starting to doubt) there is an elegant method using dplyr or a similar package that can do this kind of semi-randomised sampling. Either way, I am struggling to tie these elements together and solve the problem.
The strategy: First, randomly assign an order to the rows inside each cluster; this value is stored in the inside variable below. Next, randomly order the first choices of all clusters, then the second choices, and so on (the outside variable). Finally, order the dataframe so that the first choices of each cluster come first, then the second choices, and so on, breaking ties with the outside variable. Something like this:
set.seed(1)
inside<-ave(seq_along(testdata$Cluster),testdata$Cluster,FUN=function(x) sample(length(x)))
outside<-ave(inside,inside,FUN=function(x) sample(seq_along(x)))
testdata[order(inside,outside),]
# Cluster Name
#10 5 J
#15 6 O
#4 2 D
#5 3 E
#9 4 I
#16 7 P
#1 1 A
#13 5 M
#3 1 C
#17 7 Q
#7 3 G
#6 3 F
#14 5 N
#2 1 B
#12 5 L
#8 3 H
#11 5 K
Now, selecting the first n rows of the resulting data.frame gives you the sample you are looking for.
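If you need this more than once, the steps above could be wrapped in a small function. The following is only a sketch: the name sample_across_clusters and its cluster argument are made up for illustration, and it assumes the grouping column has no missing values and that n is at most nrow(df).

```r
# Test data from the question.
testdata <- data.frame(
  Cluster = c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 6, 7, 7),
  Name = LETTERS[1:17]
)

# Hypothetical helper (not from any package): sample n rows without
# replacement, spread as evenly as possible across clusters.
sample_across_clusters <- function(df, n, cluster = "Cluster") {
  stopifnot(n >= 0, n <= nrow(df))
  g <- df[[cluster]]
  # Random rank of each row within its own cluster
  # (1 = first pick from that cluster, 2 = second pick, ...).
  inside <- ave(seq_along(g), g, FUN = function(x) sample(length(x)))
  # Random order among rows sharing the same rank, used to break ties.
  outside <- ave(inside, inside, FUN = function(x) sample(length(x)))
  # Sort by rank, then tie-breaker, and keep the first n rows.
  df[order(inside, outside)[seq_len(n)], , drop = FALSE]
}

set.seed(1)
s3  <- sample_across_clusters(testdata, 3)   # one row from each of 3 clusters
s10 <- sample_across_clusters(testdata, 10)  # every cluster once, then 3 second picks
table(s10$Cluster)                           # counts per cluster in the sample
```

With the example data, any n up to 7 yields at most one row per cluster. For n = 10, every cluster contributes one row and the remaining three rows come from three of the four clusters that still have unsampled rows (1, 3, 5 and 7).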