Tags: r, plyr, sampling

Sampling a small data frame from a big data frame


I am trying to sample a data frame from a given data frame such that there are enough rows from each level of a variable. This can be achieved by splitting the data frame by level and sampling from each piece. I thought ddply (data frame to data frame) would do it for me. Taking a minimal example:

set.seed(1)
data1 <-data.frame(a=sample(c('B0','B1','B2'),100,replace=TRUE),b=rnorm(100),c=runif(100))
> summary(data1$a)
B0 B1 B2 
30 32 38

When I try to sample 20 rows from each level with

data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))

I get the following error

Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace, : cannot take a sample larger than the population when 'replace = FALSE'

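The error arises because x inside the function passed to ddply is a data frame, not a vector: sample() treats it as a list of its 3 columns (length(x) is 3), so asking for 20 items without replacement exceeds the population.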

Does anyone have an idea how to achieve this sampling? I know one way is to skip ddply and do it in three steps: (1) split the data by level, (2) sample from each piece, and (3) combine the results. But I suspect there must be a more direct way with base or plyr functions.

Thank you for your help...


Solution

  • I think what you want is to subset the data frame passed in x using sample:

    ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
    

    But, of course, you still need to make sure that the sample size for each piece (in this case 20) is no larger than the smallest subset of your data based on the levels of a; otherwise sample() fails with the same error.
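
If some levels might have fewer rows than the requested sample size, one option is to cap the size per piece. The following is a minimal sketch in base R, following the split / sample / combine steps mentioned in the question. The helper name `sample_rows` and its argument `n` are made up for illustration and are not part of plyr or base R.

```r
# Hypothetical helper: draw up to n rows (without replacement) from one piece.
sample_rows <- function(x, n) {
  k <- min(n, nrow(x))                        # never ask for more rows than exist
  x[sample.int(nrow(x), k), , drop = FALSE]   # index rows, keep all columns
}

set.seed(1)
data1 <- data.frame(a = sample(c('B0', 'B1', 'B2'), 100, replace = TRUE),
                    b = rnorm(100), c = runif(100))

# (1) split by level, (2) sample each piece, (3) bind the pieces back together
pieces <- lapply(split(data1, data1$a), sample_rows, n = 20)
data2 <- do.call(rbind, pieces)

table(data2$a)   # at most 20 rows per level
```

Since ddply passes extra arguments on to the function, the same helper should also work as `ddply(data1, .(a), sample_rows, n = 20)` if you prefer to stay with plyr. Whether silently returning fewer rows for a small level is acceptable depends on your use case; stopping with an error may be the safer choice.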