
Creating multiple training subsets using sample() in R


I have a training dataset of 60,000 observations from which I want to create 9 training subsets. I want to sample randomly without replacement, and I need 3 datasets of 500 observations, 3 datasets of 1,000 observations, and 3 datasets of 2,000 observations.


How can I do this using sample() in R?


Solution

  • Assuming your data.frame is named df, you can do:

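    # Sizes of the 9 training sets: 3 x 500, 3 x 1000, 3 x 2000
    sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
    # Draw all 10,500 row indices in one go, without replacement
    sampling <- sample(nrow(df), sum(sample_sizes))
    # Cut the sampled rows into 9 consecutive groups of the requested sizes
    training_sets <- split(df[sampling, ], rep(1:9, sample_sizes))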

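    This samples without replacement across the whole dataset, so no observation appears in more than one training set. If you instead want sampling without replacement within each training set (but not across training sets, so the same observation may appear in several sets):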

    sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
    # One independent draw per training set; each draw is without replacement
    sampling <- do.call(c, lapply(sample_sizes, function(i) sample(nrow(df), i)))
    training_sets <- split(df[sampling, ], rep(1:9, sample_sizes))
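
To check that the two approaches behave as described, here is a minimal sketch on a synthetic data frame. The `id` and `x` columns, the fixed seed, and the names `disjoint_sets` and `independent_sets` are assumptions made only for this illustration; they are not part of the original data.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Hypothetical stand-in for the real 60,000-row training data
df <- data.frame(id = seq_len(60000), x = rnorm(60000))

sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))

# Variant 1: a single draw, so no row appears in more than one subset
sampling <- sample(nrow(df), sum(sample_sizes))
disjoint_sets <- split(df[sampling, ], rep(seq_along(sample_sizes), sample_sizes))

# Variant 2: independent draws, so different subsets may share rows
independent_sets <- lapply(sample_sizes, function(n) df[sample(nrow(df), n), ])

# Both variants produce subsets of the requested sizes
stopifnot(all(sapply(disjoint_sets, nrow) == sample_sizes))
stopifnot(all(sapply(independent_sets, nrow) == sample_sizes))

# Variant 1: no id is repeated across the nine subsets
all_ids <- unlist(lapply(disjoint_sets, `[[`, "id"))
stopifnot(!anyDuplicated(all_ids))

# Variant 2: no id is repeated within any single subset
stopifnot(all(sapply(independent_sets, function(s) !anyDuplicated(s$id))))
```

In the second variant each subset is a list element holding its own data frame, which avoids the suffixed row names that `df[sampling, ]` produces when the same row index is drawn more than once.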