r sampling

How to sample with varying sample sizes in R?


I am trying to draw random samples of different sizes from a data frame. For example, the first sample should have only 8 observations, the second can have 10, and the third can have 12.

df[sample(nrow(df),10 ), ]

This always gives me exactly 10 observations when I take a sample.

Ideally, I have 100 observations that should be placed into 3 groups without replacement, and each group can have any number of observations. For example, group 1 has 45 observations, group 2 has 20, and group 3 has 35.

Any help will be appreciated.


Solution

  • You could try using replicate:

    times_to_sample = 5L
    NN = nrow(df)
    replicate(times_to_sample, df[sample(NN, sample(5:10, 1L)), ], simplify = FALSE)
    

    This will return a list of length times_to_sample, the ith element of which will give you a data.frame with the result for the ith replication.

    simplify=FALSE prevents simplify2array from mangling the results into a not-particularly-useful matrix.

    You should also consider adding some robustness checks. For example, the code above draws between 5 and 10 rows, but in generalizing this to between a and b rows, you'll want to ensure a >= 1, a <= b, and b <= nrow(df).
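
    A minimal sketch of such a check, assuming a hypothetical helper named sample_rows (not part of the original answer) and a made-up 100-row df standing in for your data:

    ```r
    # Hypothetical helper: draw one sample whose size is chosen uniformly
    # from a:b, after validating the bounds against the data frame.
    sample_rows <- function(df, a, b) {
      n <- nrow(df)
      stopifnot(a >= 1, a <= b, b <= n)
      sizes <- seq.int(a, b)
      # Index into `sizes` instead of calling sample(sizes, 1): when a == b,
      # sample(a, 1) would draw from 1:a rather than return a.
      k <- sizes[sample.int(length(sizes), 1L)]
      df[sample.int(n, k), , drop = FALSE]
    }

    # Made-up stand-in for the real data
    df <- data.frame(id = 1:100, x = rnorm(100))
    res <- sample_rows(df, 8, 12)
    nrow(res)  # some value between 8 and 12
    ```

    Calling sample_rows(df, 0, 5) or sample_rows(df, 5, 500) stops with an error instead of silently returning an unexpected result.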

    If times_to_sample is going to be large, it'll be more efficient to draw all of the sample sizes from 5:10 up front instead:

    idx = sample(5:10, times_to_sample, replace = TRUE)
    lapply(idx, function(i) df[sample(NN, i), ])
    

    This is a little less readable, but more efficient than repeatedly calling sample(5:10, 1), which draws only one size at a time and does not take advantage of vectorization.
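
    The samples above are drawn independently, so the same row can appear in more than one of them. For the "ideal case" in the question (every row placed in exactly one of 3 groups), one possible approach is to shuffle a vector of group labels and split on it. This is a sketch under assumptions: df is a made-up 100-row stand-in, and sizes uses the 45/20/35 split from the question.

    ```r
    # Made-up stand-in for the real data
    df <- data.frame(id = 1:100)

    # Group sizes from the question; they must add up to nrow(df)
    sizes <- c(45, 20, 35)
    stopifnot(sum(sizes) == nrow(df))

    # Repeat each group label as many times as its size, then shuffle.
    # Each row gets exactly one label, so no row lands in two groups.
    grp <- sample(rep(seq_along(sizes), times = sizes))
    groups <- split(df, grp)
    sapply(groups, nrow)

    # If the sizes themselves should be random, pick 2 distinct cut points
    # in 1:(n - 1); the gaps between them give 3 positive sizes summing to n.
    n <- nrow(df)
    cuts <- sort(sample.int(n - 1, 2))
    random_sizes <- diff(c(0, cuts, n))
    ```

    Passing random_sizes in place of sizes gives groups of arbitrary size that still cover every row exactly once.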