I am trying to draw random samples of different sizes from a data frame. For example, the first sample should have only 8 observations, the second can have 10, and the third can have 12.
df[sample(nrow(df), 10), ]
This gives me a fixed 10 observations every time I take a sample.
Ideally, I have 100 observations that should be placed into 3 groups without replacement, and each group can have any number of observations. For example, group 1 has 45 observations, group 2 has 20, and group 3 has 35.
Any help will be appreciated
You could try using replicate:
times_to_sample = 5L
NN = nrow(df)
replicate(times_to_sample, df[sample(NN, sample(5:10, 1L)), ], simplify = FALSE)
This will return a list of length times_to_sample, the i-th element of which will give you a data.frame with the result for the i-th replication.
simplify = FALSE prevents simplify2array from mangling the results into a not-particularly-useful matrix.
You should also consider adding some robustness checks. For example, the code above draws between 5 and 10 rows, but in generalizing this to be from a to b rows, you'll want to ensure a >= 1, a <= b, and b <= nrow(df).
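As a rough illustration of those checks, here is a minimal sketch. The helper name sample_rows and the toy data frame are hypothetical and used only for this example. It also sidesteps a quirk of sample(): when a equals b, sample(a:b, 1) would sample from 1:a rather than return a.

```r
# Hypothetical helper: draw `times` random subsets of `df`, each with a
# size chosen uniformly from a:b.
sample_rows <- function(df, a, b, times) {
  n <- nrow(df)
  # Robustness checks on the requested size range.
  stopifnot(a >= 1, a <= b, b <= n, times >= 1)
  # Index into seq(a, b) instead of calling sample(a:b, 1): sample(x, 1)
  # treats a single number x >= 1 as 1:x, which is wrong when a == b.
  sizes <- seq(a, b)[sample.int(b - a + 1L, times, replace = TRUE)]
  lapply(sizes, function(k) df[sample.int(n, k), , drop = FALSE])
}

df <- data.frame(id = 1:100, x = rnorm(100))  # toy data, for illustration only
res <- sample_rows(df, a = 5, b = 10, times = 5)
sapply(res, nrow)  # five sizes, each between 5 and 10
```

Calling sample_rows(df, 0, 10, 5) or sample_rows(df, 5, 200, 5) stops with an error instead of silently returning something unexpected.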
If times_to_sample is going to be large, it'll be more efficient to get all of the sample sizes from 5:10 up front instead:
idx = sample(5:10, times_to_sample, replace = TRUE)
lapply(idx, function(i) df[sample(NN, i), ])
This is a little less readable, but surely more efficient than repeatedly calling sample(5:10, 1), i.e. drawing only one size at a time and not leveraging vectorization.
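The approaches above draw each sample independently, so the same row can appear in more than one sample. For the case in the question where all 100 observations are split into 3 non-overlapping groups of arbitrary size, one possible sketch is to pick random cut points and then shuffle group labels. The toy data frame and the variable names here are assumptions made for illustration.

```r
set.seed(1)                                   # only to make the illustration reproducible
df <- data.frame(id = 1:100, x = rnorm(100))  # toy data, for illustration only
n <- nrow(df)
k <- 3L                                       # number of groups

# Choose k - 1 distinct cut points in 1:(n - 1); the gaps between
# consecutive cut points are the group sizes, each at least 1.
cuts  <- sort(sample.int(n - 1L, k - 1L))
sizes <- diff(c(0L, cuts, n))

# Build one label per row, shuffle the labels, and split the rows by label.
grp    <- sample(rep(seq_len(k), times = sizes))
groups <- split(df, grp)

sapply(groups, nrow)  # three group sizes that sum to 100
```

To use fixed sizes such as 45, 20 and 35 instead, replace the two lines that compute cuts and sizes with sizes <- c(45, 20, 35).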