I have a data frame with 50 rows and 4 columns. I want to draw many sample data frames of 12 rows each, maybe a million of them, and I do not want any two of the sample data frames to be the same. I have used the following code:
df_l <- list()
for (i in 1:6000000) {
  set.seed(100 + i)
  a <- df[sample(nrow(df), 12, replace = TRUE), ]
  df_l[[i]] <- a
  rownames(df_l[[i]]) <- 1:12
}
But I am not sure this is an efficient way to do it, and I do not know whether any two of the sample data frames are the same.
You can try the code below.
Without replacement:
n <- nrow(df)
df_1 <- replicate(6000000, df[sample(n, 12), ], simplify = FALSE)
With replacement:
n <- nrow(df)
df_1 <- replicate(6000000, df[sample(n, 12, replace = TRUE), ], simplify = FALSE)
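If memory or an explicit duplicate check is a concern, one possible variation is to store only the sampled row indices in a matrix and build each data frame on demand. The following is a minimal sketch, not a tuned implementation: the toy df, n_samples, k, and the helper get_sample are hypothetical names introduced for illustration, and the sample count is kept small so it runs quickly.

```r
# Toy stand-in for the 50 x 4 data frame from the question (hypothetical data).
set.seed(1)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50), d = rnorm(50))

n_samples <- 10000  # kept small here; the question uses millions
k <- 12

# Each row of idx holds the row indices of one sample (with replacement).
idx <- t(replicate(n_samples, sample(nrow(df), k, replace = TRUE)))

# TRUE if any two samples picked the same rows in the same order.
has_dup <- anyDuplicated(idx) > 0

# Drop repeated samples so every remaining index row is unique.
idx <- idx[!duplicated(idx), , drop = FALSE]

# Hypothetical helper: materialise the i-th sample as a data frame.
get_sample <- function(i) {
  out <- df[idx[i, ], ]
  rownames(out) <- NULL  # resets row names to 1..k
  out
}

first <- get_sample(1)
```

This compares samples by their row indices in order. If two samples containing the same rows in a different order should also count as the same, sorting each row of idx before calling duplicated() would be one way to catch that.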
Regarding the concern about identical data frames, it depends on the size of the space you are sampling from. Counting row order, and assuming the 50 rows of df are all distinct, in your case:
If you don't allow replacement, the space size is choose(50,12)*factorial(12), about 5.8e19, which is much larger than 6000000. Thus, the probability of a collision is low.
If you allow replacement, the space size is 50^12, about 2.4e20, which is even larger than in the scenario without replacement. Thus, the probability of a collision would be even lower.
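To put rough numbers on "low", the collision probability can be estimated with the standard birthday-problem approximation. This is a sketch under stated assumptions: every ordered sample is equally likely, the 50 rows are distinct, and p_collision is a hypothetical helper name.

```r
n_samples <- 6000000

# Number of distinct ordered samples of 12 rows out of 50.
space_no_repl <- choose(50, 12) * factorial(12)  # without replacement
space_repl    <- 50^12                           # with replacement

# Birthday approximation: P(at least one collision) ~ 1 - exp(-m(m-1) / (2N)).
# -expm1(-x) computes 1 - exp(-x) accurately when x is tiny.
p_collision <- function(m, N) -expm1(-m * (m - 1) / (2 * N))

p_no_repl <- p_collision(n_samples, space_no_repl)
p_repl    <- p_collision(n_samples, space_repl)
```

Under these assumptions both probabilities come out well below 1e-6. The picture changes if row order is ignored: without replacement the space then shrinks to choose(50, 12), about 1.2e11, and at 6 million samples a collision becomes very likely, so it is worth deciding first what "same" should mean.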