r, sampling

Generating a large number of samples in R


I have a data frame with 50 rows and 4 columns. I want to draw many sample data frames of 12 rows each, perhaps a million of them, and I do not want any two of the sample data frames to be the same. I have used the following code:

    df_l <- list()
    for (i in 1:6000000) {
      set.seed(100 + i)
      a <- df[sample(nrow(df), 12, replace = TRUE), ]
      df_l[[i]] <- a
      rownames(df_l[[i]]) <- 1:12
    }

But I suspect this is not an efficient way to do it, and I do not know whether any two of the sample data frames are the same.


Solution

  • You can try the code below:

    # without replacement when sampling
    n <- nrow(df)
    df_1 <- replicate(6000000, df[sample(n, 12), ], simplify = FALSE)

    # with replacement when sampling
    n <- nrow(df)
    df_1 <- replicate(6000000, df[sample(n, 12, replace = TRUE), ], simplify = FALSE)
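
To confirm that no two samples coincide instead of relying on probability alone, one option is to draw the row indices first, check them for duplicates, and only then build the data frames. The following is a minimal sketch, not part of the original answer. It assumes the rows of `df` are all distinct and that two samples count as the same only when they contain the same rows in the same order. The data frame `df`, the small `n_samples`, and the names `idx` and `df_list` are made up for illustration.

```r
set.seed(100)  # seed once, outside any loop

# Hypothetical stand-in for the asker's 50 x 4 data frame
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50), d = rnorm(50))

n_samples <- 1000  # assumed small count so the sketch runs quickly
size <- 12

# One row of `idx` per sample: the 12 row numbers drawn from df
idx <- t(replicate(n_samples, sample(nrow(df), size, replace = TRUE)))

# Redraw any sample whose index vector repeats an earlier one
dup <- duplicated(idx)
while (any(dup)) {
  idx[dup, ] <- t(replicate(sum(dup), sample(nrow(df), size, replace = TRUE)))
  dup <- duplicated(idx)
}

# Build the list of data frames from the unique index vectors
df_list <- lapply(seq_len(n_samples), function(i) {
  out <- df[idx[i, ], ]
  rownames(out) <- NULL  # reset row names to 1..12
  out
})
```

Keeping only the integer matrix `idx` and subsetting `df` on demand may also be lighter on memory than holding millions of small data frames at once.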
    

    Regarding the concern about identical data frames, it depends on the size of the space you are sampling from. In your case:

    • if you don't allow replacement, the number of possible samples (counting the same rows in a different order as a different sample) is choose(50,12)*factorial(12), about 5.8e19, which is far larger than 6000000. Thus, the probability of a collision is low.

    • if you allow replacement, each of the 12 rows can be any of the 50, so the number of possible ordered samples is 50^12, about 2.4e20, which is even larger than in the scenario without replacement. Thus, the probability of a collision would be lower still.
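
The size of the collision risk can be estimated with the standard birthday-problem approximation, 1 - exp(-k(k-1)/(2N)) for k draws from N equally likely outcomes. The sketch below is an illustration under stated assumptions: the 50 rows of the data frame are distinct, samples are ordered, and the with-replacement space is taken as the 50^12 ordered sequences of 12 row indices. The helper name `collision_prob` is hypothetical.

```r
# Approximate probability that at least two of k draws coincide
# when each draw is uniform over N possible outcomes.
collision_prob <- function(k, N) {
  # -expm1(-x) evaluates 1 - exp(-x) accurately when x is tiny
  -expm1(-k * (k - 1) / (2 * N))
}

k <- 6000000

N_without <- choose(50, 12) * factorial(12)  # ordered, no replacement
N_with    <- 50^12                           # ordered, with replacement

p_without <- collision_prob(k, N_without)
p_with    <- collision_prob(k, N_with)

print(c(without_replacement = p_without, with_replacement = p_with))
```

Under these assumptions both probabilities come out well below one in a million. If row order is ignored, the space without replacement shrinks to choose(50,12), and the same formula then gives a very different answer, so it is worth deciding which notion of "same" applies.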