
Random sampling R


I am new to R and trying to accomplish a fairly simple task. I have a dataset of 20 observations of 19 variables, and I want to generate three non-overlapping groups of 5 observations each. I am using the slice_sample function from the dplyr package, but how do I repeat the sampling while excluding the observations already picked in the first round?

library("dplyr")
set.seed(123)

NF_1 <- slice_sample(NF, n = 5)


Solution

  • You can use the sample function from base R.

    All you have to do is sample the row indices with replace = FALSE, which means no row can be drawn more than once, so the groups cannot overlap. The size argument sets the total number of rows drawn.

    n_groups <- 3
    observations_per_group <- 5
    size <- n_groups * observations_per_group
    selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)
    
    # Now index those selected rows
    NF_1 <- NF[selected_samples, ]
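
    As a quick sanity check on the claim above, here is a minimal sketch on a made-up data frame (`toy` and its `id` column are hypothetical stand-ins for NF, not part of the original question). It only shows that indices drawn with replace = FALSE contain no duplicates.

    ```r
    set.seed(123)  # any seed works; it is only here for reproducibility

    # Hypothetical stand-in for NF: 20 observations (one column, for brevity)
    toy <- data.frame(id = 1:20)

    # Draw 15 distinct row indices out of 20
    idx <- sample(seq_len(nrow(toy)), size = 15, replace = FALSE)

    # anyDuplicated() returns 0 when no value appears more than once
    anyDuplicated(idx) == 0
    ```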
    

    Now, according to your comment, if you want to generate N dataframes, each with a number of samples and also label them accordingly, you can use lapply (which is a function that "applies" a function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions. You can read more about that (and I highly recommend that you do!) here.

    This code should solve your problem, or at least give you a good idea of where to go.

    n_groups <- 3
    observations_per_group <- 5
    size <- observations_per_group * n_groups
    
    # First we'll get the row samples.
    selected_samples <- sample(
        seq_len(nrow(NF)),
        size = size,
        replace = FALSE
    )
    
    # Now we split them between the number of groups
    split_samples <- split(
        selected_samples,
        rep(1:n_groups, observations_per_group)
    )
    
    # For each group (1 to n_groups) we'll define a dataframe with samples
    # and store them sequentially in a list.
    
    my_dataframes <- lapply(1:n_groups, function(x) {
        # our subset df will be the original df with the list of samples
        # for group at position "x" (1, 2, 3.., n_groups)
        # (double brackets extract the index vector itself, not a one-element list)
        subset_df <- NF[split_samples[[x]], ]
        return(subset_df)
    })
    
    # now, if you need to access the results, you can simply do:
    first_df <- my_dataframes[[1]] # use double brackets to access list elements
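
Since the question starts from dplyr's slice_sample, the same idea can probably be sketched with dplyr as well: draw all 15 rows in one call (slice_sample samples without replacement by default), then label consecutive blocks of 5 as groups. This is only a sketch under assumptions: `NF_toy` is a made-up stand-in for NF, and the `row_id` and `group` columns are helpers introduced here for illustration.

```r
library(dplyr)

set.seed(123)

# Hypothetical stand-in for NF: 20 observations of 19 variables
NF_toy <- as.data.frame(matrix(rnorm(20 * 19), nrow = 20, ncol = 19))
NF_toy$row_id <- seq_len(nrow(NF_toy))  # extra helper column, used only to check overlap

n_groups <- 3
observations_per_group <- 5

# Draw all rows at once, then assign rows 1-5 to group 1, 6-10 to group 2, and so on
sampled <- NF_toy %>%
  slice_sample(n = n_groups * observations_per_group) %>%
  mutate(group = rep(seq_len(n_groups), each = observations_per_group))

# One data frame per group, stored in a list
groups <- split(sampled, sampled$group)
NF_1 <- groups[[1]]
```

Because every row is drawn in a single call, there is no need to track which observations were already used; `groups[[2]]` and `groups[[3]]` hold the other two groups.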