Random sampling in R

I am new to R and trying to accomplish a fairly simple task. I have a dataset of 20 observations of 19 variables, and I want to generate three non-overlapping groups of 5 observations each. I am using the slice_sample function from the dplyr package, but how do I repeat the sampling while excluding the observations already picked in the first round?

library("dplyr")
set.seed(123)

NF_1 <- slice_sample(NF, n = 5)

Solution

You can use the sample function from base R.

All you have to do is sample the row indices with replace = FALSE, which guarantees that no row is picked twice, so the groups cannot overlap. The size argument sets how many rows are drawn in total.

n_groups <- 3
observations_per_group <- 5
size <- n_groups * observations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)

# Now index those selected rows
NF_1 <- NF[selected_samples, ]
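For comparison, the round-by-round idea described in the question (draw 5, exclude them, draw again) can also be sketched in base R. This is only an illustration: the NF data frame below is a hypothetical stand-in with 20 rows, and the names remaining, groups and picked are made up for the example.

```r
set.seed(123)

# Hypothetical stand-in for NF: 20 observations (2 columns for brevity).
NF <- data.frame(id = 1:20, value = rnorm(20))

remaining <- seq_len(nrow(NF))   # row indices still available
groups <- vector("list", 3)      # one slot per group

for (i in 1:3) {
    # Draw 5 rows from the pool that is still available
    picked <- sample(remaining, size = 5)
    groups[[i]] <- NF[picked, ]
    # Remove the drawn rows so later rounds cannot pick them again
    remaining <- setdiff(remaining, picked)
}
```

Drawing all 15 indices at once, as above, gives the same kind of result with less bookkeeping, which is why the single call to sample is usually the simpler choice.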

Now, according to your comment, if you want to generate N dataframes, each with a number of samples and also label them accordingly, you can use lapply (which is a function that "applies" a function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions. You can read more about that (and I highly recommend that you do!) here.
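If lapply is new to you, a tiny toy example (unrelated to your data; the names squares_list and squares_vec are just for illustration) shows the difference between it and its sibling sapply:

```r
# lapply calls the function once per input value and always returns a list.
squares_list <- lapply(1:3, function(x) x^2)
str(squares_list)

# sapply does the same work but simplifies the result to a vector when it can.
squares_vec <- sapply(1:3, function(x) x^2)
print(squares_vec)
```

A list is the right container here because each element of the result will be a whole data frame rather than a single number.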

The code below should solve your problem, or at least give you a good idea of where to go.

n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups

# First we'll get the row samples.
selected_samples <- sample(
    seq_len(nrow(NF)),
    size = size,
    replace = FALSE
)

# Now we split them between the number of groups
split_samples <- split(
    selected_samples,
    rep(1:n_groups, observations_per_group)
)

# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.

my_dataframes <- lapply(1:n_groups, function(x) {
    # our subset df will be the original df with the list of samples
    # for group at position "x" (1, 2, 3.., n_groups)
    # (double brackets extract the vector of row indices from the list)
    subset_df <- NF[split_samples[[x]], ]
    return(subset_df)
})

# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements
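If you want to convince yourself that the groups really do not overlap, here is a minimal self-contained sketch of the same approach. It assumes a hypothetical stand-in for NF with an id column added purely so the rows can be tracked; the NF_1, NF_2, NF_3 labels are just one possible naming scheme.

```r
set.seed(123)

# Hypothetical stand-in for NF: 20 observations (2 columns for brevity).
NF <- data.frame(id = 1:20, value = rnorm(20))

n_groups <- 3
observations_per_group <- 5

# Draw all 15 row indices at once; sample() defaults to replace = FALSE
selected_samples <- sample(
    seq_len(nrow(NF)),
    size = n_groups * observations_per_group
)

# Deal the indices out to the groups and build one data frame per group
split_samples <- split(
    selected_samples,
    rep(1:n_groups, observations_per_group)
)
my_dataframes <- lapply(split_samples, function(rows) NF[rows, ])
names(my_dataframes) <- paste0("NF_", seq_len(n_groups))

# Sanity checks: group sizes, and whether any id appears more than once
group_sizes <- sapply(my_dataframes, nrow)
all_ids <- unlist(lapply(my_dataframes, function(df) df$id))
n_duplicates <- anyDuplicated(all_ids)  # 0 means no row is shared
```

Because the list is named, you can also access a group by label, for example my_dataframes[["NF_2"]] or my_dataframes$NF_2.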