
Randomly sample rows from a data frame, with the number of sampled rows increasing at each step, in R


I have a list in which each row is a different record of one of several species (a species may repeat across the list). Each species belongs to a given dataset (no species is repeated within the same dataset).

I need to randomly sample different records (rows), but I want the number of samples to change with the step number.

In the reproducible example (below), I would like:
step 1: 1 random sample (row),
step 2: 2 random samples (rows) from different datasets
...
step 11: 11 random samples (rows) from different datasets.

#Example:
x1 <- matrix(rnorm(200), nrow= 100, ncol=2)
x2 <- c(replicate(5, "AA"),replicate(15, "BB"),replicate(15, "CC"),
        replicate(10, "DD"),replicate(10, "EE"),replicate(10, "FF"),
        replicate(10, "GG"),replicate(5, "HH"),replicate(5, "II"),
        replicate(15, "JJ"))
df <- data.frame(cbind(x1,x2))
colnames(df) <- c("variable1", "variable2","dataset")

The only thing I have tried is below, but it is still not what I want, because it samples only according to the dataset:

install.packages("sampling")
library(sampling)

ob <- strata(df, "dataset", size = c(1:100), method = "srswr")

Any thoughts, please?


Solution

  • If I understand you correctly, you want something like this. Note that it ensures that at step n, n rows are selected from n different datasets; if that is not what you want, I can adjust:

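    library(data.table)
    setDT(df)  # convert df to a data.table by reference
    
    # \(i) is the anonymous-function shorthand (needs R >= 4.1)
    lapply(1:5, \(i) {
      # pick i distinct datasets for step i
      ds = sample(unique(df$dataset),i)
      # keep only those datasets, then draw one random row from each
      df[dataset %chin% ds, .SD[sample(.N,1)], dataset]
    })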
    

    Output:

    [[1]]
       dataset         variable1          variable2
    1:      GG 0.891759430683143 -0.973274707214832
    
    [[2]]
       dataset          variable1          variable2
    1:      FF -0.187478493738627 -0.643696750490574
    2:      GG  0.776141815765327 -0.825979276855279
    
    [[3]]
       dataset         variable1         variable2
    1:      BB 0.251972001607678  1.19219655379958
    2:      CC  1.48277044726544  1.43059055432907
    3:      II 0.621661527125061 -1.29864843731135
    
    [[4]]
       dataset          variable1          variable2
    1:      CC  0.521363736653211  0.512012278191707
    2:      FF -0.946818003900703  -0.73084715717486
    3:      GG  0.891759430683143 -0.973274707214832
    4:      II  0.586691851645424 -0.216393669254661
    
    [[5]]
       dataset          variable1          variable2
    1:      CC    2.3988956446685  0.993219087408849
    2:      EE  0.545675659181279 -0.185124394415505
    3:      FF -0.187478493738627 -0.643696750490574
    4:      GG -0.335332679807122   -0.2908242586079
    5:      JJ  -1.91097794113304  0.886747918349373
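
    If you would rather avoid the data.table dependency, the same idea can be sketched in base R. This is only a rough version under a few assumptions of mine: `sample_step()` is a hypothetical helper name, the seed is arbitrary, and I rebuild the example data with `rep()` so the two variables stay numeric. Note that the example has only 10 distinct datasets (AA to JJ), so a step 11 drawing from 11 different datasets is not possible; the sketch stops with an error in that case rather than sampling with replacement.

```r
# Rebuild the example data (arbitrary seed, only for repeatability)
set.seed(42)
df <- data.frame(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  dataset   = rep(c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"),
                  times = c(5, 15, 15, 10, 10, 10, 10, 5, 5, 15))
)

# sample_step() is a hypothetical helper: at step i it picks i distinct
# datasets, then one random row from each of them.
sample_step <- function(data, i, group_col = "dataset") {
  groups <- unique(data[[group_col]])
  if (i > length(groups)) {
    stop("step ", i, " needs more datasets than the ", length(groups), " available")
  }
  chosen <- sample(groups, i)
  rows <- vapply(chosen, function(g) {
    idx <- which(data[[group_col]] == g)
    idx[sample.int(length(idx), 1)]  # safe even if a dataset has a single row
  }, integer(1))
  data[rows, , drop = FALSE]
}

# One list element per step, up to the number of available datasets (10 here)
n_steps <- length(unique(df$dataset))
samples <- lapply(seq_len(n_steps), function(i) sample_step(df, i))
```

    `samples[[3]]`, for example, should then hold 3 rows from 3 different datasets. Indexing with `sample.int()` instead of calling `sample(idx, 1)` is deliberate: `sample()` treats a single number as `1:n`, which would pick the wrong row for a dataset with only one record.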