Search code examples
rmatrixsamplingresampling

Resampling from subject id's in R


Assume we have the following data

set.seed(123)
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
> [1] 2 4 2

The sampledIDs is a vector of id's that is sampled (with replacement) from dat$id. I need the code that results in (and works also for a large dataset with more variables):

  var1 id
   13  2
   19  2
   15  2
   19  4
   13  2
   19  2
   15  2

The code dat[which(dat$id%in%sampledIDs),] does not give me what I want, since the the result of this code is

  var1 id
    13  2
    19  2
    15  2
    19  4

where the subject with dat$id==2 appears only once in this data (I understand why this is the result, but don't know how to get what I want). Can someone please help?


EDIT: Thank you for the answers, here the runtime of all answers (for those who are interested):

                                                                 test replications elapsed relative user.self
3   dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]         1000    0.67    1.000      0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ]         1000    0.67    1.000      0.67
2        do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])         1000    1.83    2.731      1.83
4                               setkey(setDT(dat), id)[J(sampledIDs)]         1000    1.33    1.985      1.33

Solution

  • This would be probably the fastest approach for a big data set using data.table binary search

    library(data.table)
    setkey(setDT(dat), id)[J(sampledIDs)]
    #    var1 id
    # 1:   13  2
    # 2:   19  2
    # 3:   15  2
    # 4:   19  4
    # 5:   13  2
    # 6:   19  2
    # 7:   15  2
    

    Edit: Here's a benchmark for a not so big data set (1e+05 rows) which illustrates which is the clear winner

    library(data.table)
    library(microbenchmark)
    
    set.seed(123)
    n <- 1e5
    dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE))
    (sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
    dat2 <- copy(dat)
    
    Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
    Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
    flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
    David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]
    
    Res <- microbenchmark(Sven1(dat),
                          Sven2(dat), 
                          flodel(dat), 
                          David(dat2))
    Res
    # Unit: milliseconds
    #        expr       min        lq    median        uq       max neval
    #  Sven1(dat)  4.356151  4.817557  6.715533  7.313877 45.407768   100
    #  Sven2(dat)  9.750984 12.385677 14.324671 16.655005 54.797096   100
    # flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879   100
    # David(dat2)  1.813387  2.068749  2.154774  2.335442  8.665379   100
    
    boxplot(Res)
    

    enter image description here


    If, for example, we would like to sample more then just 3 Ids, but lets say, 10, the benchmark becomes ridiculous

    (sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
    [1]  7  6 10  9  5  9  5  3  7  3
    # Unit: milliseconds
    #       expr        min         lq     median         uq       max neval
    #  Sven1(dat)  80.124502  89.141162  97.908365 104.111738 175.40919   100
    #  Sven2(dat)  99.010410 127.797966 159.404395 170.751069 209.96887   100
    # flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293   100
    # David(dat2)   2.431682   2.721038   2.855103   3.057796  19.60826   100
    

    enter image description here