Tags: performance, r, loops, sampling

Efficiently sample a data frame avoiding loops


I have a data frame whose first column (experiment.id) holds a unique experiment id per row, and whose remaining columns are the values associated with that experiment id. The data frame has on the order of 10⁴ to 10⁵ columns.

data.frame(experiment.id = 1:100, v1 = rnorm(100, 1, 2), v2 = rnorm(100, -1, 2))

This data frame is the source of my sample space. What I would like to do is, for each unique experiment.id (row), randomly sample (with replacement) one of the values v1, v2, ..., v10000 associated with that id, and construct a sample s1. Every experiment id is represented in each sample s1.

Eventually I want to draw 10⁴ samples, s1, s2, ..., s10⁴, and calculate some statistic on them.

What would be the most computationally efficient way to perform this sampling process? I would like to avoid for loops as much as possible.

Update: My question is not only about the sampling but also about storing the samples. I guess my real question is whether there is a quicker way to perform the above than

d <- data.frame(experiment.id = 1:1000, replicate(10000, rnorm(1000, 100, 2)))
results <- data.frame(d$experiment.id,
                      replicate(n = 10000,
                                apply(d[, 2:10001], 1,
                                      function(x) sample(x, size = 1, replace = TRUE))))

Solution

  • The shortest and most readable, IMHO, is still to use apply, but making good use of the fact that sample accepts extra arguments through apply:

    results <- data.frame(experiment.id = d$experiment.id,
                          t(apply(d[, -1], 1, sample, 10000, replace = TRUE)))
    

    If the roughly 3 seconds this takes is too slow for your needs, then I would recommend you use matrix indexing.
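
    Matrix indexing is recommended above but not shown, so here is a minimal sketch of one way it could look (an interpretation, not the answerer's code, using the question's small two-column d for illustration): draw one random column index for every (row, sample) pair, then pull all the values out of the value matrix with a single two-column (row, column) index matrix, avoiding apply entirely.

    ```r
    set.seed(42)  # illustrative seed, not from the original post
    d <- data.frame(experiment.id = 1:100,
                    v1 = rnorm(100, 1, 2),
                    v2 = rnorm(100, -1, 2))

    m         <- as.matrix(d[, -1])   # values only, one row per experiment id
    n.samples <- 10000                # number of samples s1, ..., s10000

    # One random column index for every (row, sample) pair ...
    rows <- rep(seq_len(nrow(m)), times = n.samples)
    cols <- sample(ncol(m), length(rows), replace = TRUE)

    # ... then a single vectorized lookup: m[cbind(rows, cols)] returns the
    # value at each (row, column) pair; filling the result column by column
    # into a matrix makes its j-th column the sample s_j.
    results <- data.frame(experiment.id = d$experiment.id,
                          matrix(m[cbind(rows, cols)], nrow = nrow(m)))
    ```

    Every entry in row i of results is drawn from d[i, -1], so a per-sample statistic can then be computed column-wise, e.g. colMeans(results[, -1]).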