Tags: r, vectorization, lapply, sample

Efficient sampling of factor variable from dataframe subsets


I have a dataframe df1 with 5 columns. I split df1 by two of them (var1 and var3), which gives a list of dataframes ls1.

For each sub-dataframe x in ls1, I want to sample() from x$var2, x$num times, with probabilities x$probs, as follows.

Create data:

var1 <- rep(LETTERS[seq( from = 1, to = 3 )], each = 6)
var2 <- rep(LETTERS[seq( from = 1, to = 3 )], 6)
var3 <- rep(1:2,3, each = 3)
num <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1 <- as.data.frame(cbind(var1, var2, var3, num, probs))
ls1 <- split(df1, list(df1$var1, df1$var3))

Have a look at the first couple of list elements:

$A.1
  var1 var2 var3 num probs
1    A    A    1  10  0.06
2    A    B    1  10  0.27
3    A    C    1  10  0.23

$B.1
  var1 var2 var3 num probs
7    B    A    1  13  0.93
8    B    B    1  13  0.36
9    B    C    1  13  0.04

lapply over ls1:

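ls1 <- lapply(ls1, function(x) {
  # cbind() above made num and probs non-numeric, hence the conversions
  n <- as.numeric(as.character(x$num[1]))  # num is constant within a group
  p <- as.numeric(as.character(x$probs))
  # factor() keeps zero counts in table() even if var2 is stored as character
  res <- as.data.frame(table(sample(factor(x$var2), size = n, replace = TRUE, prob = p)))
  cbind(x, res = res$Freq)
})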
df2 <- do.call("rbind", ls1)
df2

Have a look at the first couple of list elements of the result:

$A.1
  var1 var2 var3 num probs res
1    A    A    1  10  0.06   2
2    A    B    1  10  0.27   4
3    A    C    1  10  0.23   4

$B.1
  var1 var2 var3 num probs res
7    B    A    1  13  0.93  10
8    B    B    1  13  0.36   3
9    B    C    1  13  0.04   0

So a new variable res is created in each dataframe. Within each dataframe the sum of res equals num, and the values of var2 are represented in res in proportion to probs. This does what I want, but it becomes very slow when there is a lot of data.

My Question: is there a way to replace the lapply piece of code with something more efficient/faster?

I am just beginning to learn about vectorization and am guessing this could be vectorized, but I am unsure how to achieve it.

ls1 is eventually returned to a dataframe structure so if it doesn't need to become a list to begin with all the better (although it doesn't really matter how the data is structured for this step).

Any help would be much appreciated.


Solution

  • First, you should create df1 using data.frame() rather than converting from a matrix. cbind() builds a matrix, and a matrix forces all values to be the same type, even though you have both numeric and character variables.

    df1 <- data.frame(var1, var2, var3, num, probs)
    ls1 <- split(df1, list(df1$var1, df1$var3))  # re-split so the groups hold numeric num and probs
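
To see the coercion concretely, here is a small sketch with two made-up vectors (letters_col and numbers_col are illustrative names, not part of the question's data). It only checks whether the numeric column survives each way of building the dataframe.

```r
# Illustrative vectors (hypothetical names, not from the question)
letters_col <- c("A", "B", "C")
numbers_col <- c(10, 11, 13)

# cbind() builds a matrix, and a matrix holds a single type,
# so the numbers are converted to character strings
via_matrix <- as.data.frame(cbind(letters_col, numbers_col))

# data.frame() keeps each column's own type
direct <- data.frame(letters_col, numbers_col)

is.numeric(via_matrix$numbers_col)  # FALSE
is.numeric(direct$numbers_col)      # TRUE
```

This is why the original lapply code needed as.numeric(as.character(...)) around num and probs.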
    

    Next, use rmultinom() instead of sample(). It is much more efficient because it returns the number of draws for each value in x$var2 directly, so there is nothing left to tabulate:

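    ls1 <- lapply(ls1, function(x) {
        # one multinomial draw of size num; [, 1] turns the K x 1 matrix into a plain vector
        x$res <- rmultinom(1, size = x$num[1], prob = x$probs)[, 1]
        x
    })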
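
As a rough illustration of why the two routes are interchangeable, the sketch below runs both on a single made-up group. The values, n and p are assumptions chosen to resemble the A.1 group. The individual counts will generally differ between the two routes because they are separate random draws from the same distribution, but both always add up to n.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
values <- c("A", "B", "C")       # illustrative categories
n      <- 10                     # number of draws
p      <- c(0.06, 0.27, 0.23)    # weights; they need not sum to 1

# sample() route: n individual draws, then tabulate them
counts_sample <- table(sample(factor(values), size = n, replace = TRUE, prob = p))

# rmultinom() route: the counts come back directly as a K x 1 matrix
counts_multinom <- rmultinom(1, size = n, prob = p)[, 1]
names(counts_multinom) <- values

sum(counts_sample) == n      # TRUE
sum(counts_multinom) == n    # TRUE
```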
    

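    The rmultinom version should be noticeably faster than the sample approach, since each group needs a single call that returns the counts rather than num separate draws followed by table().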
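
Since the question mentions that the data does not have to become a list at all, here is one possible sketch that skips split() and rbind() by using base R's ave() on row indices. This is a different technique from the lapply version above, not a drop-in from the answer, and it still loops over groups internally, so treat it as a convenience rather than true vectorization. The seed is arbitrary, and the data is rebuilt with data.frame() so the block runs on its own.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Same layout as the question's data, built with data.frame()
df1 <- data.frame(
  var1  = rep(LETTERS[1:3], each = 6),
  var2  = rep(LETTERS[1:3], 6),
  var3  = rep(1:2, 3, each = 3),
  num   = rep(c(10, 11, 13, 8, 20, 5), each = 3),
  probs = round(runif(18), 2)
)

# ave() hands FUN the row indices of one var1/var3 group at a time
# and writes the returned counts back into the matching rows
df1$res <- ave(seq_len(nrow(df1)), df1$var1, df1$var3, FUN = function(i) {
  if (length(i) == 0L) return(integer(0))  # guard for combinations with no rows
  rmultinom(1, size = df1$num[i[1]], prob = df1$probs[i])[, 1]
})

# Check: within each group the counts add up to that group's num
group_totals <- tapply(df1$res, list(df1$var1, df1$var3), sum)
group_totals
```

The group totals should match num for each var1/var3 combination (10 and 11 for A, 13 and 8 for B, 20 and 5 for C), whatever the individual counts turn out to be.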