I have a dataframe df1 which contains 6 columns, two of which (var1 & var3) I am using to split df1 by, resulting in a list of dataframes ls1. For each sub dataframe in ls1 I want to sample() x$var2, x$num times with x$probs probabilities as follows:
Create data:
var1 <- rep(LETTERS[seq( from = 1, to = 3 )], each = 6)
var2 <- rep(LETTERS[seq( from = 1, to = 3 )], 6)
var3 <- rep(1:2,3, each = 3)
num <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1 <- as.data.frame(cbind(var1, var2, var3, num, probs))
ls1 <- split(df1, list(df1$var1, df1$var3))
Have a look at the first couple of list elements:
$A.1
var1 var2 var3 num probs
1 A A 1 10 0.06
2 A B 1 10 0.27
3 A C 1 10 0.23
$B.1
var1 var2 var3 num probs
7 B A 1 13 0.93
8 B B 1 13 0.36
9 B C 1 13 0.04
lapply over ls1:
ls1 <- lapply(ls1, function(x) {
res <- table(sample(x$var2, size = as.numeric(as.character(x$num)),
replace = TRUE, prob = as.numeric(as.character(x$probs))))
res <- as.data.frame(res)
cbind(x, res = res$Freq)
})
df2 <- do.call("rbind", ls1)
df2
Have a look at the first couple list elements of the result:
$A.1
var1 var2 var3 num probs res
1 A A 1 10 0.06 2
2 A B 1 10 0.27 4
3 A C 1 10 0.23 4
$B.1
var1 var2 var3 num probs res
7 B A 1 13 0.93 10
8 B B 1 13 0.36 3
9 B C 1 13 0.04 0
So for each dataframe a new variable res is created, the sum of res equals num, and the elements of var2 are represented in res in proportions relating to probs. This does what I want, but it becomes very slow when there is a lot of data.
My question: is there a way to replace the lapply piece of code with something more efficient/faster?
I am just beginning to learn about vectorization and am guessing this could be vectorized, but I am unsure of how to achieve it.
ls1 is eventually returned to a dataframe structure, so if it doesn't need to become a list to begin with, all the better (although it doesn't really matter how the data is structured for this step).
Any help would be much appreciated.
First, you should create df1 using data.frame() rather than converting from a matrix, because cbind() builds a matrix, and a matrix forces all values to be the same type even though you have both numeric and character variables.
df1 <- data.frame(var1, var2, var3, num, probs)
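To make the type difference concrete, here is a minimal check. The two-element vectors are illustrative values only, not the data from the question:

```r
# Illustrative values only (not the question's data)
var1 <- c("A", "B")
num  <- c(10, 13)

# cbind() builds a character matrix first, so num loses its numeric type
via_matrix <- as.data.frame(cbind(var1, num))

# data.frame() keeps each column's own type
direct <- data.frame(var1, num)

is.numeric(via_matrix$num)  # FALSE
is.numeric(direct$num)      # TRUE
```

With numeric columns in place, the as.numeric(as.character(...)) conversions from the question are no longer needed.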
Next, instead of the sample function, use the rmultinom function, which is much more efficient because it directly returns the number of draws for each value in x$var2:
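# Assumes ls1 was re-created with split() from the data.frame() version of df1,
# so that num and probs are numeric.
ls1 <- lapply(ls1, function(x) {
  # rmultinom() returns a K x 1 matrix of counts; as.vector() makes it a plain column
  x$res <- as.vector(rmultinom(1, x$num[1], x$probs))
  x
})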
This should be noticeably faster than the sample approach.
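For reference, here is the whole pipeline as one runnable sketch. It reuses the data construction from the question; the set.seed() call and the names ls2 and group_ok are my own additions for reproducibility and checking, not part of the original code. Note that rmultinom normalizes prob internally, so probs does not have to sum to 1 within a group.

```r
set.seed(1)  # arbitrary seed, only so that the run is reproducible

# Same construction as in the question, but via data.frame() so types are kept
var1  <- rep(LETTERS[1:3], each = 6)
var2  <- rep(LETTERS[1:3], 6)
var3  <- rep(1:2, 3, each = 3)
num   <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1   <- data.frame(var1, var2, var3, num, probs)

ls1 <- split(df1, list(df1$var1, df1$var3))

# One multinomial draw per group: num[1] trials spread over the rows of the group
ls2 <- lapply(ls1, function(x) {
  x$res <- as.vector(rmultinom(1, size = x$num[1], prob = x$probs))
  x
})
df2 <- do.call("rbind", ls2)

# Sanity check: within every group the counts in res add up to num
group_ok <- sapply(ls2, function(x) sum(x$res) == x$num[1])
all(group_ok)
```

The lapply is still there, but each group now costs a single rmultinom call that yields the counts directly, instead of drawing num individual samples and tabulating them.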