I am looking to speed up my code because I will be working with very large data frames. The code below uses two data frames and applies a function that generates a vector of random non-negative integers of length size, whose total equals a specified value sum.
I want to apply this function using the columns N_h and s_aL from the D_T data frame, and assign the results to the s_aL column of the D_H data frame. My code works, but it is slow because of the for loop, so I am looking for a faster way to do this. I thought of using the dplyr package.
Here is a reproducible example:
library(dplyr)  # for the %>% pipe used below

set.seed(1, kind = "Mersenne-Twister", normal.kind = "Inversion")

generateIntegers <- function(size, sum){
  ## Randomly place size - 1 cut points between 0 and sum, add the endpoints, and sort
  v <- sort(c(0, sample(0:sum, size = size - 1, replace = TRUE), sum))
  ## The differences between consecutive cut points give size integers summing to sum
  dv <- diff(v)
  return(dv)
}
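As a quick illustration of what the function returns (run this separately, since any extra call advances the random stream used by the example below; the size and sum values here are arbitrary), it gives size non-negative integers adding up to sum:
x <- generateIntegers(size = 4, sum = 10)
length(x)  # 4
sum(x)     # 10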
D_H <- data.frame(ID = 1:10, patch_ID = sample(1:5, size = 10, replace = TRUE), s_aL = NA) %>%
  dplyr::group_by(patch_ID) %>%
  dplyr::mutate(N_h = n())

D_T <- data.frame(patch_ID = unique(D_H$patch_ID), s_aL = NA) %>%
  dplyr::left_join(unique(D_H[, c("N_h", "patch_ID")]), by = "patch_ID")
D_T$s_aL <- sample(0:5, nrow(D_T), replace = TRUE)
for(i in D_T$patch_ID){
  a <- generateIntegers(size = D_T[which(D_T$patch_ID == i), "N_h"],
                        sum  = D_T[which(D_T$patch_ID == i), "s_aL"])
  D_H[which(D_H$patch_ID == i), "s_aL"] <- a
}
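As a sanity check on the loop (not part of the code I want to speed up), the per-patch sums written into D_H should equal the target values in D_T:
chk <- aggregate(s_aL ~ patch_ID, data = D_H, FUN = sum)
all(chk$s_aL == D_T$s_aL[match(chk$patch_ID, D_T$patch_ID)])  # should be TRUE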
I started along these lines, but I am stuck on how to continue:
test <- D_T %>%
  dplyr::group_by(patch_ID) %>%
  ...
For 500k rows of data, I'm seeing roughly a 90x speedup from the approach below, going from about 670 seconds to 7.6 seconds, with the relative improvement growing as the data gets larger. It seems like the kind of problem where much bigger gains are possible, so I'm curious to see better approaches.
(One caveat: different approaches will perform better at different scales. For under about 2,000 rows of data, my approach tends to be slower than the original one.)
I run generateIntegers on a rowwise-grouped D_T and add an obs key to track elements within each patch_ID, so that D_H can then be joined to this. The main bottleneck here is running generateIntegers separately on each row. Presumably there's a clever way to vectorize that, the challenge being to match the n() and sum constraints per patch_ID.
rowwise_reframe = function() {
  D_H |>
    ungroup() |>
    select(-s_aL) |>
    mutate(obs = row_number(), .by = patch_ID) |>
    left_join(
      # generate the integers row by row, then key them by position within each patch_ID
      D_T |>
        rowwise() |>
        reframe(patch_ID,
                s_aL = generateIntegers(size = N_h, sum = s_aL)) |>
        mutate(obs = row_number(), .by = patch_ID)
      # the join uses the shared columns patch_ID and obs
    ) |>
    select(ID, patch_ID, s_aL, N_h)
}
For comparison, here is the OP's approach wrapped in a function, altered to leave the global variables alone:
OP = function() {
  D_H2 = D_H |> ungroup()
  for(i in D_T$patch_ID){
    a <- generateIntegers(size = D_T[which(D_T$patch_ID == i), "N_h"],
                          sum  = D_T[which(D_T$patch_ID == i), "s_aL"])
    D_H2[which(D_H2$patch_ID == i), "s_aL"] <- a
  }
  D_H2
}
I used this code to create data of varying size:
set.seed(1, kind = "Mersenne-Twister", normal.kind = "Inversion")
n = 5E5
max_sum = n/2

D_H <- data.frame(ID = 1:n,
                  patch_ID = sample(1:max_sum, size = n, replace = TRUE),
                  s_aL = NA) %>%
  dplyr::group_by(patch_ID) %>%
  dplyr::mutate(N_h = n())

D_T <- data.frame(patch_ID = unique(D_H$patch_ID), s_aL = NA) %>%
  dplyr::left_join(unique(D_H[, c("N_h", "patch_ID")]), by = "patch_ID")
D_T$s_aL <- sample(0:max_sum, nrow(D_T), replace = TRUE)
microbenchmark::microbenchmark(
  OP(),
  rowwise_reframe(),
  times = 1,
  check = "identical",
  setup = set.seed(1, kind = "Mersenne-Twister", normal.kind = "Inversion")
)
Unit: seconds
              expr        min         lq       mean     median         uq        max neval
              OP() 670.673825 670.673825 670.673825 670.673825 670.673825 670.673825     1
 rowwise_reframe()   7.615966   7.615966   7.615966   7.615966   7.615966   7.615966     1
EDIT: I'm exploring a vectorized approach to get random numbers with the specified count and sum within each patch_ID. This relies on more grouped calculations, so I'm trying it with dtplyr (a data.table wrapper with dplyr syntax). So far it isn't showing much performance improvement; maybe it would do better at larger scales or with a better implementation.
My approach is to expand the data based on N_h, create random numbers between 0 and 1, arrange them within each patch_ID with the values closest to 0.5 first, and then use rand * s_aL / sum(rand) for each patch_ID, rounded. That gives integers that approximately add up to the intended total, but are often off due to rounding. To fix this, I compute the error and nudge the first abs(error) values up or down so the sum matches the intended total with the least distortion of the original random values. This won't produce the same random values as the OP's approach, but I think they are equivalently random.
data.frame(patch_ID = rep(D_T$patch_ID, times = D_T$N_h),
           s_aL = rep(D_T$s_aL, times = D_T$N_h),
           N_h = rep(D_T$N_h, times = D_T$N_h)) |>
  dtplyr::lazy_dt() |>
  mutate(rand = runif(n())) |>
  # put the values closest to 0.5 first so they absorb the rounding correction
  arrange(patch_ID, abs(rand - 0.5)) |>
  mutate(obs = row_number(),
         # scale the random numbers so each patch sums to roughly s_aL, then round
         s_aL2 = round(rand * s_aL / sum(rand)),
         # error = how far the rounded values are from the target sum
         error = s_aL - sum(s_aL2),
         # nudge the first abs(error) values by 1 so each patch hits s_aL exactly
         s_aL3 = s_aL2 + sign(error) * (obs <= abs(error)),
         .by = patch_ID)
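As a quick verification that the rounding fix lands on the intended totals (this is just a check, not part of any timing; res here stands for the pipeline above assigned to a name and collected with dplyr::as_tibble()), the per-patch sums of s_aL3 should match s_aL:
res |>
  dplyr::summarise(ok = sum(s_aL3) == dplyr::first(s_aL), .by = patch_ID) |>
  dplyr::summarise(all_ok = all(ok))  # should be TRUE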