Say I have some data that has multiple data points where some share a group identifier:
group <- rep(c(1:5), times=3)
cost <- rnorm(length(group), 100, 5)
current_score <- rnorm(length(group), 7, 2)
future_score <- current_score*runif(1)
dat <- data.frame(group, cost, current_score, future_score)
and a function that gives an overall weighted group score:
wt_score <- function(group, dat)
{
one_group_dat <- dat[dat$group == group, ]
wt_score <- sum(one_group_dat$cost * (one_group_dat$current_score - one_group_dat$future_score))/sum(one_group_dat$cost)
return(wt_score)
}
Is there a way to vectorize the above function so that I don't have to use a loop like the one below? The issue is that, in practice, the function is applied to tens of thousands of groups and millions of data points, so the loop is very slow.
# THIS IS TOO SLOW!
dat$wt_score <- 0
for(i in 1:nrow(dat))
{
dat$wt_score[i] <- wt_score(dat$group[i], dat)
}
You don't need to have a function:
library(dplyr) # dplyr_1.1.0
dat %>%
mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost), .by = group)
# group cost current_score future_score wt_score wt_score2
# 1 1 110.95492 9.377867 0.05744175 7.745383 7.745383
# 2 2 103.20114 4.725669 0.02894589 6.618164 6.618164
# 3 3 100.61319 9.533577 0.05839552 7.247275 7.247275
# 4 4 98.45976 3.680751 0.02254551 6.894018 6.894018
# 5 5 104.48396 5.964945 0.03653676 7.095793 7.095793
# 6 1 104.05087 6.718559 0.04115283 7.745383 7.745383
# 7 2 98.18486 9.166696 0.05614828 6.618164 6.618164
# 8 3 93.50934 7.118524 0.04360272 7.247275 7.247275
# 9 4 101.75665 6.385142 0.03911057 6.894018 6.894018
# 10 5 100.79521 7.129071 0.04366732 7.095793 7.095793
# 11 1 92.10072 7.097935 0.04347660 7.745383 7.745383
# 12 2 104.64702 6.212637 0.03805394 6.618164 6.618164
# 13 3 95.33966 5.096398 0.03121670 7.247275 7.247275
# 14 4 107.13121 10.452435 0.06402375 6.894018 6.894018
# 15 5 102.71442 8.344599 0.05111273 7.095793 7.095793
(wt_score is from your for loop; wt_score2 is identical.)
(If you're using a dplyr version before 1.1, then you cannot use .by=; remove that and instead use dat %>% group_by(group) %>% mutate(...).)
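For completeness, a minimal sketch of that pre-1.1 variant (same computation, just with explicit grouping; the ungroup() at the end simply avoids leaving the result grouped):

dat %>%
  group_by(group) %>%
  mutate(wt_score2 = sum(cost * (current_score - future_score)) / sum(cost)) %>%
  ungroup()   # drop the grouping so later operations are not silently grouped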
While you tagged dplyr, this can be done in base R as well.
dat$wt_score3 <- ave(
1:nrow(dat), dat$group,
FUN = function(z) sum(dat$cost[z] * (dat$current_score[z] - dat$future_score[z]))/sum(dat$cost[z]))
dat
# group cost current_score future_score wt_score wt_score3
# 1 1 110.95492 9.377867 0.05744175 7.745383 7.745383
# 2 2 103.20114 4.725669 0.02894589 6.618164 6.618164
# 3 3 100.61319 9.533577 0.05839552 7.247275 7.247275
# 4 4 98.45976 3.680751 0.02254551 6.894018 6.894018
# 5 5 104.48396 5.964945 0.03653676 7.095793 7.095793
# 6 1 104.05087 6.718559 0.04115283 7.745383 7.745383
# 7 2 98.18486 9.166696 0.05614828 6.618164 6.618164
# 8 3 93.50934 7.118524 0.04360272 7.247275 7.247275
# 9 4 101.75665 6.385142 0.03911057 6.894018 6.894018
# 10 5 100.79521 7.129071 0.04366732 7.095793 7.095793
# 11 1 92.10072 7.097935 0.04347660 7.745383 7.745383
# 12 2 104.64702 6.212637 0.03805394 6.618164 6.618164
# 13 3 95.33966 5.096398 0.03121670 7.247275 7.247275
# 14 4 107.13121 10.452435 0.06402375 6.894018 6.894018
# 15 5 102.71442 8.344599 0.05111273 7.095793 7.095793
stats::ave operates on a single vector (it does not take multiple data arguments), so we feed it the row indices instead and extract the needed columns from the frame inside the anonymous function passed to FUN.
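The same row-index trick also lets you reuse the original wt_score() unchanged; a sketch (wt_score4 is just an illustrative column name):

# ave() recycles FUN's single return value across all rows of each group
dat$wt_score4 <- ave(seq_len(nrow(dat)), dat$group,
                     FUN = function(z) wt_score(dat$group[z[1]], dat))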
Benchmarking with a small frame is premature and should not be trusted. For example, running bench::mark on the three approaches with only 15 rows, we see
# A tibble: 3 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 ave 48.27µs 56.18µs 17541. 0B 8.75 8020 4 457ms <NULL> <Rprofmem [0 × 3]> <bench_tm [8,024]> <tibble [8,024 × 3]>
2 dplyr 1.06ms 1.13ms 882. 4.85KB 5.35 330 2 374ms <NULL> <Rprofmem [16 × 3]> <bench_tm [332]> <tibble [332 × 3]>
3 forloop 524.06µs 621.33µs 1622. 0B 8.71 745 4 459ms <NULL> <Rprofmem [0 × 3]> <bench_tm [749]> <tibble [749 × 3]>
where `itr/sec` (higher is better) and median (lower is better) are two good measures. However, if we go with a much larger frame,
datbig <- bind_rows(replicate(1000, dat, simplify=FALSE))
nrow(datbig)
# [1] 15000
bench::mark(...)
# A tibble: 3 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 ave 764.12µs 833.87µs 971. 1.39MB 9.85 493 5 507.87ms <NULL> <Rprofmem [42 × 3]> <bench_tm [493]> <tibble [493 × 3]>
2 dplyr 1.34ms 1.41ms 666. 1013.19KB 4.00 333 2 500.2ms <NULL> <Rprofmem [49 × 3]> <bench_tm [333]> <tibble [333 × 3]>
3 forloop 3.58s 3.58s 0.280 12.38GB 21.5 1 77 3.58s <NULL> <Rprofmem [375,000 × 3]> <bench_tm [1]> <tibble [1 × 3]>
we can see rather different gains. Another (big!) factor is the number of groups; I suspect many more groups would change the results again. (The for loop is abysmally slow here ...)
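The exact bench::mark() call was elided above; here is a sketch of how such a comparison can be set up (check = FALSE because the approaches return different kinds of objects, and the expression names match the rows in the tables):

bench::mark(
  ave = ave(seq_len(nrow(datbig)), datbig$group,
            FUN = function(z) sum(datbig$cost[z] * (datbig$current_score[z] - datbig$future_score[z])) /
                              sum(datbig$cost[z])),
  dplyr = datbig %>%
    mutate(wt_score2 = sum(cost * (current_score - future_score)) / sum(cost), .by = group),
  forloop = {
    for (i in 1:nrow(datbig)) datbig$wt_score[i] <- wt_score(datbig$group[i], datbig)
  },
  check = FALSE
)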
Another alternative: iterate over the unique groups instead of each row.
dat$wt_score3 <- NA_real_   # initialize the column so the per-group assignment works on a fresh frame
for (grp in unique(dat$group)) {
  dat$wt_score3[dat$group == grp] <- wt_score(grp, dat)
}
dat
# group cost current_score future_score wt_score wt_score3
# 1 1 110.95492 9.377867 0.05744175 7.745383 7.745383
# 2 2 103.20114 4.725669 0.02894589 6.618164 6.618164
# 3 3 100.61319 9.533577 0.05839552 7.247275 7.247275
# 4 4 98.45976 3.680751 0.02254551 6.894018 6.894018
# 5 5 104.48396 5.964945 0.03653676 7.095793 7.095793
# 6 1 104.05087 6.718559 0.04115283 7.745383 7.745383
# 7 2 98.18486 9.166696 0.05614828 6.618164 6.618164
# 8 3 93.50934 7.118524 0.04360272 7.247275 7.247275
# 9 4 101.75665 6.385142 0.03911057 6.894018 6.894018
# 10 5 100.79521 7.129071 0.04366732 7.095793 7.095793
# 11 1 92.10072 7.097935 0.04347660 7.745383 7.745383
# 12 2 104.64702 6.212637 0.03805394 6.618164 6.618164
# 13 3 95.33966 5.096398 0.03121670 7.247275 7.247275
# 14 4 107.13121 10.452435 0.06402375 6.894018 6.894018
# 15 5 102.71442 8.344599 0.05111273 7.095793 7.095793
Instead of iterating nrow(dat) times, we iterate only length(unique(dat$group)) times. This is significantly faster than the first for loop. Here is the comparison (using datbig).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 ave 771.51µs 857.27µs 935. 1.39MB 9.76 479 5 512.55ms <NULL> <Rprofmem [42 × 3]> <bench_tm [479]> <tibble [479 × 3]>
2 dplyr 1.39ms 1.51ms 617. 1013.19KB 4.00 309 2 500.44ms <NULL> <Rprofmem [49 × 3]> <bench_tm [309]> <tibble [309 × 3]>
3 forloop 3.83s 3.83s 0.261 12.38GB 21.9 1 84 3.83s <NULL> <Rprofmem [375,000 × 3]> <bench_tm [1]> <tibble [1 × 3]>
4 forloop2 932.03µs 974.12µs 758. 5.04MB 22.0 379 11 500.28ms <NULL> <Rprofmem [142 × 3]> <bench_tm [379]> <tibble [379 × 3]>
This shows that the reduced for loop is on par with ave and dplyr.
(I can't help but notice the mem_alloc difference as well: 12GB compared with 1-5MB? That is yet another very good reason why a per-row for loop is not the wisest choice.)
Another alternative, if you want to squeeze some more speed out of it:
library(data.table)
DT <- as.data.table(datbig)
DT[, wt_score := sum(cost * (current_score - future_score))/sum(cost), by = .(group)]
# A tibble: 5 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 ave 754.55µs 837.44µs 1020. 1.39MB 10.0 510 5 500.1ms <NULL> <Rprofmem [42 × 3]> <bench_tm [510]> <tibble [510 × 3]>
2 dplyr 1.32ms 1.38ms 679. 1013.19KB 4.00 340 2 500.41ms <NULL> <Rprofmem [49 × 3]> <bench_tm [340]> <tibble [340 × 3]>
3 forloop 4.11s 4.11s 0.243 12.38GB 22.4 1 92 4.11s <NULL> <Rprofmem [375,007 × 3]> <bench_tm [1]> <tibble [1 × 3]>
4 forloop2 893.94µs 958.22µs 785. 5.04MB 22.0 393 11 500.33ms <NULL> <Rprofmem [142 × 3]> <bench_tm [393]> <tibble [393 × 3]>
5 datatable 442.91µs 481.71µs 1931. 326.06KB 4.00 966 2 500.22ms <NULL> <Rprofmem [17 × 3]> <bench_tm [966]> <tibble [966 × 3]>
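If the copy made by as.data.table() is a concern, data.table can also convert the existing frame in place; a sketch, assuming datbig itself may be modified by reference:

setDT(datbig)   # convert to a data.table by reference, no copy
datbig[, wt_score := sum(cost * (current_score - future_score))/sum(cost), by = .(group)]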
But I do not recommend switching to data.table on a whim: it is fast and works really well here, but it has a very different "dialect" from base R and dplyr. Learn to do one really well before moving on to another. Only if your data is gargantuan (subjective and relative) would I really urge you to make the switch.
(I really do like the significantly reduced mem_alloc for the DT implementation ...)