Tags: r, dplyr, vectorization

Vectorize a function that requires grouping in R


Say I have some data with multiple data points, some of which share a group identifier:

group <- rep(c(1:5), times=3)
cost <- rnorm(length(group), 100, 5)
current_score <- rnorm(length(group), 7, 2)
future_score <- current_score*runif(1)

dat <- data.frame(group, cost, current_score, future_score)

and a function that gives an overall weighted group score:

wt_score <- function(group, dat)
{
  one_group_dat <- dat[dat$group == group, ]
  wt_score <- sum(one_group_dat$cost * (one_group_dat$current_score - one_group_dat$future_score))/sum(one_group_dat$cost)
  return(wt_score)
}

Is there a way to vectorize the above function so that I don't have to use a loop like the one below? The issue is that in practice the function is applied to tens of thousands of groups and millions of data points, so the loop is very slow.

# THIS IS TOO SLOW!
dat$wt_score <- 0
for(i in 1:nrow(dat))
{
  dat$wt_score[i] <- wt_score(dat$group[i], dat)
}

Solution

  • You don't need a separate function for this; a grouped mutate computes the score directly:

    library(dplyr) # dplyr_1.1.0
    dat %>%
      mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost), .by = group)
    #    group      cost current_score future_score wt_score wt_score2
    # 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383
    # 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164
    # 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275
    # 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018
    # 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793
    # 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383
    # 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164
    # 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275
    # 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018
    # 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793
    # 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383
    # 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164
    # 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275
    # 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018
    # 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793
    

    (wt_score is the column produced by your for loop; wt_score2 is identical.)
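
    If you assign the grouped mutate back to a frame, you can confirm the two columns match; a quick sketch (dat2 is just a throwaway name):

    dat2 <- dat %>%
      mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost), .by = group)
    all.equal(dat2$wt_score, dat2$wt_score2)
    # should be TRUE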

    (If you're using a dplyr version before 1.1, .by= is not available; remove it and use dat %>% group_by(group) %>% mutate(...) instead.)
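
    For example, the pre-1.1 equivalent of the code above is:

    dat %>%
      group_by(group) %>%
      mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost)) %>%
      ungroup()   # drop the grouping once the column is added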

    While you tagged dplyr, this can be done in base R as well.

    dat$wt_score3 <- ave(
      1:nrow(dat), dat$group,
      FUN = function(z) sum(dat$cost[z] * (dat$current_score[z] - dat$future_score[z]))/sum(dat$cost[z]))
    dat
    #    group      cost current_score future_score wt_score wt_score3
    # 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383
    # 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164
    # 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275
    # 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018
    # 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793
    # 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383
    # 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164
    # 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275
    # 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018
    # 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793
    # 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383
    # 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164
    # 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275
    # 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018
    # 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793
    

    stats::ave passes only a single vector to FUN, so we feed it the row indices instead and do the extraction from the frame inside the anonymous FUNction.
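
    To illustrate why the indices are needed: FUN only ever sees the (group-split) vector passed as the first argument, so a single-column summary works directly, while the weighted score needs several columns at once.

    # FUN sees only the first argument, split by group ...
    ave(dat$current_score, dat$group, FUN = mean)
    # ... whereas the weighted score needs cost, current_score and
    # future_score together, hence passing row indices above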


    Benchmarking with a small frame is premature and should not be trusted. For example, using bench::mark on the three approaches with only 15 rows, we see

    # A tibble: 3 × 13
      expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time               gc                  
      <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>              <list>             <list>              
    1 ave         48.27µs  56.18µs    17541.        0B     8.75  8020     4      457ms <NULL> <Rprofmem [0 × 3]>  <bench_tm [8,024]> <tibble [8,024 × 3]>
    2 dplyr        1.06ms   1.13ms      882.    4.85KB     5.35   330     2      374ms <NULL> <Rprofmem [16 × 3]> <bench_tm [332]>   <tibble [332 × 3]>  
    3 forloop    524.06µs 621.33µs     1622.        0B     8.71   745     4      459ms <NULL> <Rprofmem [0 × 3]>  <bench_tm [749]>   <tibble [749 × 3]>  
    

    where `itr/sec` (higher is better) and median (lower is better) are two good measures. However, if we go with a much larger frame,

    datbig <- bind_rows(replicate(1000, dat, simplify=FALSE))
    nrow(datbig)
    # [1] 15000
    
    bench::mark(...)
    # A tibble: 3 × 13
      expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc                
      <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>            
    1 ave        764.12µs 833.87µs   971.       1.39MB     9.85   493     5   507.87ms <NULL> <Rprofmem [42 × 3]>      <bench_tm [493]> <tibble [493 × 3]>
    2 dplyr        1.34ms   1.41ms   666.    1013.19KB     4.00   333     2    500.2ms <NULL> <Rprofmem [49 × 3]>      <bench_tm [333]> <tibble [333 × 3]>
    3 forloop       3.58s    3.58s     0.280   12.38GB    21.5      1    77      3.58s <NULL> <Rprofmem [375,000 × 3]> <bench_tm [1]>   <tibble [1 × 3]>  
    

    we can see rather different gains. Another (big!) factor is the number of groups; I suspect having many more groups would change the results again. (The per-row for loop is abysmally slow here ...)
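
    For reference, the comparison above can be set up along these lines (a sketch; check = FALSE because the three expressions do not return identical objects):

    library(bench)
    bench::mark(
      ave = ave(seq_len(nrow(datbig)), datbig$group,
                FUN = function(z) sum(datbig$cost[z] * (datbig$current_score[z] - datbig$future_score[z])) /
                  sum(datbig$cost[z])),
      dplyr = datbig %>%
        mutate(wt_score2 = sum(cost * (current_score - future_score))/sum(cost), .by = group),
      forloop = {
        out <- numeric(nrow(datbig))
        for (i in seq_len(nrow(datbig))) out[i] <- wt_score(datbig$group[i], datbig)
        out
      },
      check = FALSE
    )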


    Another alternative: iterate over the unique groups instead of each row.

    for (grp in unique(dat$group)) {
      dat$wt_score3[dat$group == grp] <- wt_score(grp, dat)
    }
    dat
    #    group      cost current_score future_score wt_score wt_score3
    # 1      1 110.95492      9.377867   0.05744175 7.745383  7.745383
    # 2      2 103.20114      4.725669   0.02894589 6.618164  6.618164
    # 3      3 100.61319      9.533577   0.05839552 7.247275  7.247275
    # 4      4  98.45976      3.680751   0.02254551 6.894018  6.894018
    # 5      5 104.48396      5.964945   0.03653676 7.095793  7.095793
    # 6      1 104.05087      6.718559   0.04115283 7.745383  7.745383
    # 7      2  98.18486      9.166696   0.05614828 6.618164  6.618164
    # 8      3  93.50934      7.118524   0.04360272 7.247275  7.247275
    # 9      4 101.75665      6.385142   0.03911057 6.894018  6.894018
    # 10     5 100.79521      7.129071   0.04366732 7.095793  7.095793
    # 11     1  92.10072      7.097935   0.04347660 7.745383  7.745383
    # 12     2 104.64702      6.212637   0.03805394 6.618164  6.618164
    # 13     3  95.33966      5.096398   0.03121670 7.247275  7.247275
    # 14     4 107.13121     10.452435   0.06402375 6.894018  6.894018
    # 15     5 102.71442      8.344599   0.05111273 7.095793  7.095793
    

    Instead of iterating nrow(dat) times, we iterate only length(unique(dat$group)) times, which is significantly faster than the original per-row loop. Here is the comparison (using datbig).

    # A tibble: 4 × 13
      expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc                
      <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>            
    1 ave        771.51µs 857.27µs   935.       1.39MB     9.76   479     5   512.55ms <NULL> <Rprofmem [42 × 3]>      <bench_tm [479]> <tibble [479 × 3]>
    2 dplyr        1.39ms   1.51ms   617.    1013.19KB     4.00   309     2   500.44ms <NULL> <Rprofmem [49 × 3]>      <bench_tm [309]> <tibble [309 × 3]>
    3 forloop       3.83s    3.83s     0.261   12.38GB    21.9      1    84      3.83s <NULL> <Rprofmem [375,000 × 3]> <bench_tm [1]>   <tibble [1 × 3]>  
    4 forloop2   932.03µs 974.12µs   758.       5.04MB    22.0    379    11   500.28ms <NULL> <Rprofmem [142 × 3]>     <bench_tm [379]> <tibble [379 × 3]>
    

    This shows that the reduced for loop is on par with ave and dplyr.

    (I can't help but notice the mem_alloc difference as well: 12GB compared with 1-5MB. That alone is another very good reason why a per-row for loop is not the wisest choice.)
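
    A related refinement (a sketch, writing into a hypothetical wt_score4 column): compute the score once per unique group with vapply() and map it back onto the rows with match(), instead of assigning into a logical subset on every pass.

    grps <- unique(dat$group)
    # one weighted score per group ...
    scores <- vapply(grps, wt_score, numeric(1), dat = dat)
    # ... broadcast back onto the individual rows
    dat$wt_score4 <- scores[match(dat$group, grps)]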

    Another alternative, if you want to squeeze some more speed out of it:

    library(data.table)
    DT <- as.data.table(datbig)
    DT[, wt_score := sum(cost * (current_score - future_score))/sum(cost), by = .(group)]
    
    # A tibble: 5 × 13
      expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time             gc                
      <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>           <list>            
    1 ave        754.55µs 837.44µs  1020.       1.39MB    10.0    510     5    500.1ms <NULL> <Rprofmem [42 × 3]>      <bench_tm [510]> <tibble [510 × 3]>
    2 dplyr        1.32ms   1.38ms   679.    1013.19KB     4.00   340     2   500.41ms <NULL> <Rprofmem [49 × 3]>      <bench_tm [340]> <tibble [340 × 3]>
    3 forloop       4.11s    4.11s     0.243   12.38GB    22.4      1    92      4.11s <NULL> <Rprofmem [375,007 × 3]> <bench_tm [1]>   <tibble [1 × 3]>  
    4 forloop2   893.94µs 958.22µs   785.       5.04MB    22.0    393    11   500.33ms <NULL> <Rprofmem [142 × 3]>     <bench_tm [393]> <tibble [393 × 3]>
    5 datatable  442.91µs 481.71µs  1931.     326.06KB     4.00   966     2   500.22ms <NULL> <Rprofmem [17 × 3]>      <bench_tm [966]> <tibble [966 × 3]>
    

    But I do not recommend switching to data.table lightly: it is fast and works really well here, but it has a very different "dialect" compared to base R and dplyr. Learn to do one really well before moving on to the other. Only if your data is gargantuan (subjective and relative) would I really urge you to make the switch.

    (I really do like the significantly-reduced mem_alloc for the DT implementation ...)
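
    (If memory is the main concern, setDT() converts a data.frame to a data.table by reference, so even the copy made by as.data.table() above can be skipped; a sketch, assuming you no longer need the plain-data.frame version of datbig:)

    library(data.table)
    setDT(datbig)   # converts in place, no copy
    datbig[, wt_score := sum(cost * (current_score - future_score))/sum(cost), by = .(group)]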