Search code examples
rdplyrvectorization

How to create new column that cumulatively sums to cumulative sum of existing column?


Using tidyverse functions, I'm looking to create a new column of data that sums up to the cumulative sum in the first column, but uses increments no greater than incr.

That could start with df

incr <- 1.5
df <- data.frame(a = c(6,0,0,2.5,0,0))
df
    a
1 6.0
2 0.0
3 0.0
4 2.5
5 0.0
6 0.0

then create a new column b using incr:

    a   b
1 6.0 1.5
2 0.0 1.5
3 0.0 1.5
4 2.5 1.5
5 0.0 1.5
6 0.0 1.0

I found tidyr::uncount() and it seemed like a function that could work, but I need to use non-integer increments. Generally trying to mutate and use vectorized functions and have put some thought towards running it rowwise(), but most of my ideas require iteration.

From comment below: If analogies are helpful, think of it like queued downloads. Column a shows you press to download 6 MB at time 1, 0 MB for times 2 and 3, then you press to download 2.5 MB at time 4. Your connection, however, can only download at the speed of incr. So, if incr is 1.5, column b shows what actually downloaded. You fully use that connection speed each period until you download the final residual (1.0) in time 6.

To better highlight the dimensionality, here's another df:

incr <- 1.5
df <- data.frame(a = rep(0,100),b=rep(0,100))
df$a[c(30,33,38)] = c(6,2.5,1)
df[30:39,]
     a b
30 6.0 0
31 0.0 0
32 0.0 0
33 2.5 0
34 0.0 0
35 0.0 0
36 0.0 0
37 0.0 0
38 1.0 0
39 0.0 0

with a desired output

     a   b
30 6.0 1.5
31 0.0 1.5
32 0.0 1.5
33 2.5 1.5
34 0.0 1.5
35 0.0 1.0
36 0.0 0.0
37 0.0 0.0
38 1.0 1.0
39 0.0 0.0

Solution

  • because I don't know when to give up, I think the only solution is iteration. But I think you can always do iteration in C++ to make it as super-speedy as a vectorised function:

    Edited as more efficient function (that doesn't keep calculating fresh column sums)

    Rcpp::cppFunction(
      "
        NumericVector iterate_to_cumsum(NumericVector v1, double incr) {
        int x = v1.size();
        NumericVector v2(x);
      
        double sumv1 = 0;
        double sumv2 = 0;
      
        for (int i = 0; i < x; ++i) {
          sumv1 += v1[i];
          v2[i] = std::min(incr, sumv1 - sumv2);
          sumv2 += v2[i];
        }
    
        return v2;
    
      }
    "
    )
    
    library(tidyverse)
    
    df <- data.frame(a = c(6, 0, 0, 2.5, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0))
    incr <- 1.5
    
    df |> 
      mutate(b = iterate_to_cumsum(a, incr))
    #>      a   b
    #> 1  6.0 1.5
    #> 2  0.0 1.5
    #> 3  0.0 1.5
    #> 4  2.5 1.5
    #> 5  0.0 1.5
    #> 6  0.0 1.0
    #> 7  0.0 0.0
    #> 8  0.0 0.0
    #> 9  1.0 1.0
    #> 10 0.0 0.0
    #> 11 3.0 1.5
    #> 12 0.0 1.5
    #> 13 0.0 0.0
    #> 14 0.0 0.0
    

    An actually working Rcpp function does indeed speed things up a little (running tests as in @LMc's post):

    # @Andy Baxter
    cpp <- function(){
      df |> 
        mutate(b = iterate_to_cumsum(a, incr))
    }
    
    # @LMc
    fun_factory <- function(){
      df |>
        mutate(b = map_dbl(a, incr_cap(1.5)))
    }
    
    bench::mark(
      cpp(),
      fun_factory()
    )
    #> Warning: Some expressions had a GC in every iteration; so filtering is
    #> disabled.
    #> # A tibble: 2 × 6
    #>   expression         min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 cpp()           1.53ms   1.77ms    474.      2.15MB     8.00
    #> 2 fun_factory() 116.79ms 124.31ms      8.15  786.48KB    19.6