Using tidyverse functions, I'm looking to create a new column of data that sums up to the cumulative sum in the first column, but uses increments no greater than incr
.
That could start with df
incr <- 1.5
df <- data.frame(a = c(6,0,0,2.5,0,0))
df
a
1 6.0
2 0.0
3 0.0
4 2.5
5 0.0
6 0.0
then create a new column b
using incr
:
a b
1 6.0 1.5
2 0.0 1.5
3 0.0 1.5
4 2.5 1.5
5 0.0 1.5
6 0.0 1.0
I found tidyr::uncount()
and it seemed like a function that could work, but I need to use non-integer increments. Generally trying to mutate and use vectorized functions and have put some thought towards running it rowwise()
, but most of my ideas require iteration.
From comment below: If analogies are helpful, think of it like queued downloads. Column a
shows you press to download 6 MB at time 1, 0 MB for times 2 and 3, then you press to download 2.5 MB at time 4. Your connection, however, can only download at the speed of incr
. So, if incr
is 1.5, column b
shows what actually downloaded. You fully use that connection speed each period until you download the final residual (1.0
) in time 6.
To better highlight the dimensionality, here's another df:
incr <- 1.5
df <- data.frame(a = rep(0,100),b=rep(0,100))
df$a[c(30,33,38)] = c(6,2.5,1)
df[30:39,]
a b
30 6.0 0
31 0.0 0
32 0.0 0
33 2.5 0
34 0.0 0
35 0.0 0
36 0.0 0
37 0.0 0
38 1.0 0
39 0.0 0
with a desired output
a b
30 6.0 1.5
31 0.0 1.5
32 0.0 1.5
33 2.5 1.5
34 0.0 1.5
35 0.0 1.0
36 0.0 0.0
37 0.0 0.0
38 1.0 1.0
39 0.0 0.0
because I don't know when to give up, I think the only solution is iteration. But I think you can always do iteration in C++ to make it as super-speedy as a vectorised function:
Rcpp::cppFunction(
"
NumericVector iterate_to_cumsum(NumericVector v1, double incr) {
int x = v1.size();
NumericVector v2(x);
double sumv1 = 0;
double sumv2 = 0;
for (int i = 0; i < x; ++i) {
sumv1 += v1[i];
v2[i] = std::min(incr, sumv1 - sumv2);
sumv2 += v2[i];
}
return v2;
}
"
)
library(tidyverse)
df <- data.frame(a = c(6, 0, 0, 2.5, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0))
incr <- 1.5
df |>
mutate(b = iterate_to_cumsum(a, incr))
#> a b
#> 1 6.0 1.5
#> 2 0.0 1.5
#> 3 0.0 1.5
#> 4 2.5 1.5
#> 5 0.0 1.5
#> 6 0.0 1.0
#> 7 0.0 0.0
#> 8 0.0 0.0
#> 9 1.0 1.0
#> 10 0.0 0.0
#> 11 3.0 1.5
#> 12 0.0 1.5
#> 13 0.0 0.0
#> 14 0.0 0.0
An actually working Rcpp
function does indeed speed things up a little (running tests as in @LMc's post):
# @Andy Baxter
cpp <- function(){
df |>
mutate(b = iterate_to_cumsum(a, incr))
}
# @LMc
fun_factory <- function(){
df |>
mutate(b = map_dbl(a, incr_cap(1.5)))
}
bench::mark(
cpp(),
fun_factory()
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 cpp() 1.53ms 1.77ms 474. 2.15MB 8.00
#> 2 fun_factory() 116.79ms 124.31ms 8.15 786.48KB 19.6