Search code examples
rdata.tablecumsum

Cumulative sum across first element of groups in data.table


I have some time series temperature measurements taken at half hour intervals. I want to calculate an average cumulative growing degree days style metric. I am using the variable "datetime", but leaving out actual datetimes for simplicity's sake. Also don't worry about if this is actually the right calculation for growing degree days, it isn't. The following toy data emulate the challenge.

library(data.table)
#generate some approximate data.
dt<-data.table(datetime=seq(1,10.9, by=0.1),
           date=rep(1:10, each=10),
           T=floor(runif(100,1,10)))

Now I calculate a 'daily' average:

dt[,T_mean_daily:=mean(T), by=date]

Now what I want to do is calculate the cumulative sum of T_mean_daily and have it displayed in a new column but repeated for each 'datetime' on a date as with T_mean_daily. I am having some trouble visualizing that with cumsum. The final output would look like:

   datetime   date T T_mean_daily T_sum
  1:      1.0    1 4          5.6   5.6
  2:      1.1    1 6          5.6   5.6
  3:      1.2    1 9          5.6   5.6
  4:      1.3    1 7          5.6   5.6
  5:      1.4    1 3          5.6   5.6
  6:      1.5    1 8          5.6   5.6
  7:      1.6    1 3          5.6   5.6
  8:      1.7    1 7          5.6   5.6
  9:      1.8    1 8          5.6   5.6
 10:      1.9    1 1          5.6   5.6
 11:      2.0    2 2          3.6   9.2
 12:      2.1    2 5          3.6   9.2
 13:      2.2    2 4          3.6   9.2
 14:      2.3    2 1          3.6   9.2
 15:      2.4    2 9          3.6   9.2
 16:      2.5    2 5          3.6   9.2
 17:      2.6    2 2          3.6   9.2
 18:      2.7    2 5          3.6   9.2
 19:      2.8    2 2          3.6   9.2
 20:      2.9    2 1          3.6   9.2
 21:      3.0    3 1          5.9   15.1
 22:      3.1    3 4          5.9   15.1

Looking for the data.table solution. This is not the cumsum by group, I am looking for the cumsum of each first row or unique value across all groups.


Solution

  • here is another data.table approach...

    setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
    

    explanation

    Since we only need to use the first row of each date, we can select there rows using !duplicated(date) in the i of data.table. In the j, we can now calculate the cumulative sum of T_Mean_Daily.
    Now we are left with a column with the correct cumsum value on all first date-rows, and NA's in between, so use setnafill fo locf-fill in the value over the NA-rows in the T_sum-column.

    benchmarks

    set.seed(42)
    dt<-data.table(datetime=seq(1,10.9, by=0.1),
               date=rep(1:10, each=10),
               T=floor(runif(100,1,10)))
    dt[,T_mean_daily:=mean(T), by=date]
    microbenchmark::microbenchmark(
      r2evans = {
        test <- copy(dt)
        test[ test[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
        },
      wimpel = {
        test <- copy(dt)
        setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
      }
    )
    
    
    
    Unit: microseconds
        expr    min      lq     mean  median      uq    max neval cld
     r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1   100   b
      wimpel  425.4  437.45  465.313  451.75  485.35  608.3   100  a