Search code examples
rperformancedataframesapplyprocessing-efficiency

How do I optimize sapply in R to calculate running totals on a dataframe


I wrote a function in R to calculate cumulative totals by month number, but the execution time of my method grows exponentially as the dataset gets larger. I'm a novice R programmer, can you help me make this more efficient?
The function and the way I invoke the function:

accumulate <- function(recordnum,df){
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}
set.seed(42)
datalength = 10
df <- data.frame(measurement = runif(1:datalength),
                 subject=rep(c("dog","cat"),each =datalength/2),
                 month=rep(seq(datalength/2,1,by=-1)))
system.time(df$cumulative <- sapply(1:datalength,accumulate,df))

The input dataframe:

> df
   measurement subject month
1    0.4577418     dog     5
2    0.7191123     dog     4
3    0.9346722     dog     3
4    0.2554288     dog     2
5    0.4622928     dog     1
6    0.9400145     cat     5
7    0.9782264     cat     4
8    0.1174874     cat     3
9    0.4749971     cat     2
10   0.5603327     cat     1

The output dataframe:

> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648

Notice the cumulative column shows the accumulation of all measurements up to and including the current month. The function does not require the dataframe to be sorted. When the datalength equals 100, the elapsed time is 0.3. 1000 is 0.58. 10,000 = 27.72. I need this to run for 200K+ records.
Thanks!


Solution

  • This is non-destructive, i.e. the original df is not modified. No packages are used. The original order of the rows of df is preserved; however, if that is not important then [order(o), ] on the last line can be omitted.

    o <- order(df$subject, df$month)
    transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]
    

    giving:

       measurement subject month cumulative
    1   0.37955924     dog     5  2.2580530
    2   0.43577158     dog     4  1.8784938
    3   0.03743103     dog     3  1.4427222
    4   0.97353991     dog     2  1.4052912
    5   0.43175125     dog     1  0.4317512
    6   0.95757660     cat     5  4.0751151
    7   0.88775491     cat     4  3.1175385
    8   0.63997877     cat     3  2.2297836
    9   0.97096661     cat     2  1.5898048
    10  0.61883821     cat     1  0.6188382