Search code examples
rdatatableplyrxtszoo

Rolling mean every five months over three months


I would like to calculate rolling means, with the following specifications:

  • Start at the end of a given month, e.g. May
  • Use (daily) data from the last three months to calculate the mean over this period
  • Note: there can be missing values for some dates in a particular month and the number of days per months can vary, which makes the number of observations per calculation generally variable
  • Repeat this calculation by going forward 5 months, e.g. if in May was the last calculation, at the end of October, etc., so that the window is sliding by every 5 months and uses data for the last 3 respectively available months [Assuming data starts in March 2018, the first window would be: March-April-May 18, then Aug-Sept-Oct 18, etc.]
  • the size of the dataset/memory is important to me, as my real dataset is quite large

I searched a long time but I didn't find any clear solution, when the width parameter is variable and the window is sliding. I am especially looking for a solution in zoo. Also datatable and plyr (or xts) would be interesting for completion.

Sample data (note: There are no missing values here, because I cannot delete rows in datatable easily)

set.seed(44)  
dataset <- data.table(ID=c(rep("A",2208),rep("B",2208)),
x = c(rnorm(2208*2)), time=c(seq(as.Date("1988/03/15"),
as.Date("2000/04/16"), "day"),seq(as.Date("1988/03/15"),
as.Date("2000/04/16"), "day")))

The dataset contains data points 'x' for 2 individuals, A and B, which can be used to calculate the mean.


Solution

  • Below we use the data shown in the Note at the end, not the sample data in the question.

    1) 2 rollapply Create a year/month variable ym and then sum the values for each ID and year/month also count the number of values in each ID and year/month. Then take a rolling sum of the sums and divide that by the corresponding rolling sum of the counts doing that by ID.

    library(data.table)
    library(zoo)
    
    ym <- as.yearmon(dataset$time)
    roll <- function(x) rollapplyr(x, 3, by = 5, sum, fill = NA)
    ds <- na.omit(dataset[, list(x = sum(x), n = .N), by = list(ID, time = ym)][
     , list(time, mean = roll(x) / roll(n)), by = ID])
    

    giving:

    > ds
        ID     time         mean
     1:  A May 1988 -0.118017121
     2:  A Oct 1988 -0.045631016
     3:  A Mar 1989 -0.035498703
     4:  A Aug 1989 -0.055121507
     5:  A Jan 1990  0.018735210
     6:  A Jun 1990  0.091084791
     7:  A Nov 1990 -0.183955430
     8:  A Apr 1991  0.011909178
     9:  A Sep 1991 -0.040233435
    10:  A Feb 1992  0.051567634
    11:  A Jul 1992  0.006015941
    12:  A Dec 1992  0.253320798
    13:  A May 1993 -0.037722177
    14:  A Oct 1993 -0.145811906
    15:  A Mar 1994  0.134181429
    16:  A Aug 1994 -0.119081185
    17:  A Jan 1995  0.001921224
    18:  A Jun 1995  0.232193754
    19:  A Nov 1995 -0.077158954
    20:  A Apr 1996 -0.070271862
    21:  A Sep 1996  0.033858600
    22:  A Feb 1997 -0.053623676
    23:  A Jul 1997 -0.201388554
    24:  A Dec 1997  0.051488747
    25:  A May 1998 -0.073193772
    26:  A Oct 1998 -0.094019699
    27:  A Mar 1999 -0.078863959
    28:  A Aug 1999  0.110231533
    29:  A Jan 2000  0.141657202
    30:  B May 1988  0.130180515
    31:  B Oct 1988  0.025095818
    32:  B Mar 1989 -0.032415997
    33:  B Aug 1989  0.041286368
    34:  B Jan 1990  0.219208544
    35:  B Jun 1990 -0.023717715
    36:  B Nov 1990 -0.049073449
    37:  B Apr 1991 -0.051479646
    38:  B Sep 1991  0.124340203
    39:  B Feb 1992  0.040786822
    40:  B Jul 1992  0.019159682
    41:  B Dec 1992  0.083195470
    42:  B May 1993  0.006695704
    43:  B Oct 1993  0.119093846
    44:  B Mar 1994  0.077608445
    45:  B Aug 1994  0.132860266
    46:  B Jan 1995 -0.225050074
    47:  B Jun 1995 -0.091877628
    48:  B Nov 1995 -0.157798169
    49:  B Apr 1996 -0.219238136
    50:  B Sep 1996  0.289506566
    51:  B Feb 1997  0.118216626
    52:  B Jul 1997  0.186950994
    53:  B Dec 1997 -0.035447587
    54:  B May 1998 -0.159754318
    55:  B Oct 1998 -0.066470703
    56:  B Mar 1999  0.230782925
    57:  B Aug 1999 -0.052620748
    58:  B Jan 2000 -0.190938190
        ID     time         mean
    

    2) 1 rollapply A variation of the above is the following. It uses by.column = FALSE so that mean2 can handle both x and n at once.

    library(data.table)
    library(zoo)
    
    ym <- as.yearmon(dataset$time)
    mean2 <- function(xn) sum(xn[, 1]) / sum(xn[, 2])
    roll2 <- function(x) rollapplyr(x, 3, by = 5, mean2, by.column = FALSE, fill = NA)
    ds2 <- na.omit(dataset[, list(x = sum(x), n = .N), by = list(ID, time = ym)][
     , list(time, mean = roll2(.SD)), .SDcols = c("x", "n"), by = ID])
    

    3) vector width

    We can define a vector width and rollapply over that like this. We set the width to a number larger than the number of elements for those dates not at the end of the month in order that it not calculate a mean for those. We then calculate a mean for each end of month and in the last line of code subset it down to every 5 months.

    library(data.table)
    library(zoo)
    
    ds3 <- dataset[, list(ID, time = as.yearmon(time), x)][, 
      list(time, x, width = seq_len(.N) - match(time - 2/12, time) + 1,
           is_last = !duplicated(time, fromLast = TRUE)), by = ID][, 
      list(time, x, width = na.fill(ifelse(is_last, width, .N + 1), .N+1)), by = ID][, 
      list(time, mean = rollapplyr(x, width, mean, fill = NA_real_)), 
      by = ID][, na.omit(.SD)[seq(1, .N, 5), ], by = ID]
    

    4) data.table join This uses a data.table join instead of rollapply. eom is a data.table containing only the end of month rows. It also has a column time2 that represents the yearmon 2 months ago. We join that with datasetym and extract the appropriate rows and columns.

    library(data.table)
    library(zoo)
    
    datasetym <- dataset[, list(ID, time = as.yearmon(time), x)]
    eom <- datasetym[, .SD[!duplicated(time, fromLast = TRUE), ], by = ID][
      , cbind(.SD, time2 = time - 2/12)]
    ds4 <- datasetym[eom, list(mean = mean(x)), 
      on = .(ID, time >= time2, time <= time), by = .EACHI][
      , .SD[seq(3, .N, 5), -2], by = ID]
    

    5) sqldf You may prefer to use the more familiar SQL syntax to express the join. Creating datasetym and taking every 5th row are done as in (4).

    library(data.table)
    library(sqldf)
    library(zoo)
    
    datasetym <- dataset[, list(ID, time = as.yearmon(time), x)]
    s <- sqldf("select a.ID, a.time, avg(b.x) mean
           from (select ID, time from datasetym group by ID, time) a
           left join datasetym b
           on a.ID = b.ID and b.time between a.time - 2.0/12.0 and a.time
           group by a.ID, a.time")
    ds5 <- data.table(s)[, .SD[seq(3, .N, 5), ], by = ID]
    

    6) zoo We can solve this using only zoo if we use wide form. We can always convert back to long form afterwards if desired (as shown in commented out line).

    library(zoo)
    
    z <- read.zoo(dataset, index = "time", split = "ID")
    zsum <- aggregate(z, as.yearmon, sum)
    zlength <- aggregate(z, as.yearmon, length)
    zroll <- rollapplyr(zsum, 3, by = 5, sum) / rollapplyr(zlength, 3, by = 5, sum)
    # fortify(zroll, melt = TRUE)  # if long form wanted
    

    giving:

    > zroll
                        A            B
    May 1988 -0.118017121  0.130180515
    Oct 1988 -0.045631016  0.025095818
    Mar 1989 -0.035498703 -0.032415997
    Aug 1989 -0.055121507  0.041286368
    Jan 1990  0.018735210  0.219208544
    Jun 1990  0.091084791 -0.023717715
    Nov 1990 -0.183955430 -0.049073449
    Apr 1991  0.011909178 -0.051479646
    Sep 1991 -0.040233435  0.124340203
    Feb 1992  0.051567634  0.040786822
    Jul 1992  0.006015941  0.019159682
    Dec 1992  0.253320798  0.083195470
    May 1993 -0.037722177  0.006695704
    Oct 1993 -0.145811906  0.119093846
    Mar 1994  0.134181429  0.077608445
    Aug 1994 -0.119081185  0.132860266
    Jan 1995  0.001921224 -0.225050074
    Jun 1995  0.232193754 -0.091877628
    Nov 1995 -0.077158954 -0.157798169
    Apr 1996 -0.070271862 -0.219238136
    Sep 1996  0.033858600  0.289506566
    Feb 1997 -0.053623676  0.118216626
    Jul 1997 -0.201388554  0.186950994
    Dec 1997  0.051488747 -0.035447587
    May 1998 -0.073193772 -0.159754318
    Oct 1998 -0.094019699 -0.066470703
    Mar 1999 -0.078863959  0.230782925
    Aug 1999  0.110231533 -0.052620748
    Jan 2000  0.141657202 -0.190938190
    

    Note

    Note that dataset as defined in the question has 8832 rows but the vector used to define the ID column has only 4416 elements so it gets recycled with the result that the first 2216 dates wind up twice in A and not at all in B and the next 2216 dates wind up twice in B and not at all in A. Presumably that is not what was intended and we fix this up by replacing each occurrence of 2208 with 4416 in the definition of dataset so that each date appears once in A and once in B:

    set.seed(44)  
    dataset <- data.table(ID = c(rep("A", 4416), rep("B", 4416)),
      x = rnorm(4416 * 2), 
      time = c(seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day")))