I would like to calculate rolling means where the width parameter is variable and the window is sliding. I searched for a long time but didn't find any clear solution. I am especially looking for a solution in zoo; solutions in data.table and plyr (or xts) would also be interesting for completeness.
Sample data (note: there are no missing values here, because I cannot easily delete rows in data.table):
set.seed(44)
dataset <- data.table(ID = c(rep("A", 2208), rep("B", 2208)),
                      x = c(rnorm(2208 * 2)),
                      time = c(seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day"),
                               seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day")))
The dataset contains data points x for two individuals, A and B, over which the means are to be calculated.
Below we use the data shown in the Note at the end, not the sample data in the question.
1) 2 rollapply Create a year/month variable ym, then sum the values for each ID and year/month and also count the number of values in each ID and year/month. Then take a rolling sum of the sums and divide it by the corresponding rolling sum of the counts, doing this by ID.
library(data.table)
library(zoo)
ym <- as.yearmon(dataset$time)
roll <- function(x) rollapplyr(x, 3, by = 5, sum, fill = NA)
ds <- na.omit(dataset[, list(x = sum(x), n = .N), by = list(ID, time = ym)][
, list(time, mean = roll(x) / roll(n)), by = ID])
giving:
> ds
ID time mean
1: A May 1988 -0.118017121
2: A Oct 1988 -0.045631016
3: A Mar 1989 -0.035498703
4: A Aug 1989 -0.055121507
5: A Jan 1990 0.018735210
6: A Jun 1990 0.091084791
7: A Nov 1990 -0.183955430
8: A Apr 1991 0.011909178
9: A Sep 1991 -0.040233435
10: A Feb 1992 0.051567634
11: A Jul 1992 0.006015941
12: A Dec 1992 0.253320798
13: A May 1993 -0.037722177
14: A Oct 1993 -0.145811906
15: A Mar 1994 0.134181429
16: A Aug 1994 -0.119081185
17: A Jan 1995 0.001921224
18: A Jun 1995 0.232193754
19: A Nov 1995 -0.077158954
20: A Apr 1996 -0.070271862
21: A Sep 1996 0.033858600
22: A Feb 1997 -0.053623676
23: A Jul 1997 -0.201388554
24: A Dec 1997 0.051488747
25: A May 1998 -0.073193772
26: A Oct 1998 -0.094019699
27: A Mar 1999 -0.078863959
28: A Aug 1999 0.110231533
29: A Jan 2000 0.141657202
30: B May 1988 0.130180515
31: B Oct 1988 0.025095818
32: B Mar 1989 -0.032415997
33: B Aug 1989 0.041286368
34: B Jan 1990 0.219208544
35: B Jun 1990 -0.023717715
36: B Nov 1990 -0.049073449
37: B Apr 1991 -0.051479646
38: B Sep 1991 0.124340203
39: B Feb 1992 0.040786822
40: B Jul 1992 0.019159682
41: B Dec 1992 0.083195470
42: B May 1993 0.006695704
43: B Oct 1993 0.119093846
44: B Mar 1994 0.077608445
45: B Aug 1994 0.132860266
46: B Jan 1995 -0.225050074
47: B Jun 1995 -0.091877628
48: B Nov 1995 -0.157798169
49: B Apr 1996 -0.219238136
50: B Sep 1996 0.289506566
51: B Feb 1997 0.118216626
52: B Jul 1997 0.186950994
53: B Dec 1997 -0.035447587
54: B May 1998 -0.159754318
55: B Oct 1998 -0.066470703
56: B Mar 1999 0.230782925
57: B Aug 1999 -0.052620748
58: B Jan 2000 -0.190938190
ID time mean
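The effect of the by = 5 argument in roll can be seen in isolation on a toy vector (illustrative numbers only, not the question's data): with a right-aligned window of width 3, the function is evaluated only at every 5th complete window and fill = NA keeps the original length.

```r
library(zoo)

# width 3, right-aligned, evaluated at every 5th point; fill = NA pads the rest
rollapplyr(1:10, 3, by = 5, sum, fill = NA)
# sums appear only at positions 3 and 8 (the 1st and 6th complete windows);
# every other position is NA, which is why na.omit is applied above
```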
2) 1 rollapply A variation of the above is the following. It uses by.column = FALSE so that mean2 can handle both x and n at once.
library(data.table)
library(zoo)
ym <- as.yearmon(dataset$time)
mean2 <- function(xn) sum(xn[, 1]) / sum(xn[, 2])
roll2 <- function(x) rollapplyr(x, 3, by = 5, mean2, by.column = FALSE, fill = NA)
ds2 <- na.omit(dataset[, list(x = sum(x), n = .N), by = list(ID, time = ym)][
, list(time, mean = roll2(.SD)), .SDcols = c("x", "n"), by = ID])
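What by.column = FALSE does can be seen on a toy matrix (made-up numbers, purely for illustration): instead of applying the function to each column separately, rollapply passes the whole window, a submatrix containing both columns, to the function at once.

```r
library(zoo)

m <- cbind(x = c(1, 2, 3, 4), n = c(1, 1, 2, 1))
# each window w is a 2-row matrix holding both the x and the n column
rollapplyr(m, 2, function(w) sum(w[, 1]) / sum(w[, 2]),
           by.column = FALSE, fill = NA)
# e.g. at position 2 this gives (1 + 2) / (1 + 1) = 1.5
```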
3) vector width We can define a vector of widths and rollapply over it as follows. For dates that are not at the end of the month we set the width to a number larger than the number of elements, so that no mean is calculated for them. We then calculate a mean for each end of month and, in the last line of code, subset the result down to every 5 months.
library(data.table)
library(zoo)
ds3 <- dataset[, list(ID, time = as.yearmon(time), x)][,
list(time, x, width = seq_len(.N) - match(time - 2/12, time) + 1,
is_last = !duplicated(time, fromLast = TRUE)), by = ID][,
list(time, x, width = na.fill(ifelse(is_last, width, .N + 1), .N+1)), by = ID][,
list(time, mean = rollapplyr(x, width, mean, fill = NA_real_)),
by = ID][, na.omit(.SD)[seq(1, .N, 5), ], by = ID]
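The vector form of width used here can be illustrated on its own (a toy sketch, not part of the solution): when width is a plain numeric vector, element i gives the length of the window ending at position i.

```r
library(zoo)

# the window ending at position i has length width[i]
rollapplyr(1:6, c(1, 1, 3, 1, 1, 3), sum)
# positions 3 and 6 sum the last three values (1+2+3 = 6 and 4+5+6 = 15);
# the remaining positions use width 1 and simply return their own value
```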
4) data.table join This uses a data.table join instead of rollapply. eom is a data.table containing only the end-of-month rows; it also has a column time2 that represents the yearmon two months earlier. We join that with datasetym and extract the appropriate rows and columns.
library(data.table)
library(zoo)
datasetym <- dataset[, list(ID, time = as.yearmon(time), x)]
eom <- datasetym[, .SD[!duplicated(time, fromLast = TRUE), ], by = ID][
, cbind(.SD, time2 = time - 2/12)]
ds4 <- datasetym[eom, list(mean = mean(x)),
on = .(ID, time >= time2, time <= time), by = .EACHI][
, .SD[seq(3, .N, 5), -2], by = ID]
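The non-equi join with by = .EACHI that drives this can be seen on a tiny example (toy data; the column names lo and hi are made up for illustration): for each row of the inner table, the rows of the outer table whose time falls in the window are grouped and averaged.

```r
library(data.table)

a <- data.table(ID = "A", time = 1:5, x = c(10, 20, 30, 40, 50))
w <- data.table(ID = "A", lo = 2, hi = 4)  # one window per row of w
# by = .EACHI groups per row of w; x refers to the matching rows of a
a[w, list(mean = mean(x)), on = .(ID, time >= lo, time <= hi), by = .EACHI]
# mean = (20 + 30 + 40) / 3 = 30
```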
5) sqldf You may prefer to use the more familiar SQL syntax to express the join. Creating datasetym and taking every 5th row are done as in (4).
library(data.table)
library(sqldf)
library(zoo)
datasetym <- dataset[, list(ID, time = as.yearmon(time), x)]
s <- sqldf("select a.ID, a.time, avg(b.x) mean
from (select ID, time from datasetym group by ID, time) a
left join datasetym b
on a.ID = b.ID and b.time between a.time - 2.0/12.0 and a.time
group by a.ID, a.time")
ds5 <- data.table(s)[, .SD[seq(3, .N, 5), ], by = ID]
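The between condition works because a "yearmon" is represented numerically as the year plus (month - 1)/12, so subtracting 2.0/12.0 moves back exactly two months; sqldf passes that numeric value through to the database. A quick check:

```r
library(zoo)

as.numeric(as.yearmon("2000-03"))  # 2000 + 2/12
as.yearmon("2000-03") - 2/12       # Jan 2000
```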
6) zoo We can solve this using only zoo if we use wide form. We can always convert back to long form afterwards if desired (as shown in the commented-out line).
library(zoo)
z <- read.zoo(dataset, index = "time", split = "ID")
zsum <- aggregate(z, as.yearmon, sum)
zlength <- aggregate(z, as.yearmon, length)
zroll <- rollapplyr(zsum, 3, by = 5, sum) / rollapplyr(zlength, 3, by = 5, sum)
# fortify(zroll, melt = TRUE) # if long form wanted
giving:
> zroll
A B
May 1988 -0.118017121 0.130180515
Oct 1988 -0.045631016 0.025095818
Mar 1989 -0.035498703 -0.032415997
Aug 1989 -0.055121507 0.041286368
Jan 1990 0.018735210 0.219208544
Jun 1990 0.091084791 -0.023717715
Nov 1990 -0.183955430 -0.049073449
Apr 1991 0.011909178 -0.051479646
Sep 1991 -0.040233435 0.124340203
Feb 1992 0.051567634 0.040786822
Jul 1992 0.006015941 0.019159682
Dec 1992 0.253320798 0.083195470
May 1993 -0.037722177 0.006695704
Oct 1993 -0.145811906 0.119093846
Mar 1994 0.134181429 0.077608445
Aug 1994 -0.119081185 0.132860266
Jan 1995 0.001921224 -0.225050074
Jun 1995 0.232193754 -0.091877628
Nov 1995 -0.077158954 -0.157798169
Apr 1996 -0.070271862 -0.219238136
Sep 1996 0.033858600 0.289506566
Feb 1997 -0.053623676 0.118216626
Jul 1997 -0.201388554 0.186950994
Dec 1997 0.051488747 -0.035447587
May 1998 -0.073193772 -0.159754318
Oct 1998 -0.094019699 -0.066470703
Mar 1999 -0.078863959 0.230782925
Aug 1999 0.110231533 -0.052620748
Jan 2000 0.141657202 -0.190938190
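The wide form that read.zoo(..., split = "ID") produces can be seen on a toy data frame (illustrative only): split spreads the series into one column per ID, indexed by time.

```r
library(zoo)

DF <- data.frame(time = rep(1:3, 2), ID = rep(c("A", "B"), each = 3), x = 1:6)
z <- read.zoo(DF, index = "time", split = "ID")
z
##   A B
## 1 1 4
## 2 2 5
## 3 3 6
```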
Note that dataset as defined in the question has 8832 rows, but the vector used to define the ID column has only 4416 elements, so it gets recycled. As a result the first 2208 dates wind up twice in A and not at all in B, and the next 2208 dates wind up twice in B and not at all in A. Presumably that is not what was intended, so we fix it by replacing each occurrence of 2208 with 4416 in the definition of dataset, so that each date appears once in A and once in B:
set.seed(44)
dataset <- data.table(ID = c(rep("A", 4416), rep("B", 4416)),
x = rnorm(4416 * 2),
time = c(seq(as.Date("1988/03/15"), as.Date("2000/04/16"), "day")))
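As an optional sanity check on the corrected definition (not part of the fix itself), each ID should now cover every date exactly once:

```r
library(data.table)

dataset[, .N, by = ID]                   # 4416 rows for each of A and B
dataset[, anyDuplicated(time), by = ID]  # 0 for both, i.e. no repeated dates
```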